These are representative sample records from Science.gov related to your search topic.
For comprehensive and current results, perform a real-time search at Science.gov.
1

Design Space Exploration for Massively Parallel Processor Arrays  

Microsoft Academic Search

In this paper, we describe an approach for the optimization of dedicated co-processors that are implemented either in hardware (ASIC) or configware (FPGA). Such massively parallel co-processors are typically part of a heterogeneous hardware/software system. Each co-processor is a massively parallel system consisting of an array of processing elements (PEs). In order to decide whether to map a computational

Frank Hannig; Jürgen Teich

2001-01-01

2

Titanic: a VLSI based content addressable parallel array processor  

SciTech Connect

A design is presented for a content addressable parallel array processor (CAPAP) which is both practical and feasible. Its practicality stems from an extensive program of research into real applications of content addressability and parallelism. The feasibility of the design stems from development under a set of conservative engineering constraints tied to limitations of VLSI technology. 1 ref.

Weems, C.; Levitan, S.; Foster, C.

1982-01-01

3

Parallel processing in a host plus multiple array processor system for radar  

NASA Technical Reports Server (NTRS)

Host plus multiple array processor architecture is demonstrated to yield a modular, fast, and cost-effective system for radar processing. Software methodology for programming such a system is developed. Parallel processing with pipelined data flow among the host, array processors, and discs is implemented. Theoretical analysis of performance is made and experimentally verified. The broad class of problems to which the architecture and methodology can be applied is indicated.

Barkan, B. Z.

1983-01-01

4

Seasat Synthetic-Aperture Radar Data Reduction Using Parallel Programmable Array Processors  

Microsoft Academic Search

This paper presents a digital signal processing system that produces the SEASAT synthetic-aperture radar (SAR) imagery. The system consists of a SEL 32/77 host minicomputer and three AP-120B array processors. The partitioning of the SAR processing functions and the design of software modules are described. The rationale for selecting the parallel array processor architecture and the methodology for developing the

Chialin Wu; Budak Barkan; Walter J. Karplus; Dennis Caswell

1982-01-01

5

Massively parallel processor computer  

NASA Technical Reports Server (NTRS)

An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

Fung, L. W. (inventor)

1983-01-01
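
The bit-slice operation described in this record can be illustrated with a small sketch. Assuming each processing element holds one bit of each operand per step, the Python below adds two arrays of integers one bit plane at a time, with every element following the same instruction sequence. The function names and the list-based layout are illustrative assumptions, not details of the apparatus itself.

```python
def to_bit_planes(values, width):
    """Split integers into bit planes; plane b holds bit b of every element."""
    return [[(v >> b) & 1 for v in values] for b in range(width)]

def bit_serial_add(a_planes, b_planes):
    """Add two arrays plane by plane, as a SIMD array of 1-bit adders would."""
    carry = [0] * len(a_planes[0])
    out = []
    for a, b in zip(a_planes, b_planes):
        # Every element executes the same full-adder step on its own bits.
        out.append([x ^ y ^ c for x, y, c in zip(a, b, carry)])
        carry = [(x & y) | (c & (x ^ y)) for x, y, c in zip(a, b, carry)]
    out.append(carry)  # final carry becomes the top plane
    return out

def from_bit_planes(planes):
    """Reassemble integers from bit planes."""
    count = len(planes[0])
    return [sum(planes[b][i] << b for b in range(len(planes))) for i in range(count)]

sums = from_bit_planes(
    bit_serial_add(to_bit_planes([3, 5, 255], 8), to_bit_planes([4, 10, 1], 8))
)
```

The number of steps depends on the word width, not on the number of elements, which is why such an array favours very large, regular data sets.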

6

Computational cost of image registration with a parallel binary array processor  

SciTech Connect

The application of a simulated binary array processor (BAP) to the rapid analysis of a sequence of images has been studied. Several algorithms have been developed which may be implemented on many existing parallel processing machines. The characteristic operations of a BAP are discussed and analyzed. A set of preprocessing algorithms is described which is designed to register two images of TV-type video data in real time. These algorithms illustrate the potential uses of a BAP and their cost is analyzed in detail. The results of applying these algorithms to FLIR data and to noisy optical data are given. An analysis of these algorithms illustrates the importance of efficient global feature extraction hardware for image understanding applications. 16 references.

Reeves, A.P.; Rostampour, A.

1982-07-01

7

Parallel Parsing of Context-Free Languages on an Array of Processors   

E-print Network

Kosaraju [Kosaraju 69] and, independently ten years later, Guibas, Kung and Thompson [Guibas 79] devised an algorithm (K-GKT) for solving on an array of processors a class of dynamic programming problems of which general ...

Langlois, Laurent Chevalier.

1988-01-01

8

Image processing system architecture using parallel arrays of digital signal processors  

NASA Astrophysics Data System (ADS)

The paper describes the requirements of a high definition, high speed image processing system. Different types of parallel architectures were considered for the system. Advantages and limitations of SIMD and MIMD architectures are briefly discussed for image processing applications. A parallel image processing system based on MIMD architecture has been developed using multiple digital signal processors which can communicate with each other through an interconnection network. Texas Instruments TMS320C40 digital signal processors have been selected because they have a powerful floating point CPU supported by fast parallel communication ports, a DMA coprocessor and two memory interfaces. A five processor system is described in the paper. The EISA bus is used as the host interface and VISION bus is used to transfer images between the processors. The system is being used for automated non-contact inspection in which electro-optic signals are processed to identify manufacturing problems.

Kshirsagar, Shirish P.; Hobson, Clifford A.; Hartley, David A.; Harvey, David M.

1993-10-01

9

Spaceborne Processor Array  

NASA Technical Reports Server (NTRS)

A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

2008-01-01

10

Optical systolic array processor using residue arithmetic  

NASA Technical Reports Server (NTRS)

The use of residue arithmetic to increase the accuracy and reduce the dynamic range requirements of optical matrix-vector processors is evaluated. It is determined that matrix-vector operations and iterative algorithms can be performed totally in residue notation. A new parallel residue quantizer circuit is developed which significantly improves the performance of the systolic array feedback processor. Results are presented of a computer simulation of this system used to solve a set of three simultaneous equations.

Jackson, J.; Casasent, D.

1983-01-01
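
As an illustration of performing a matrix-vector operation "totally in residue notation", the following Python sketch computes the product separately in each modulus channel and only converts back to an ordinary integer at the end, using the Chinese remainder theorem. The moduli (5, 7, 9, 11) and all function names are assumptions chosen for the example; the record does not state which moduli the optical processor used.

```python
from math import prod

MODULI = (5, 7, 9, 11)  # pairwise coprime; dynamic range 5*7*9*11 = 3465 (assumed)

def to_residues(value):
    """Encode an integer as one small residue per modulus."""
    return tuple(value % m for m in MODULI)

def matvec_residue(matrix, vector):
    """Matrix-vector product with each modulus channel computed independently."""
    mat_r = [[to_residues(a) for a in row] for row in matrix]
    vec_r = [to_residues(x) for x in vector]
    result = []
    for row in mat_r:
        channels = []
        for ch, m in enumerate(MODULI):
            # No carries cross between channels, so each stays in a small range.
            channels.append(sum(a[ch] * x[ch] for a, x in zip(row, vec_r)) % m)
        result.append(tuple(channels))
    return result

def from_residues(residues):
    """Chinese remainder theorem reconstruction of the integer result."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)
    return total % big_m

product = [from_residues(r) for r in matvec_residue([[1, 2, 3], [4, 5, 6]], [7, 8, 9])]
```

The result is exact only while the true values stay below the product of the moduli, which is the dynamic-range trade-off the abstract refers to.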

11

An inner product processor design using novel parallel counter circuits  

Microsoft Academic Search

This paper presents a novel parallel inner product processor architecture. The proposed processor has the following features: (1) it can be easily reconfigured for computing inner products of input arrays with four or more types of structures. Typically, each input array may contain 64 8-bit items, or 16 16-bit items, or 4 32-bit items, or 1 64-bit item, with items

Rong Lin; A. S. Botha; K. E. Kerr; G. A. Brown

1999-01-01

12

Parallel processor engine model program  

NASA Technical Reports Server (NTRS)

The Parallel Processor Engine Model Program is a generalized engineering tool intended to aid in the design of parallel processing real-time simulations of turbofan engines. It is written in the FORTRAN programming language and executes as a subset of the SOAPP simulation system. Input/output and execution control are provided by SOAPP; however, the analysis, emulation and simulation functions are completely self-contained. A framework in which a wide variety of parallel processing architectures could be evaluated and tools with which the parallel implementation of a real-time simulation technique could be assessed are provided.

Mclaughlin, P.

1984-01-01

13

Parallel Analog-to-Digital Image Processor  

NASA Technical Reports Server (NTRS)

Proposed integrated-circuit network of many identical units converts analog outputs of imaging arrays of x-ray or infrared detectors to digital outputs. Converter located near imaging detectors, within cryogenic detector package. Because converter output digital, lends itself well to multiplexing and to postprocessing for correction of gain and offset errors peculiar to each picture element and its sampling and conversion circuits. Analog-to-digital image processor is massively parallel system for processing data from array of photodetectors. System built as compact integrated circuit located near focal plane. Buffer amplifier for each picture element has different offset.

Lokerson, D. C.

1987-01-01

14

The Use of a Microcomputer Based Array Processor for Real Time Laser Velocimeter Data Processing  

NASA Technical Reports Server (NTRS)

The application of an array processor to laser velocimeter data processing is presented. The hardware is described along with the method of parallel programming required by the array processor. A portion of the data processing program is described in detail. The increase in computational speed of a microcomputer equipped with an array processor is illustrated by comparative testing with a minicomputer.

Meyers, James F.

1990-01-01

15

Adaptively Parallel Processor Allocation for Cilk Jobs  

E-print Network

The problem of allocating processor resources fairly and efficiently to parallel jobs has been studied extensively in the past. Most of this work, however, assumes that the instantaneous parallelism of the jobs is known ...

Sen, Siddhartha

16

Computing Flow Transition On Parallel Processors  

NASA Technical Reports Server (NTRS)

Parallel algorithm developed on multiple-microprocessor computer. Program initiated to develop computer codes capable of directly simulating and mathematically modeling transition process at Mach numbers ranging from subsonic to hypersonic. Parallel computers potentially offer reduction of processing time; processing time inversely proportional to number of available processors.

Bokhari, S.; Erlebacher, G.; Hussaini, M. Y.

1993-01-01

17

Random number generators for MIMD parallel processors  

SciTech Connect

The authors discuss and analyze issues related to the design of pseudorandom number generators (prn's) for MIMD (multiple instruction stream/multiple data stream) parallel processors, which are very well suited to Monte Carlo calculations. They are concerned with ensuring reproducibility of runs, providing very long sequences, and assuring an adequate degree of independence of the parallel streams.

Percus, O.E.; Kalos, M.H. (New York Univ., NY (USA))

1989-01-01
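
One well-known way to obtain reproducible parallel streams is the "leapfrog" scheme, sketched below in Python for a multiplicative congruential generator. This is offered only as an illustration of the reproducibility concern raised in the record; it is not claimed to be the generator the authors propose, and the multiplier, modulus, and function names are assumptions. Processor `rank` of `nprocs` takes every `nprocs`-th value of the serial sequence, so interleaving the streams reproduces a single-processor run exactly.

```python
A = 16807          # Lehmer "minimal standard" multiplier (assumed for the example)
M = 2**31 - 1      # prime modulus

def serial_stream(seed, count):
    """Reference sequence x_{n+1} = A * x_n mod M on a single processor."""
    x, out = seed, []
    for _ in range(count):
        x = (A * x) % M
        out.append(x)
    return out

def leapfrog_stream(seed, rank, nprocs, count):
    """Values rank, rank + nprocs, rank + 2*nprocs, ... of the serial sequence."""
    x = (pow(A, rank + 1, M) * seed) % M   # first value owned by this processor
    jump = pow(A, nprocs, M)               # advances nprocs serial steps at once
    out = []
    for _ in range(count):
        out.append(x)
        x = (jump * x) % M
    return out

streams = [leapfrog_stream(12345, r, 4, 5) for r in range(4)]
interleaved = [streams[i % 4][i // 4] for i in range(20)]
```

Leapfrogging guarantees reproducibility but says nothing by itself about the statistical independence of the streams, which must be assessed separately.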

18

Random Number Generators for MIMD Parallel Processors  

Microsoft Academic Search

The authors discuss and analyze issues related to the design of pseudorandom number generators (prn's) for MIMD (multiple instruction stream/multiple data stream) parallel processors, which are very well suited to Monte Carlo calculations. They are concerned with ensuring reproducibility of runs, providing very long sequences, and assuring an adequate degree of independence of the parallel streams.

Ora E. Percus; Malvin H. Kalos

1989-01-01

19

Parallel processor for fast event analysis  

SciTech Connect

Current maximum data rates from the Spin Spectrometer of approx. 5000 events/s (up to 1.3 MBytes/s) and minimum analysis requiring at least 3000 operations/event require a CPU cycle time near 70 ns. In order to achieve an effective cycle time of 70 ns, a parallel processing device is proposed where up to 4 independent processors will be implemented in parallel. The individual processors are designed around the Am2910 Microsequencer, the Am29116 µP, and the Am29517 Multiplier. Satellite histogramming in a mass memory system will be managed by a commercial 16-bit µP system.

Hensley, D.C.

1983-01-01
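
The cycle-time figure quoted in this record follows from simple arithmetic, worked through in the Python sketch below. It assumes one operation completes per cycle and that events are spread evenly over the processors; both are simplifying assumptions made here, and the function names are illustrative.

```python
def required_cycle_ns(events_per_s, ops_per_event):
    """Cycle time in ns needed if one operation completes per cycle."""
    return 1e9 / (events_per_s * ops_per_event)

def per_processor_cycle_ns(effective_ns, n_processors):
    """Cycle time each processor may have if events are shared evenly."""
    return effective_ns * n_processors

# 5000 events/s * 3000 operations/event = 15 million operations/s.
single = required_cycle_ns(5000, 3000)     # roughly 66.7 ns, i.e. "near 70 ns"
each = per_processor_cycle_ns(70.0, 4)     # 280 ns per processor with 4 in parallel
```

Under these assumptions, four processors with a cycle time of about 280 ns each would match the effective 70 ns target.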

20

Parallel processor programs in the Federal Government  

NASA Technical Reports Server (NTRS)

In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

1985-01-01

21

Parallel processor for fast event analysis  

Microsoft Academic Search

Current maximum data rates from the Spin Spectrometer of approx. 5000 events/s (up to 1.3 MBytes/s) and minimum analysis requiring at least 3000 operations/event require a CPU cycle time near 70 ns. In order to achieve an effective cycle time of 70 ns, a parallel processing device is proposed where up to 4 independent processors will be implemented in parallel.

D. C. Hensley

1983-01-01

22

Parallel processor for the spin spectrometer  

Microsoft Academic Search

Current maximum data rates from the Spin Spectrometer of about 5000 events/s (up to 1.3 MBytes/s) and minimum analysis requiring at least 3000 operations/event require a CPU cycle time near 70 ns. In order to achieve an effective cycle time of 70 ns, a parallel processing device is proposed. Up to 4 independent processors will be implemented in parallel in

D. C. Hensley

1983-01-01

23

Assignment Of Finite Elements To Parallel Processors  

NASA Technical Reports Server (NTRS)

Elements assigned approximately optimally to subdomains. Mapping algorithm based on simulated-annealing concept used to minimize approximate time required to perform finite-element computation on hypercube computer or other network of parallel data processors. Mapping algorithm needed when shape of domain complicated or otherwise not obvious what allocation of elements to subdomains minimizes cost of computation.

Salama, Moktar A.; Flower, Jon W.; Otto, Steve W.

1990-01-01

24

Large integer multiplication on massively parallel processors  

Microsoft Academic Search

Results obtained by multiplying large integers using the Fermat number transform are presented. The effectiveness of the approach was previously limited by word-length constraints, which are not a factor with many new computer architectures. A convolution algorithm on a massively parallel processor, based on the Fermat number transform, is presented. Examples of the tradeoffs between modulus, interprocessor communication steps, and

Barry S. Fagin

1990-01-01
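
To make the idea of multiplying integers by convolution under a Fermat number transform concrete, here is a minimal Python sketch. It assumes the modulus F4 = 2^16 + 1 = 65537, a transform length of 32, and the root 2 (which has order 32 modulo F4, so a hardware version could replace multiplications by shifts). These parameters, the base-16 digit split, and the function names are choices made for this example, not details taken from the paper, and the transform is written in the direct quadratic form rather than as a fast algorithm.

```python
P = 2**16 + 1   # Fermat number F4 (prime)
N = 32          # transform length
W = 2           # 2**16 = -1 (mod P), so 2 has multiplicative order 32

def fnt(seq, root):
    """Length-N number-theoretic transform modulo P (direct O(N^2) form)."""
    return [sum(seq[j] * pow(root, j * k, P) for j in range(N)) % P for k in range(N)]

def digits16(value):
    """Sixteen base-16 digits, least significant first, zero-padded to length N."""
    return [(value >> (4 * i)) & 0xF for i in range(16)] + [0] * (N - 16)

def multiply(x, y):
    """Multiply two integers below 2**64 via cyclic convolution modulo P."""
    assert 0 <= x < 2**64 and 0 <= y < 2**64
    fx, fy = fnt(digits16(x), W), fnt(digits16(y), W)
    pointwise = [(a * b) % P for a, b in zip(fx, fy)]
    inv_n = pow(N, -1, P)
    conv = [(c * inv_n) % P for c in fnt(pointwise, pow(W, -1, P))]
    # Each convolution term is at most 16 * 15 * 15 = 3600 < P, so it is exact.
    return sum(c << (4 * i) for i, c in enumerate(conv))

check = multiply(0xFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0)
```

The bound in the last comment shows the word-length constraint the abstract mentions: larger digits or longer operands need a larger modulus or more transform points.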

25

Static and Dynamic Processor Scheduling Disciplines in Heterogeneous Parallel Architectures  

Microsoft Academic Search

Most parallel jobs cannot be fully parallelized. In a homogeneous parallel machine (one in which all processors are identical), the serial fraction of the computation has to be executed at the speed of any of the identical processors, limiting the speedup that can be obtained due to parallelism. In a heterogeneous architecture, the sequential bottleneck can be greatly reduced by running the

D. A. Menasce; D. Saha; S. C. D. Porto; V. A. F. Almeida; S. K. Tripathi

1995-01-01
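
The serial-fraction limit described in this record is the familiar Amdahl's-law bound, and a short Python sketch makes the contrast visible. The heterogeneous case below assumes, as a simplification introduced here, that the serial part runs on one processor `fast_ratio` times faster than the others while the parallel part runs on `n` unit-speed processors; the abstract is truncated, so this model is an assumption rather than the authors' formulation.

```python
def speedup_homogeneous(serial_fraction, n):
    """Amdahl's law: n identical unit-speed processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def speedup_heterogeneous(serial_fraction, n, fast_ratio):
    """Serial part on a processor fast_ratio times faster (assumed model)."""
    return 1.0 / (serial_fraction / fast_ratio + (1.0 - serial_fraction) / n)

homogeneous = speedup_homogeneous(0.1, 10)          # 1 / (0.1 + 0.09)
heterogeneous = speedup_heterogeneous(0.1, 10, 4)   # 1 / (0.025 + 0.09)
```

With a 10% serial fraction, speeding up only the serial part by a factor of four raises the speedup from about 5.3 to about 8.7 in this model.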

26

Associative massively parallel processor for video processing  

NASA Astrophysics Data System (ADS)

Massively parallel processing architectures have matured primarily through image processing and computer vision applications. The similarity of processing requirements between these areas and video processing suggests that they should be very appropriate for video processing applications. This research describes the use of an associative massively parallel processing based system for video compression which includes architectural and system description, discussion of the implementation of compression tasks such as DCT/IDCT, Motion Estimation and Quantization and system evaluation. The core of the processing system is the ASP (Associative String Processor) architecture, a modular massively parallel, programmable and inherently fault-tolerant fine-grain SIMD processing architecture incorporating a string of identical APEs (Associative Processing Elements), a reconfigurable inter-processor communication network and a Vector Data Buffer for fully-overlapped data input-output. For video compression applications a prototype system is developed, which uses ASP modules to implement the required compression tasks. This scheme leads to a linear speed up of the computation by simply adding more APEs to the modules.

Krikelis, Argy; Tawiah, T.

1996-03-01

27

Efficient searching and sorting applications using an associative array processor  

NASA Technical Reports Server (NTRS)

The purpose of this paper is to describe a method of searching and sorting data by using some of the unique capabilities of an associative array processor. To understand the application, the associative array processor is described in detail. In particular, the content addressable memory and flip network are discussed because these two unique elements give the associative array processor the power to rapidly sort and search. A simple alphanumeric sorting example is explained in hardware and software terms. The hardware used to explain the application is the STARAN (Goodyear Aerospace Corporation) associative array processor. The software used is the APPLE (Array Processor Programming Language) programming language. Some applications of the array processor are discussed. This summary tries to differentiate between the techniques of the sequential machine and the associative array processor.

Pace, W.; Quinn, M. J.

1978-01-01

28

Acceleration of computer-generated hologram by Greatly Reduced Array of Processor Element with Data Reduction  

NASA Astrophysics Data System (ADS)

We have implemented a computer-generated hologram (CGH) calculation on Greatly Reduced Array of Processor Element with Data Reduction (GRAPE-DR) processors. The cost of CGH calculation is enormous, but CGH calculation is well suited to parallel computation. The GRAPE-DR is a multicore processor that has 512 processor elements. The GRAPE-DR supports a double-precision floating-point operation and can perform CGH calculation with high accuracy. The calculation speed of the GRAPE-DR system is seven times faster than that of a personal computer with an Intel Core i7-950 processor.

Sugiyama, Atsushi; Masuda, Nobuyuki; Oikawa, Minoru; Okada, Naohisa; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

2014-11-01

29

Design of an array processing unit for an array processor  

Microsoft Academic Search

An effort was made to design a small and cheap array processing unit (APU) to be used in a major distributed process array processor system for digital signal-processing applications. From the task definition and environment analysis, a design is presented. The unit, based on the Texas TMS 32020 and Intel 80186 is of the load-and-forget type and provides efficient memory

E. D. S. Moreira; R. N. Zobel

1988-01-01

30

APEmille: a parallel processor in the teraflop range  

NASA Astrophysics Data System (ADS)

APEmille is a SIMD parallel processor under development at the Italian National Institute for Nuclear Physics (INFN). It is the third machine of the APE family, following Ape and Ape100 and delivering peak performance in the Tflops range. APEmille is very well suited for Lattice QCD applications, both for its hardware characteristics and for its software and language features. APEmille is an array of custom arithmetic processors arranged on a three-dimensional torus. The replicated processor is a pipelined VLIW device performing integer and single/double precision IEEE floating point operations. The processor is optimized for complex computations and has a peak performance of 528 Mflops at 66 MHz. Each replica has 8 Mbytes of locally addressable RAM. In principle an array of 2048 nodes is able to break the Tflops barrier. Two other custom processors are used for program flow control, global addressing and inter-node communications. Fast nearest neighbour communications as well as longer distance communications and data broadcast are available. APEmille is interfaced to the external world by a PCI interface and a HIPPI channel. A network of PCs acts as the host computer. The APE operating system and the cross compiler run on it. A powerful programming language named TAO is provided and is highly optimized for QCD. A C++ compiler is foreseen. The TAO language is as simple as Fortran but as powerful as object oriented languages. Specific data structures, operators and even statements can be defined by the user for each different application. Effort has been made to define the language constructs for QCD.

Panizzi, E.

1997-02-01

31

Image Rotation Correction With CORDIC Array Processor  

NASA Astrophysics Data System (ADS)

In a document analysis or understanding system [1,2], rotation of the document's image causes optical character recognition errors, and the document must then be scanned and recognized again. This degrades the performance of the automatic document input system. In this paper, we propose a method to estimate the unexpected rotation angle of the image, and we suggest using a pipelined CORDIC array processor architecture to rotate the image back quickly. The performance of the automatic document input system thereby increases.

Shyu, Keh-Hwa; Jeng, Bor-Shenn; Jou, I.-Chang; Ting, Pei-Yih

1988-10-01
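
For readers unfamiliar with CORDIC, the following Python sketch shows the standard rotation-mode iteration that such an array processor pipelines: each stage needs only a shift, an add, and a table lookup. It is a generic floating-point illustration, not the authors' hardware design; the iteration count and function name are assumptions, and the angle must lie within the CORDIC convergence range (about 1.74 rad in magnitude).

```python
import math

def cordic_rotate(x, y, angle, iterations=24):
    """Rotate (x, y) by angle radians using shift-and-add CORDIC steps."""
    table = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))  # accumulated scale factor
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0                # steer residual angle to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * table[i]
    return x / gain, y / gain                      # undo the constant gain

rx, ry = cordic_rotate(1.0, 0.0, math.pi / 6)
```

In a pipelined array, each iteration becomes one hardware stage, so one rotated pixel coordinate can emerge per clock once the pipeline is full.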

32

Breadboard Signal Processor for Arraying DSN Antennas  

NASA Technical Reports Server (NTRS)

A recently developed breadboard version of an advanced signal processor for arraying many antennas in NASA's Deep Space Network (DSN) can accept inputs in a 500-MHz-wide frequency band from six antennas. The next breadboard version is expected to accept inputs from 16 antennas, and a following developed version is expected to be designed according to an architecture that will be scalable to accept inputs from as many as 400 antennas. These and similar signal processors could also be used for combining multiple wide-band signals in non-DSN applications, including very-long-baseline interferometry and telecommunications. This signal processor performs functions of a wide-band FX correlator and a beam-forming signal combiner. [The term "FX" signifies that the digital samples of two given signals are fast Fourier transformed (F), then the fast Fourier transforms of the two signals are multiplied (X) prior to accumulation.] In this processor, the signals from the various antennas are broken up into channels in the frequency domain (see figure). In each frequency channel, the data from each antenna are correlated against the data from each other antenna; this is done for all antenna baselines (that is, for all antenna pairs). The results of the correlations are used to obtain calibration data to align the antenna signals in both phase and delay. Data from the various antenna frequency channels are also combined and calibration corrections are applied. The frequency-domain data thus combined are then synthesized back to the time domain for passing on to a telemetry receiver.

Jongeling, Andre; Sigman, Elliott; Chandra, Kumar; Trinh, Joseph; Soriano, Melissa; Navarro, Robert; Rogstad, Stephen; Goodhart, Charles; Proctor, Robert; Jourdan, Michael; Rayhrer, Benno

2008-01-01
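
The "FX" order of operations described in this record can be sketched for a single antenna pair in Python. The example transforms both signals (F), multiplies the conjugate of one spectrum by the other channel by channel (X), and then, as an illustrative extra step added here, inverse-transforms the cross-spectrum to read off the relative delay. The plain DFT, the 16-sample test signal, and the function names are assumptions for the sketch; a real correlator would use FFTs and accumulate over many blocks.

```python
import cmath

def dft(samples, inverse=False):
    """Direct discrete Fourier transform (an FFT would be used in practice)."""
    n = len(samples)
    sign = 1 if inverse else -1
    out = [
        sum(samples[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out

def fx_cross_spectrum(a, b):
    """F step on both inputs, then X step: conj(A[k]) * B[k] per channel."""
    return [x.conjugate() * y for x, y in zip(dft(a), dft(b))]

def estimate_delay(a, b):
    """Lag (in samples, circular) at which b best matches a."""
    corr = dft(fx_cross_spectrum(a, b), inverse=True)
    return max(range(len(corr)), key=lambda lag: corr[lag].real)

signal = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8, 9, -7, 9, 3]
delayed = [signal[(t - 3) % len(signal)] for t in range(len(signal))]
lag = estimate_delay(signal, delayed)
```

For N antennas the same per-channel multiplication is repeated for every pair, which is why the work grows with the number of baselines rather than with the number of antennas alone.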

33

Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor  

E-print Network

Thomas J. Le ... of Rochester have used a collection of BBN Butterfly(TM) Parallel Processors to conduct research in parallel ... with the Butterfly we have ported three compilers, developed five major and several minor library packages, built two ...

Scott, Michael L.

34

Chemical network problems solved on NASA/Goddard's massively parallel processor computer  

NASA Technical Reports Server (NTRS)

The single instruction stream, multiple data stream Massively Parallel Processor (MPP) unit consists of 16,384 bit serial arithmetic processors configured as a 128 x 128 array whose speed can exceed that of current supercomputers (Cyber 205). The applicability of the MPP for solving reaction network problems is presented and discussed, including the mapping of the calculation to the architecture, and CPU timing comparisons.

Cho, Seog Y.; Carmichael, Gregory R.

1987-01-01

35

Optoelectronic Array Processors with Applications in Machine Intelligence and Database Management.  

NASA Astrophysics Data System (ADS)

Many computational problems in machine intelligence and database management can be solved using simple array manipulations which are similar to those of linear algebra. The regularity of these operations allows efficient parallel algorithms to be executed on array processors, thereby satisfying demands for increased throughput. A progression of increasingly complex parallel array architectures is presented. Algorithms which exploit the properties of these architectures are presented for various applications. Both conventional and novel neural learning algorithms are mapped onto these architectures. Mathematical reductions of fuzzy inference mechanisms to simple vector operations are presented which allow extremely efficient parallel computations on array architectures. Algorithms which use outer product operations are presented for constraint satisfaction problems. Intelligent secondary storage interfaces based on the array architectures are shown to provide data reduction for relational database applications by executing a portion of the database query directly at the interface. Several optoelectronic array processor designs are described which allow efficient implementations of the array architectures. By introducing optical interconnections, area and delay penalties associated with very-large-scale-integration (VLSI) electronic systems are diminished, since the optoelectronic layout topologies are determined by various optical interconnection strategies rather than by planar wiring requirements. Several simple optical systems are presented and experimentally demonstrated. In particular, the optical transpose interconnection system is shown to support several array architectures. The advantages of optically interconnected array processors are shown to be significant for large array dimensions due to the fundamental incompatibility between these arrays and the planar nature of VLSI systems.
Optoelectronic technologies which support optically interconnected array processors are analyzed according to their effects on system performance. Optimal operational configurations of PLZT and multiple-quantum-well modulators are derived. Multiple-quantum-well modulators are shown to have limited fan-out capabilities due to saturation effects. Future research directions for optical interconnections and optoelectronic array processors are also postulated.

Marsden, Gary Colt

36

A Cellular Processor Array Simulation and Hardware Prototyping Tool  

E-print Network

... be a memory access bottleneck, such as vision chips [2], high-speed image processing [1][5], cellular neural ... David R. W. Barr and Piotr ... We present a software environment for the efficient simulation of cellular processor arrays (CPAs ...

Dudek, Piotr

37

MILP model for resource disruption in parallel processor system  

NASA Astrophysics Data System (ADS)

In this paper, we consider disruption in an unrelated parallel processor scheduling system. The disruption occurs due to a resource shortage in which one of the parallel processors breaks down during task allocation, which affects the initial scheduling plan. Our objective is to reschedule the original unrelated parallel processor schedule after the resource disruption so as to minimize the makespan. A mixed integer linear programming model is presented for the recovery scheduling that considers the post-disruption policy. We conduct a computational experiment with different stopping time limits to assess the performance of the model, using the CPLEX 12.1 solver in AIMMS 3.10 software.

Nordin, Syarifah Zyurina; Caccetta, Louis

2015-02-01

38

Bispectrum signal processing on HNC's SIMD numerical array processor (SNAP)  

SciTech Connect

Supercomputers and parallel processors are increasingly being applied to problems traditionally described as signal and image processing problems. The primary activities occurring in either processing area are detection, enhancement, and classification of signals embedded in additive noise. The bispectrum is a processing technique that can be used for improving the detection of signals in noise. It is an order N^2 operation performed over a two-dimensional frequency plane and, because of computational demands, has not been used much in practice. HNC has developed a commercially available SIMD Numerical Array Processor (SNAP) and implemented Tracor's computationally demanding bispectrum signal processing code as a submission for the Gordon Bell prize. The SNAP is a SIMD array of parallel processors connected in a linear ring. A SNAP system with 32 processors (SNAP-32) demonstrated a performance of over 7.5 gigaflops per million dollars.

Means, R.W.; Wallach, B.; Busby, D. [HNC, Inc., San Diego, CA (United States)]; Lengel, R.C. Jr. [Tracor Applied Sciences, Inc., Austin, TX (United States)]

1993-12-31
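
The order N^2 cost mentioned in the abstract above can be seen in a minimal sketch of a single-record bispectrum estimate, B(k1, k2) = X(k1) X(k2) conj(X(k1 + k2)). This is an illustration only, not the Tracor or HNC code: the function names `dft` and `bispectrum` are hypothetical, the frequency index is assumed to wrap modulo N, and practical estimators usually average over many data segments.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def bispectrum(x):
    """Single-record estimate B[k1][k2] = X[k1] * X[k2] * conj(X[k1 + k2])."""
    n = len(x)
    X = dft(x)
    # One product per (k1, k2) pair: N^2 entries over the two-dimensional plane.
    # Assumption: the sum frequency k1 + k2 wraps modulo n (periodic spectrum).
    return [[X[k1] * X[k2] * X[(k1 + k2) % n].conjugate() for k2 in range(n)]
            for k1 in range(n)]
```

Each of the N^2 entries is independent of the others, which is one reason the computation suits a SIMD array.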

39

High density packaging and interconnect of massively parallel image processors  

NASA Technical Reports Server (NTRS)

This paper presents conceptual designs for high density packaging of parallel processing systems. The systems fall into two categories: global memory systems where many processors are packaged into a stack, and distributed memory systems where a single processor and many memory chips are packaged into a stack. Thermal behavior and performance are discussed.

Carson, John C.; Indin, Ronald J.

1991-01-01

40

Parallel Catastrophe Modelling on a Cell Processor  

E-print Network

and winter storms. Catastrophe models compute consequences for single events and also compute a probabilistic Parallel Catastrophe Modelling on a Cell Processor Frank Dehne1 , Glenn Hickey2 , Andrew Rau for catastrophe modelling systems that can be achieved through parallelization on a Cell Processor. We studied

Rau-Chaplin, Andrew

41

Efficient event-driven simulation of parallel processor architectures  

Microsoft Academic Search

In this paper we present a new approach for generating high-speed optimized event-driven instruction set level simulators for adaptive massively parallel processor architectures. The simulator generator is part of a methodology for the systematic mapping, evaluation, and exploration of massively parallel processor architectures that are designed for special purpose applications in the world of embedded computers. The generation of high-speed

Alexey Kupriyanov; Dmitrij Kissler; Frank Hannig; Jürgen Teich

2007-01-01

42

DFT algorithms for bit-serial GaAs array processor architectures  

NASA Technical Reports Server (NTRS)

Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.

Mcmillan, Gary B.

1988-01-01

43

Discounts for dynamic programming with applications in VLSI processor arrays  

SciTech Connect

This dissertation introduces a method for transforming certain dynamic programming problems into ones that require less space and time to solve under the logarithmic cost criterion, an appropriate complexity measure for flexible word-length machines. The mapping is based on discounts that change the costs but not the identities of optimal policies. Under the proper circumstances, the structure present in the original problem is preserved in the image so that the functional equations of dynamic programming still apply. Practical value of the theory is illustrated by demonstrating that a previously published VLSI processor array can be made asymptotically smaller and faster. The second half of this work addresses issues that arise in parallel sequence comparison. The paradigm here is deoxyribonucleic acid (DNA), which may be considered a string over a four-character alphabet. It is shown how a number of popular sequence matching algorithms can be mapped onto linear arrays of processors. One of these, the Princeton Nucleic Acid Comparator (P-NAC), has been fabricated, tested, and found to work perfectly. Its efficient implementation is due entirely to an application of discounts; benchmark results prove that it is several hundred times faster than a minicomputer.

Lopresti, D.P.

1987-01-01

44

Dynamic processor self-scheduling for general parallel nested loops  

SciTech Connect

This paper proposes a processor self-scheduling scheme for general parallel nested loops in multiprocessor systems. Parallel loops usually constitute most of the execution time in scientific application programs. In a general parallel loop structure, parallel loops, serial loops, and If-Then-Else constructs are nested in an arbitrary order, and the execution time of the loop body may vary substantially from iteration to iteration. In the proposed scheme, programs are instrumented to allow processors to schedule loop iterations among themselves dynamically at run time without the involvement of the operating system.

Fang, Z. (Convex Computer Corp., Richardson TX (US)); Tang, P. (Dept. of Computer Science, Australia National Univ., Canberra ACT 2601 (AU)); Yew, P.C. (Illinois Univ., Urbana, IL (USA). Center for Supercomputing Research and Development); Zhu, C.Q. (Computer Center, Fudan Univ., Shanghai (CN))

1990-07-01
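
The core idea in the abstract above, processors claiming loop iterations among themselves at run time without operating-system involvement, can be sketched for a single parallel loop as follows. This is a simplified illustration under stated assumptions, not the authors' scheme: it does not handle nested serial loops or If-Then-Else constructs, the name `self_schedule` is hypothetical, and a lock stands in for a hardware fetch-and-add primitive.

```python
import threading

def self_schedule(num_iterations, num_workers, body):
    """Run body(i) for every i in range(num_iterations); workers claim indices themselves."""
    next_index = 0
    lock = threading.Lock()  # assumption: stands in for an atomic fetch-and-add

    def worker():
        nonlocal next_index
        while True:
            with lock:
                i = next_index      # claim the next unassigned iteration
                next_index += 1
            if i >= num_iterations:
                return              # no work left for this worker
            body(i)                 # iteration cost may vary freely

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each worker takes a new index only when it finishes the previous one, iterations with widely varying execution times are balanced without a static assignment.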

45

CMOS processor element for a fault-tolerant SVD array  

NASA Astrophysics Data System (ADS)

This paper describes the VLSI implementation of a CORDIC based processor element for use in a fault-reconfigurable systolic array to compute the singular value decomposition (SVD) of a matrix. The chip implements a time redundant fault tolerance scheme, which allows processors adjacent to a faulty processor to act as computation backup during the systolic idle time. Also, processors around a fault collaborate to reroute data around the faulty processor. This form of time redundancy is attractive when tolerance to a few faults needs to be achieved with little hardware overhead.

Kota, Kishore; Cavallaro, Joseph R.

1993-11-01
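
For readers unfamiliar with CORDIC, the following is a minimal floating-point sketch of the basic rotation mode that such processor elements build on. It is not the chip's algorithm: the SVD array applies two-sided rotations in fixed-point hardware, and the function name `cordic_rotate` and the iteration count are assumptions made here for illustration.

```python
import math

def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians (|angle| below about 1.74) with shift-and-add steps."""
    # Elementary rotation angles atan(2^-i) and the accumulated gain of all steps.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # In hardware the multiplications by 2^-i are plain right shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x / gain, y / gain
```

For example, `cordic_rotate(1.0, 0.0, math.pi / 4)` approximates `(cos(pi/4), sin(pi/4))`.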

46

Overtaking Vehicle Detection Method and Its Implementation Using IMAPCAR Highly Parallel Image Processor  

NASA Astrophysics Data System (ADS)

This paper describes the real-time implementation of a vision-based overtaking vehicle detection method for driver assistance systems using IMAPCAR, a highly parallel SIMD linear array processor. The implemented overtaking vehicle detection method is based on optical flows detected by block matching using SAD and detection of the flows' vanishing point. The implementation is done efficiently by taking advantage of the parallel SIMD architecture of IMAPCAR. As a result, video-rate (33 frames/s) implementation could be achieved.

Sakurai, Kazuyuki; Kyo, Shorin; Okazaki, Shin'ichiro
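
Block matching with the sum of absolute differences (SAD), the basis of the optical-flow step described above, can be sketched as an exhaustive search over a small window. This is a serial illustration under assumed parameters (block size, search radius, and the names `sad` and `block_flow` are hypothetical); it does not reproduce the IMAPCAR implementation or the vanishing-point stage.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
               for j in range(size) for i in range(size))

def block_flow(prev, curr, x, y, size=2, radius=2):
    """Return the (dx, dy) that minimises SAD for the block at (x, y) of prev."""
    h, w = len(curr), len(curr[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            nx, ny = x + dx, y + dy
            # Skip candidate blocks that fall outside the current frame.
            if 0 <= nx <= w - size and 0 <= ny <= h - size:
                cost = sad(prev, curr, x, y, nx, ny, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2]
```

Each block's search is independent of the others, which is what makes the method a natural fit for a SIMD linear array.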

47

Processor Self-Scheduling for Multiple-Nested Parallel Loops  

Microsoft Academic Search

Processor self-scheduling is a useful scheme in a multiprocessor system if the execution time of each iteration in a parallel loop is not known in advance and varies substantially, or if there are multiple nestings in parallel loops which makes static scheduling difficult and inefficient. By using efficient synchronization primitives, the operating system is not needed for loop scheduling. The

Peiyi Tang; Pen-chung Yew

1986-01-01

48

Processor-Efficient Parallel Computation of Polynomial Greatest Common Divisors  

E-print Network

Processor-Efficient Parallel Computation of Polynomial Greatest Common Divisors* Erich Kaltofen@cs.rpi.edu Preliminary Report (July 1, 1989) 1. Introduction We present a parallel algebraic PRAM algorithm that can scheme on an algebraic circuit of size = O(n !+1 log(n)) and depth = O(log(n)^2) This more general

Kaltofen, Erich

49

Reconfiguration schemes for fault-tolerant processor arrays  

NASA Astrophysics Data System (ADS)

This project addressed several aspects of the problem of designing highly-reliable dynamically reconfigurable processor arrays. The proposed work focused mainly on reconfiguration schemes required to implement fault-tolerant processor arrays. According to the original statement of work, the following complementary objectives were pursued: (1) a methodology for the design and evaluation of processor-switched arrays; (2) a methodology for the design and evaluation of multi-level hierarchically reconfigurable processor arrays; (3) a methodology for the design of fault-tolerant interconnection routers for processor arrays with decentralized routing control; and (4) algorithm reconfiguration strategies which, together with hardware reconfiguration schemes, can be used to achieve graceful degradation in processor arrays. The emphasis of the proposed research was on the development of optimal reconfiguration schemes for each of the above objectives by using mathematical and simulation tools. For this purpose, evaluation methods and adequate measures were also studied and developed. These measures include not only reliability but also joint measures of performance, hardware area and reliability.

Fortes, Jose A.

1992-10-01

50

Parallel processor-based raster graphics system architecture  

DOEpatents

An apparatus for generating raster graphics images from the graphics command stream includes a plurality of graphics processors connected in parallel, each adapted to receive any part of the graphics command stream for processing the command stream part into pixel data. The apparatus also includes a frame buffer for mapping the pixel data to pixel locations and an interconnection network for interconnecting the graphics processors to the frame buffer. Through the interconnection network, each graphics processor may access any part of the frame buffer concurrently with another graphics processor accessing any other part of the frame buffer. The plurality of graphics processors can thereby transmit concurrently pixel data to pixel locations in the frame buffer.

Littlefield, Richard J. (Seattle, WA)

1990-01-01

51

Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids  

DOEpatents

A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.

Chatterjee, Siddhartha (Yorktown Heights, NY); Gunnels, John A. (Brewster, NY)

2011-11-08

52

Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism  

Microsoft Academic Search

The technology to implement a single-chip node composed of 4 high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling multiple ALUs to exploit both instruction-level and inter-thread parallelism, by using compile time and runtime scheduling. The compiler statically schedules individual threads to discover available intra-thread instruction-level parallelism. The runtime scheduling mechanism interleaves threads, exploiting ...

Stephen W. Keckler; William J. Dally

1992-01-01

53

A parallel particle-in-cell model for the massively parallel processor  

NASA Technical Reports Server (NTRS)

The availability of the nearest-neighbor communication-incorporating Massively Parallel Processor has prompted the development of a two-dimensional, particle-in-cell algorithm which loads particles in a cell randomly onto a row of processors, filling only half of them with particles. Due to the simplification of communications among processors achieved in a row by the vacant processors and the random-particle sequence, the algorithm efficiently sorts particles and performs gather/scatter procedures for collecting charge density according to their cells. The algorithm calculates electric fields at the cells by FFT.

Lin, C. S.; Thring, A. L.; Koga, J.; Seiler, E. J.

1990-01-01

54

Real-time trajectory optimization on parallel processors  

NASA Technical Reports Server (NTRS)

A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on three example problems: the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32 nodes instead of 1 node to solve a 64-stage Goddard problem.

Psiaki, Mark L.

1993-01-01

55

Ring-array processor distribution topology for optical interconnects  

NASA Technical Reports Server (NTRS)

The existing linear and rectangular processor distribution topologies for optical interconnects, although promising in many respects, cannot solve problems such as clock skews, the lack of supporting elements for efficient optical implementation, etc. The use of a ring-array processor distribution topology, however, can overcome these problems. Here, a study of the ring-array topology is conducted with an aim of implementing various fast clock rate, high-performance, compact optical networks for digital electronic multiprocessor computers. Practical design issues are addressed. Some proof-of-principle experimental results are included.

Li, Yao; Ha, Berlin; Wang, Ting; Wang, Sunyu; Katz, A.; Lu, X. J.; Kanterakis, E.

1992-01-01

56

Staging memory for massively parallel processor  

NASA Technical Reports Server (NTRS)

The invention herein relates to a computer organization capable of rapidly processing extremely large volumes of data. A staging memory is provided having a main stager portion consisting of a large number of memory banks which are accessed in parallel to receive, store, and transfer data words simultaneous with each other. Substager portions interconnect with the main stager portion to match input and output data formats with the data format of the main stager portion. An address generator is coded for accessing the data banks for receiving or transferring the appropriate words. Input and output permutation networks arrange the lineal order of data into and out of the memory banks.

Batcher, Kenneth E. (Inventor)

1988-01-01

57

A taxonomy of reconfiguration techniques for fault-tolerant processor arrays  

SciTech Connect

The authors overview, characterize, and classify some typical reconfiguration schemes in light of a proposed taxonomy. This taxonomy can be used as a guide for future research in design and analysis of reconfiguration schemes. Studying how to evaluate fault-tolerant arrays and how to exploit application characteristics to achieve dependable computing are important complementary directions of research towards reliable processor-array design. A related research problem is that of functional reconfiguration, that is, learning how to configure the topology of a parallel system to implement a different function or run a different application. Important directions of research include how to apply or extend processor-array reconfiguration algorithms to other topologies and how to marry functional and fault-tolerance reconfiguration requirements and solutions. The Diogenes approach discussed in this article is a case where this goal is naturally achieved.

Chean, M. (Shell Development Co., Houston, TX (USA)); Fortes, J.A.B. (Purdue Univ., Lafayette, IN (USA))

1990-01-01

58

Low Power Multiple Object Tracking and Counting using a SCAMP Cellular Processor Array  

E-print Network

- contained hardware consists of a battery, an ARM Cortex-M3 co-processor, and the sensor/processor array and an ARM Cortex-M3 control processor [6], designed with power consumption in mind. The ARM is responsible Low Power Multiple Object Tracking and Counting using a SCAMP Cellular Processor Array David R. W

Dudek, Piotr

59

Pipeline and parallel-pipeline FFT processors for VLSI implementations  

SciTech Connect

In some signal processing applications, it is desirable to build very high performance fast Fourier transform (FFT) processors. To meet the performance requirements, these processors are typically highly pipelined. Until the advent of VLSI, it was not possible to build a single chip which could be used to construct pipeline FFT processors of a reasonable size. However, VLSI implementations have constraints which differ from those of discrete implementations, requiring another look at some of the typical FFT algorithms in the light of these constraints. In this paper, several methods for computing the FFT in hardware are reviewed. Pipeline structures for the Cooley-Tukey algorithm and the Good prime factor algorithm are presented. The various small base modules required for the construction of these processors are examined with VLSI implementations in mind. For prime bases, an algorithm due to Rader is used which is easier to implement in a pipeline than the minimum multiply algorithms of Winograd. The Winograd technique of centralizing the multiplies of several relatively prime bases is used to develop a pipeline which requires less hardware than pipelines based on the algorithms above. A notation is then presented which allows parallel-pipeline versions of FFT processors to be developed for all of these algorithms. These versions are well suited for use in VLSI implementations due to the efficient use of chip I/O bandwidth between the stages of the FFT algorithms.

Wold, E.H.; Despain, A.M.

1984-05-01
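
As background for the Cooley-Tukey structure mentioned above, a minimal recursive radix-2 decimation-in-time FFT is sketched below. It is a software illustration only, assuming a power-of-two length; it does not model the pipeline stages, the prime factor algorithm, or the Rader and Winograd modules discussed in the paper, and the function name `fft` is chosen here for illustration.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each recursion level corresponds to one column of butterflies, and in a pipeline design each such column would typically map to one hardware stage.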

60

Dynamic processor self-scheduling for general parallel nested loops  

SciTech Connect

In this paper we present a completely dynamic processor self-scheduling approach for general nested loops. General nested loops contain both parallel and serial loops, and the execution time of their iterations can vary widely. In our scheme, an instance of an innermost loop is considered to be a basic unit in a precedence graph. An instance is said to be active if it is ready to be executed. Completion of an instance will activate instances of other innermost loops. The basic concept in our scheme is to keep all processors busy as long as there exists any active instance. By effectively using the synchronization primitives provided in a parallel processing system, the overhead of our approach can be significantly reduced. We also present the data structure needed to represent a general nested loop in our self-scheduling scheme. 19 refs., 8 figs.

Fang, Zhixi; Tang, Peiyi; Yew, Pen-Chung; Zhu, Chuan-Qi

1987-09-10

61

Task and instruction scheduling in parallel multithreaded processors  

E-print Network

. A Conceptual Model II A PARALLEL MULTITHREADING PROCESSOR ARCHITECTURE. . A. Hardware Organization B. Simulation Methodology III TASK AND INSTRUCTION SCHEDULING . A. Existing Thread and Instruction Scheduling Policies . 1. Round Robin Thread... Fetch Policy 2. Instruction Count-based Policies 3. FCFS Instruction Issuing 4. Other Issue Heuristics B. New ICOUNT Based Issue Policies 1. REV ICOUNT 2. WEIGHTED ICOUNT C. Proposed Critical Path Priority Based Scheduling Policies . IV...

Mishra, Amitabh

1996-01-01

62

Breaking the data encryption standard using networks of evolutionary processors with parallel string rewriting rules  

Microsoft Academic Search

In this paper we introduce a biologically inspired distributed computing model called networks of evolutionary processors with parallel string rewriting rules (NEPPS), which is a variation of the hybrid networks of evolutionary processors introduced by Martin-Vide et al. Such a network contains simple processors that are located in the nodes of a virtual graph. Each processor has strings (each string

Ashish Choudhary; Kamala Krithivasan

2009-01-01

63

Feasibility of optically interconnected parallel processors using wavelength division multiplexing  

SciTech Connect

New national security demands require enhanced computing systems for nearly ab initio simulations of extremely complex systems and analyzing unprecedented quantities of remote sensing data. This computational performance is being sought using parallel processing systems, in which many less powerful processors are ganged together to achieve high aggregate performance. Such systems require increased capability to communicate information between individual processor and memory elements. As it is likely that the limited performance of today's electronic interconnects will prevent the system from achieving its ultimate performance, there is great interest in using fiber optic technology to improve interconnect communication. However, little information is available to quantify the requirements on fiber optical hardware technology for this application. Furthermore, we have sought to explore interconnect architectures that use the complete communication richness of the optical domain rather than using optics as a simple replacement for electronic interconnects. These considerations have led us to study the performance of a moderate size parallel processor with optical interconnects using multiple optical wavelengths. We quantify the bandwidth, latency, and concurrency requirements which allow a bus-type interconnect to achieve scalable computing performance using up to 256 nodes, each operating at GFLOP performance. Our key conclusion is that scalable performance, to approximately 150 GFLOPS, is achievable for several scientific codes using an optical bus with a small number of WDM channels (8 to 32), only one WDM channel received per node, and achievable optoelectronic bandwidth and latency requirements. 21 refs., 10 figs.

Deri, R.J.; De Groot, A.J.; Haigh, R.E.

1996-03-01

64

Optimal evaluation of array expressions on massively parallel machines  

NASA Technical Reports Server (NTRS)

We investigate the problem of evaluating FORTRAN 90 style array expressions on massively parallel distributed-memory machines. On such machines, an elementwise operation can be performed in constant time for arrays whose corresponding elements are in the same processor. If the arrays are not aligned in this manner, the cost of aligning them is part of the cost of evaluating the expression. The choice of where to perform the operation then affects this cost. We present algorithms based on dynamic programming to solve this problem efficiently for a wide variety of interconnection schemes, including multidimensional grids and rings, hypercubes, and fat-trees. We also consider expressions containing operations that change the shape of the arrays, and show that our approach extends naturally to handle this case.

Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Teng, Shang-Hua

1992-01-01

65

Analog parallel processor hardware for high speed pattern recognition  

NASA Technical Reports Server (NTRS)

A VLSI-based analog processor for fully parallel, associative, high-speed pattern matching is reported. The processor consists of two main components: an analog memory matrix for storage of a library of patterns, and a winner-take-all (WTA) circuit for selection of the stored pattern that best matches an input pattern. An inner product is generated between the input vector and each of the stored memories. The resulting values are applied to a WTA network for determination of the closest match. Patterns with up to 22 percent overlap are successfully classified with a WTA settling time of less than 10 microseconds. Applications such as star pattern recognition and mineral classification with bounded overlap patterns have been successfully demonstrated. This architecture has a potential for an overall pattern matching speed in excess of 10^9 bits per second for a large memory.

Daud, T.; Tawel, R.; Langenbacher, H.; Eberhardt, S. P.; Thakoor, A. P.

1990-01-01
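
The two-stage scheme described above, inner products against a stored library followed by winner-take-all selection, has a direct digital analogue, sketched here. This is only a functional illustration: the analog circuit computes all inner products simultaneously, and the name `best_match` and the bipolar (+1/-1) pattern encoding used in the usage note are assumptions.

```python
def best_match(library, probe):
    """Return the index of the stored pattern with the largest inner product with probe."""
    # Stage 1: one inner product per stored pattern (parallel in the analog memory matrix).
    scores = [sum(p * q for p, q in zip(pattern, probe)) for pattern in library]
    # Stage 2: winner-take-all, keeping only the index of the largest score.
    return max(range(len(scores)), key=scores.__getitem__)
```

With the library `[[1, -1, 1, -1], [1, 1, -1, -1], [-1, -1, 1, 1]]` and the probe `[1, 1, -1, 1]`, the scores are -2, 2 and -2, so the second pattern (index 1) is selected even though the probe differs from it in one position.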

66

Optimal mapping of irregular finite element domains to parallel processors  

NASA Technical Reports Server (NTRS)

A mapping of the solution domain of n finite elements into N subdomains that may be processed in parallel by N processors is optimal if the subdomain decomposition results in a well-balanced workload distribution among the processors. The problem is discussed in the context of irregular finite element domains as an important aspect of the efficient utilization of the capabilities of emerging multiprocessor computers. Finding the optimal mapping is an intractable combinatorial optimization problem, for which a satisfactory approximate solution is obtained here by analogy to a method used in statistical mechanics for simulating the annealing process in solids. The simulated annealing analogy and algorithm are described, and numerical results are given for mapping an irregular two-dimensional finite element domain containing a singularity onto the Hypercube computer.

Flower, J.; Otto, S.; Salama, M.

1987-01-01
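
A minimal sketch of simulated annealing applied to an element-to-processor mapping is given below. The cost function (sum of squared subdomain sizes plus a weighted count of element adjacencies cut by the partition), the linear cooling schedule, and the names `cost` and `anneal` are assumptions made for illustration; the record does not state the objective or schedule the authors used.

```python
import math
import random

def cost(assign, edges, num_procs, weight=1.0):
    """Assumed objective: load imbalance plus weighted number of cut adjacencies."""
    loads = [0] * num_procs
    for p in assign:
        loads[p] += 1
    cut = sum(1 for a, b in edges if assign[a] != assign[b])
    return sum(l * l for l in loads) + weight * cut

def anneal(num_elems, edges, num_procs, steps=20000, t0=2.0, seed=0):
    """Return (best_assignment, best_cost) found by single-element reassignment moves."""
    rng = random.Random(seed)
    assign = [rng.randrange(num_procs) for _ in range(num_elems)]
    current = cost(assign, edges, num_procs)
    best, best_cost = list(assign), current
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9   # linear cooling, kept positive
        e = rng.randrange(num_elems)
        old = assign[e]
        assign[e] = rng.randrange(num_procs)      # propose moving one element
        new = cost(assign, edges, num_procs)
        # Metropolis rule: always accept improvements, sometimes accept uphill moves.
        if new <= current or rng.random() < math.exp((current - new) / temp):
            current = new
            if new < best_cost:
                best, best_cost = list(assign), new
        else:
            assign[e] = old                       # reject and restore
    return best, best_cost
```

Accepting occasional uphill moves at high temperature is what lets the search escape poor local partitions before the schedule cools.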

67

Design and optimization of a defect tolerant processor array  

E-print Network

optimization based on simulation of the target applications with respect to the design parameters. The simulations are performed using Proteus [6]. The second part is to perform cell yield analysis and wafer reconfiguration analysis to determine the number.... 246-258. [5] D. M. H. Walker, "Soft-Programmable bypass switch design for defect-tolerant processor arrays", Proceedings of the International Conference on Wafer Scale Integration, January 1990, pp. 236-240. [6] Anup Gupta, "Proteus - Simulator...

Lakkapragada, Bhavani S

1995-01-01

68

The Impact of Communication Style on Machine Resource Usage for the iWarp Parallel Processor  

E-print Network

The Impact of Communication Style on Machine Resource Usage for the iWarp Parallel Processor T of communication style impacts the usage of processor resources. Parallel program generators map a machine Research Projects Agency, Information Science and Technology Office, under the title ``Research on Parallel

Shewchuk, Jonathan

69

Parallel processors and nonlinear structural dynamics algorithms and software  

NASA Technical Reports Server (NTRS)

Techniques are discussed for the implementation and improvement of vectorization and concurrency in nonlinear explicit structural finite element codes. In explicit integration methods, the computation of the element internal force vector consumes the bulk of the computer time. The program can be efficiently vectorized by subdividing the elements into blocks and executing all computations in vector mode. The structuring of elements into blocks also provides a convenient way to implement concurrency by creating tasks which can be assigned to available processors for evaluation. The techniques were implemented in a 3-D nonlinear program with one-point quadrature shell elements. Concurrency and vectorization were first implemented in a single time step version of the program. Techniques were developed to minimize processor idle time and to select the optimal vector length. A comparison of run times between the program executed in scalar, serial mode and the fully vectorized code executed concurrently using eight processors shows speed-ups of over 25. Conjugate gradient methods for solving nonlinear algebraic equations are also readily adapted to a parallel environment. A new technique for improving convergence properties of conjugate gradients in nonlinear problems is developed in conjunction with other techniques such as diagonal scaling. A significant reduction in the number of iterations required for convergence is shown for a statically loaded rigid bar suspended by three equally spaced springs.

Belytschko, Ted

1990-01-01

70

Bit-parallel arithmetic in a massively-parallel associative processor  

NASA Technical Reports Server (NTRS)

A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m^2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.

Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.

1992-01-01

71

An informal introduction to program transformation and parallel processors  

SciTech Connect

In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers, but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, as my previous use of computers was as a classroom demonstration tool.

Hopkins, K.W. [Southwest Baptist Univ., Bolivar, MO (United States)]

1994-08-01

72

On program restructuring, scheduling, and communication for parallel processor systems  

SciTech Connect

This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed focusing on a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler was used to transform programs in a parallel form and conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm. The performance of this algorithm is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented. 69 refs., 74 figs., 14 tabs.

Polychronopoulos, Constantine D.

1986-08-01
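The loop coalescing named in the abstract above can be illustrated with a small sketch. This is only the general idea (a nest of loops collapsed into one loop whose single index is decoded back into the original indices), not Parafrase's actual transformation; the function names and the Python setting are assumptions made for illustration.

```python
def nested(n, m):
    """Original form: a two-level loop nest over an n-by-m iteration space."""
    visited = []
    for i in range(n):
        for j in range(m):
            visited.append((i, j))
    return visited


def coalesced(n, m):
    """Coalesced form: one loop of n*m iterations.

    The original indices are recovered from the single index k,
    so a scheduler only has to hand out iterations of one loop.
    """
    visited = []
    for k in range(n * m):
        i, j = divmod(k, m)  # i = k // m, j = k % m
        visited.append((i, j))
    return visited


if __name__ == "__main__":
    print(nested(2, 3) == coalesced(2, 3))
```

Both functions visit the same index pairs in the same order; the coalesced version simply exposes the whole iteration space as a single range that can be divided among processors.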

73

Beam dynamics calculations and particle tracking using massively parallel processors  

SciTech Connect

During the past decade massively parallel processors (MPPs) have slowly gained acceptance within the scientific community. At present these machines typically contain a few hundred to one thousand off-the-shelf microprocessors and a total memory of up to 32 GBytes. The potential performance of these machines is illustrated by the fact that a month-long job on a high-end workstation might require only a few hours on an MPP. The acceptance of MPPs has been slow for a variety of reasons. For example, some algorithms are not easily parallelizable. Also, in the past these machines were difficult to program. But in recent years the development of Fortran-like languages such as CM Fortran and High Performance Fortran has made MPPs much easier to use. In the following we will describe how MPPs can be used for beam dynamics calculations and long-term particle tracking.

Ryne, R.D.; Habib, S.

1995-12-31

74

Serial multiplier arrays for parallel computation  

NASA Technical Reports Server (NTRS)

Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal is introduced for this application.

Winters, Kel

1990-01-01

75

Dynamic process simulation of a distillation column on a shared memory parallel processor computer  

SciTech Connect

This study reports the results of an investigation into applying parallel computing on a shared memory multiprocessor computer to the dynamic process simulation of a distillation column using a sequential modular simulator. Two DYFLO process simulation models of distillation columns were parallelized and ported to a BBN Butterfly Parallel Processor computer. Computations were performed with up to 14 concurrently operating processors. General performance aspects of simulation on parallel computers are discussed, and speedup as a function of the number of concurrently operating processors is reported for the two distillation column simulations.

Cera, G.D. (Central Research and Development Dept., E.I. du Pont de Nemours and Co., Wilmington, DE (US))

1988-01-01

76

Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementations  

Microsoft Academic Search

In some signal processing applications, it is desirable to build very high performance fast Fourier transform (FFT) processors. To meet the performance requirements, these processors are typically highly pipelined. Until the advent of VLSI, it was not possible to build a single chip which could be used to construct pipeline FFT processors of a reasonable size. However, VLSI implementations have

Erling Wold; Alvin M. Despain

1984-01-01

77

Periodic Application of Concurrent Error Detection in Processor Array Architectures. PhD. Thesis -  

NASA Technical Reports Server (NTRS)

Processor arrays can provide an attractive architecture for some applications. Featuring modularity, regular interconnection and high parallelism, such arrays are well-suited for VLSI/WSI implementations, and applications with high computational requirements, such as real-time signal processing. Preserving the integrity of results can be of paramount importance for certain applications. In these cases, fault tolerance should be used to ensure reliable delivery of a system's service. One aspect of fault tolerance is the detection of errors caused by faults. Concurrent error detection (CED) techniques offer the advantage that transient and intermittent faults may be detected with greater probability than with off-line diagnostic tests. Applying time-redundant CED techniques can reduce hardware redundancy costs. However, most time-redundant CED techniques degrade a system's performance.

Chen, Paul Peichuan

1993-01-01

78

Array-of-arrays architecture for parallel floating point multiplication  

Microsoft Academic Search

This paper presents a new architecture style for the design of a parallel floating point multiplier. The proposed architecture is a synergy of trees and arrays. Architectural models were designed to implement the 53-bit mantissa path of the IEEE standard 754 for floating point multiplication, and tested for functionality in Verilog. The design, which was done in dual-rail domino, was

Hema Dhanesha; Katayoun Falakshahi; Mark Horowitz

1995-01-01

79

Massively parallel processor networks with optical express channels  

DOEpatents

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination. 3 figs.

Deri, R.J.; Brooks, E.D. III; Haigh, R.E.; DeGroot, A.J.

1999-08-24

80

Massively parallel processor networks with optical express channels  

DOEpatents

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination.

Deri, Robert J. (Pleasanton, CA); Brooks, III, Eugene D. (Livermore, CA); Haigh, Ronald E. (Tracy, CA); DeGroot, Anthony J. (Castro Valley, CA)

1999-01-01

81

Holographic optical backplane hardware implementation for parallel and distributed processors  

NASA Astrophysics Data System (ADS)

A working model of an optical backplane has been built to demonstrate the feasibility of incorporating free space, multifaceted and angularly-multiplexed, holographic interconnect technology to enhance the electronic processing architecture. This new design will allow special configurations for parallel and distributed processing and can be made compatible with standard electrical bus connections. The current demonstrator unit contains four transceiver boards in a standard 19 in. rack-mount chassis. It can support bidirectional 125 MHz transmission per channel with a loss budget (allowable optical attenuation) of 30 dB for large fan-out (> 20 boards). Interconnection holograms have been designed to compensate for the large wavelength drift of laser diodes expected to be the result of temperature fluctuations in the processor box. The design also allows a large mechanical tolerance for board misalignment and vibration. Multiple interconnection patterns, each set representing a particular architecture, can be recorded on a single substrate to provide reconfiguration. The proposed holo-backplane can interconnect multiple transmitters and/or receivers (each could support different logic and/or signal levels) per board to realize truly flexible processing schemes.

Kim, Richard C.; Lin, Freddie S.

1991-09-01

82

On nonlinear finite element analysis in single-, multi- and parallel-processors  

NASA Technical Reports Server (NTRS)

Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.

Utku, S.; Melosh, R.; Islam, M.; Salama, M.

1982-01-01

83

PARALLEL AND CONCURRENT SEARCH FOR FAST AND/OR TREE SEARCH ON MULTICORE PROCESSORS  

Microsoft Academic Search

This paper proposes a fast AND/OR tree search algorithm using a multiple paths parallel and concurrent search scheme for embedded multicore processors. Currently, not only PCs or supercomputers but also information appliances such as game consoles, mobile devices and car navigation systems are equipped with multicore processors for better cost performance and lower power consumption. However, the number

Fumiyo Takano; Yoshitaka Maekawa; Hironori Kasahara

2009-01-01

84

A Mobile Robot with Onboard Parallel Processor and Large Workspace Arm  

Microsoft Academic Search

The MIT AI Lab's second mobile robot, MOBOT-2, has a number of unique design features. In this paper we describe two of them in detail. First, MOBOT-2 has an extremely cheap 32 processor distributed control system. The processor system, called BARNACLE, runs asynchronously with no central locus of control. Unlike almost all other parallel processors this one has

Rodney A. Brooks; Jon Connell; Anita Flynn

1986-01-01

85

Design of a dataway processor for a parallel image signal processing system  

NASA Astrophysics Data System (ADS)

Recently, demands for high-speed signal processing have been increasing, especially in the fields of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called 'dataway processor' designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates at 8-bit parallel in a full duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel. Therefore, sufficient throughput is available for high-speed digital video signals. The processor is designed in a top-down fashion using a CAD system called 'PARTHENON.' The hardware is fabricated using 0.5-micrometer CMOS technology, and comprises about 200 K gates.

Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu

1995-04-01

86

A Compact FPGA Implementation of a Bit-Serial SIMD Cellular Processor Array  

E-print Network

A Compact FPGA Implementation of a Bit-Serial SIMD Cellular Processor Array Declan Walsh and Piotr Kingdom declan.walsh@postgrad.manchester.ac.uk; p.dudek@manchester.ac.uk Abstract-- An FPGA implementation to form an array. A 32 × 32 processing element array is implemented on a low-cost Xilinx XC5VLX50 FPGA

Dudek, Piotr

87

Numerically stable Jacobi array for parallel singular value decomposition (SVD) updating  

NASA Astrophysics Data System (ADS)

A novel algorithm is presented for updating the singular value decomposition in parallel. It is an improvement upon an earlier developed Jacobi-type SVD updating algorithm, where now the exact orthogonality of a certain matrix is guaranteed by means of a minimal factorization in terms of angles. Its orthogonality is known to be crucial for the numerical stability of the overall algorithm. The factored approach leads to a triangular array of rotation cells, implementing an orthogonal matrix-vector multiplication, and a novel array for SVD updating. Both arrays can be built up of CORDIC processors since the algorithms make exclusive use of orthogonal planar transformations.

Vanpoucke, Filiep J.; Moonen, Marc; Deprettere, Ed F. A.

1994-10-01

88

The execube parallel processor chip and cellular automata for tactical route planning  

SciTech Connect

Parallel processing and the methods to program, coordinate, and operate such computing systems and architectures have become a vast and highly pursued research area. These computing environments have quickly become a vital means to attack difficult compute-intensive algorithmic and heuristic problems. We present a high level description of the Execube processor. This parallel processor chip comprises eight computing engines, local memories, and a high speed message passing system on a single chip type. Chip extensibility and interconnection is a simple building-block approach. A class of computational models known as cellular automata (CA) is one of several that can effectively exploit Execube's parallelism. We discuss CA algorithms in general and present sample applications applicable to the Execube parallel processor.

Bezek, J.D.; Stiles, P.

1995-09-01
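The abstract above does not say which cellular automaton the authors use for route planning, so the following is only a generic sketch of how a CA can compute routes: every free cell repeatedly sets its value to one more than the smallest value among its four neighbours, which spreads a distance "wavefront" outward from the goal. The grid encoding, the function names `ca_step` and `ca_distances`, and the four-neighbour rule are all assumptions chosen for illustration.

```python
INF = float("inf")


def ca_step(dist, blocked):
    """One synchronous CA update: each free cell looks only at its 4 neighbours."""
    rows, cols = len(dist), len(dist[0])
    new = [row[:] for row in dist]
    for r in range(rows):
        for c in range(cols):
            if blocked[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and not blocked[rr][cc]:
                    new[r][c] = min(new[r][c], dist[rr][cc] + 1)
    return new


def ca_distances(blocked, goal):
    """Iterate the CA until no cell changes; returns step counts to the goal."""
    rows, cols = len(blocked), len(blocked[0])
    dist = [[INF] * cols for _ in range(rows)]
    dist[goal[0]][goal[1]] = 0
    while True:
        new = ca_step(dist, blocked)
        if new == dist:
            return dist
        dist = new


if __name__ == "__main__":
    # 1 marks an impassable cell (hypothetical terrain).
    terrain = [[0, 1, 0],
               [0, 1, 0],
               [0, 0, 0]]
    for row in ca_distances(terrain, goal=(0, 2)):
        print(row)
```

Because each cell's update depends only on its immediate neighbours, all cells can in principle be updated at the same time, which is the property that makes this class of models a natural fit for a parallel processor. A route is then read off by walking from any start cell to a neighbour with a smaller value until the goal is reached.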

89

Dynamic grid manipulation for PDES (partial differential equations) on hypercube parallel processors. Research report  

SciTech Connect

Adaptive methods for partial differential equations can be considered as a problem in managing a dynamic graph on a parallel processor. The properties wanted for this graph are that edges in the graph, as much as possible, map to nearest neighbor links in the parallel processor, and that changes to the graph not require a major re-arrangement of the mapping of the nodes of the graph onto the processor. A simple restricted set of transformations is discussed that is easy to implement on a message-passing parallel processor, and the design choices made are discussed in detail. These transformations are based on maintaining a graph of bounded node degree at the cost of a small amount of global communication. In designing adaptive algorithms for parallel processors, the communication speeds of the processors can have a major effect on the design. Different hypercubes can be characterized by three parameters: the floating point speed, the I/O startup time, and the I/O transfer rate. One aspect of algorithm design influenced by interprocessor communication is data structure granularity. The author discusses how this affects the choice of algorithm as a function of the three parameters for this algorithm and relates this to his experiments.

Gropp, W.D.

1986-03-01
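The link between granularity and the startup-time and transfer-rate parameters in the abstract above can be made concrete with a simple cost model. The linear form (a fixed startup cost per message plus a per-byte cost) is a common textbook assumption and is not taken from the report; the numeric values below are hypothetical.

```python
def message_time(nbytes, startup, rate):
    """Time to send one message: fixed startup cost plus bytes / transfer rate."""
    return startup + nbytes / rate


def total_comm_time(total_bytes, n_messages, startup, rate):
    """Time to move total_bytes when the data is split into n_messages equal parts."""
    per_message = total_bytes / n_messages
    return n_messages * message_time(per_message, startup, rate)


if __name__ == "__main__":
    startup = 1e-3   # hypothetical: 1 ms to start a message
    rate = 1e6       # hypothetical: 1 MB/s transfer rate
    for n in (1, 10, 100):
        print(n, total_comm_time(8000, n, startup, rate))
```

Under this model the per-byte cost is the same however the data is split, but the startup cost is paid once per message, so fine-grained data structures that generate many small messages are penalized on machines with a large startup time.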

90

A design of an associative memory array processor for ultrasonograph image acquisition and processing.  

PubMed

This paper describes a design of an associative memory array processor that can be used in the acquisition and processing of ultrasonograph images. The major concept is to design a parallel architecture that reduces a task's execution time by analyzing multiple parts of the image concurrently. The architecture constitutes a distinctive type of single-instruction stream, multiple-data stream machine that is built around content-addressable associative memory slabs that allow parallel access to multiple memory words. The basic building block of this architecture is a one-pixel processing element, which can perform the standard load (data acquisition) function and also contains some special comparison logic to enable its content to be compared with external data. Several image processing operations are implemented in parallel, among them: component labeling, size filtering, pattern centralization, and pattern recognition. The proposed novel architecture can label specific regions in the image and isolate them intelligently. It is also capable of storing templates that may be considered as references for similar cases. The system is able to perform a learning process, extract features from several input patterns, and store the reference pattern in a slice. Moreover, the system is capable of comparing an input image with a pre-stored template during the recognition process. The proposed architecture is of interest because it speeds up the recognition process and helps radiology specialists to write their reports confidently. PMID:10834246

Aly, G M; el-Nadi, N M; Fayed, Z T; Faheem, H M

2000-01-01

91

Preliminary study on the potential usefulness of array processor techniques for structural synthesis  

NASA Technical Reports Server (NTRS)

The effects of the use of array processor techniques within the structural analyzer program, SPAR, are simulated in order to evaluate the potential analysis speedups which may result. In particular, the connection of a Floating Point System AP120 processor to the PRIME computer is discussed. Measurements of execution, input/output, and data transfer times are given. Using these data, estimates are made as to the relative speedups that can be executed in a more complete implementation on an array processor maxi-mini computer system.

Feeser, L. J.

1980-01-01

92

Evaluation of fault-tolerant parallel-processor architectures over long space missions  

NASA Technical Reports Server (NTRS)

The impact of a five-year space mission environment on fault-tolerant parallel processor architectures is examined. The target application is a Strategic Defense Initiative (SDI) satellite requiring 256 parallel processors to provide the computation throughput. The reliability requirements are that the system still be operational after five years with 0.99 probability and that the probability of system failure during one-half hour of full operation be less than 10^-7. The fault tolerance features an architecture must possess to meet these reliability requirements are presented, many potential architectures are briefly evaluated, and one candidate architecture, the Charles Stark Draper Laboratory's Fault-Tolerant Parallel Processor (FTPP), is evaluated in detail. A methodology for designing a preliminary system configuration to meet the reliability and performance requirements of the mission is then presented and demonstrated by designing an FTPP configuration.

Johnson, Sally C.

1989-01-01

93

Using algebra for massively parallel processor design and utilization  

NASA Technical Reports Server (NTRS)

This paper summarizes the author's advances in the design of dense processor networks. Within is reported a collection of recent constructions of dense symmetric networks that provide the largest known values for the number of nodes that can be placed in a network of a given degree and diameter. The constructions are in the range of current potential engineering significance and are based on groups of automorphisms of finite-dimensional vector spaces.

Campbell, Lowell; Fellows, Michael R.

1990-01-01
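For context on the degree and diameter question in the abstract above, the classical Moore bound gives an upper limit on how many nodes any network of a given degree and diameter can contain. The bound is a standard graph-theory result and is not stated in the abstract itself; the function name below is an assumption for illustration.

```python
def moore_bound(degree, diameter):
    """Upper bound on the node count of a graph with the given maximum
    degree and diameter: 1 + d * sum_{i=0}^{k-1} (d - 1)^i."""
    if degree < 1 or diameter < 0:
        raise ValueError("degree must be >= 1 and diameter >= 0")
    return 1 + degree * sum((degree - 1) ** i for i in range(diameter))


if __name__ == "__main__":
    for d, k in ((3, 2), (7, 2), (4, 3)):
        print(d, k, moore_bound(d, k))
```

Very few graphs actually reach this bound (the Petersen graph, with degree 3 and diameter 2, is one that does), which is why constructions that come as close as possible to it are of interest for dense processor networks.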

94

Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors  

NASA Technical Reports Server (NTRS)

In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

Fijany, Amir (inventor); Bejczy, Antal K. (inventor)

1994-01-01

95

Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis  

NASA Technical Reports Server (NTRS)

During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays provide the cost, volume, and capacity of current disk subsystems but, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures.

Gibson, Garth Alan

1990-01-01
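The claim in the abstract above that a code as simple as parity can correct a single, self-identifying disk failure can be sketched as follows. This is a minimal illustration of XOR parity over equal-sized blocks, not the thesis's actual disk layout; the function names and the toy block contents are assumptions.

```python
from functools import reduce


def parity(blocks):
    """Byte-wise XOR of equal-length blocks (one block per disk)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity_block):
    """Rebuild the one missing block: XOR of the parity with all survivors.

    This works only because the failure is self-identifying, i.e. we know
    which disk is missing and can leave it out of the XOR.
    """
    return parity(list(surviving_blocks) + [parity_block])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]   # contents of three data disks
    p = parity(data)                     # contents of the parity disk
    # Suppose disk 1 fails; rebuild it from disks 0 and 2 plus parity.
    print(reconstruct([data[0], data[2]], p) == data[1])
```

Since XOR is its own inverse, XOR-ing the parity block with every surviving data block cancels their contributions and leaves exactly the lost block.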

96

Array distribution in data-parallel programs  

NASA Technical Reports Server (NTRS)

We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.

Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.

1994-01-01

97

Aligning parallel arrays to reduce communication  

NASA Technical Reports Server (NTRS)

Axis and stride alignment is an important optimization in compiling data-parallel programs for distributed-memory machines. We previously developed an optimal algorithm for aligning array expressions. Here, we examine alignment for more general program graphs. We show that optimal alignment is NP-complete in this setting, so we study heuristic methods. This paper makes two contributions. First, we show how local graph transformations can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. Second, we give a heuristic that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. Our algorithms have been implemented; we present experimental results showing their effect on the performance of some example programs running on the CM-5.

Sheffler, Thomas J.; Schreiber, Robert; Gilbert, John R.; Chatterjee, Siddhartha

1994-01-01

98

Parallel processors and nonlinear structural dynamics algorithms and software  

NASA Technical Reports Server (NTRS)

A nonlinear structural dynamics program with an element library that exploits parallel processing is under development. The aim is to exploit scheduling-allocation so that parallel processing and vectorization can effectively be treated in a general purpose program. As a byproduct an automatic scheme for assigning time steps was devised. A rudimentary form of the program is complete and has been tested; it shows substantial advantage can be taken of parallelism. In addition, a stability proof for the subcycling algorithm has been developed.

Belytschko, T.

1986-01-01

99

A GaAs vector processor based on parallel RISC microprocessors  

NASA Astrophysics Data System (ADS)

A vector processor architecture based on the development of a 32-bit microprocessor using gallium arsenide (GaAs) technology has been developed. The McDonnell Douglas vector processor (MVP) will be fabricated completely from GaAs digital integrated circuits. The MVP architecture includes a vector memory of 1 megabyte, a parallel bus architecture with eight processing elements connected in parallel, and a control processor. The processing elements consist of a reduced instruction set CPU (RISC) with four floating-point coprocessor units and necessary memory interface functions. This architecture has been simulated for several benchmark programs including complex fast Fourier transform (FFT), complex inner product, trigonometric functions, and sort-merge routine. The results of this study indicate that the MVP can process a 1024-point complex FFT at a speed of 112 microsec (389 megaflops) while consuming approximately 618 W of power in a volume of approximately 0.1 ft-cubed.

Misko, Tim A.; Rasset, Terry L.

100

Scheduling Two Classes of Exponential Jobs on Parallel Processors: Structural Results and Worst Case Analysis  

E-print Network

Scheduling Two Classes of Exponential Jobs on Parallel Processors: Structural Results and Worst Case Analysis Cheng-Shang Chang*, Randolph Nelson, and Michael Pinedo** Feb., 1990 Revised Oct classes of jobs. We assume that all jobs are present at time 0 and there are no further arrivals

Chang, Cheng-Shang

101

128-channel spike sorting processor with a parallel-folding structure in 90nm process  

Microsoft Academic Search

An emerging class of neural prostheses aims to provide more aggressive performance by realizing advanced realtime signal processing algorithms, in particular spike sorting, on chips. To support realtime spike sorting for 128 channels, the traditional fully parallel approach duplicating 128 processing units results in a large burden on chip area. The fully folding approach sharing one processor over 128

Tung-Chien Chen; Wentai Liu; Liang-Gee Chen

2009-01-01

102

Digital signal array processor for NSLS booster power supply upgrade  

SciTech Connect

The booster at the NSLS is being upgraded from 0.75 to 2 pulses per second. To accomplish this, new power supplies for the dipole, quadrupole, and sextupole have been installed. This paper will outline the design and function of the digital signal processor used as the primary control element in the power supply control system.

Olsen, R.; Dabrowski, J. [Brookhaven National Lab., Upton, NY (United States)]; Murray, J. [State Univ. of New York, Stony Brook, NY (United States)]

1993-07-01

103

UNIVERSITY OF CALIFORNIA Speculative Parallelization on Multicore Processors  

E-print Network

been one that I will cherish forever. My deepest gratitude is to Dr. Rajiv Gupta, who leads me to today. Finally, I would like to thank my family, particularly my father Quanlai Tian, my mother Qingfang Li is dedicated to my family. ABSTRACT OF THE DISSERTATION Speculative Parallelization on Multicore

Gupta, Rajiv

104

Series-parallel method of direct solar array regulation  

NASA Technical Reports Server (NTRS)

A 40 watt experimental solar array was directly regulated by shorting out appropriate combinations of series and parallel segments of a solar array. Regulation switches were employed to control the array at various set-point voltages between 25 and 40 volts. Regulation to within ±0.5 volt was obtained over a range of solar array temperatures and illumination levels as an active load was varied from open circuit to maximum available power. A fourfold reduction in regulation switch power dissipation was achieved with series-parallel regulation as compared to the usual series-only switching for direct solar array regulation.

Gooder, S. T.

1976-01-01

105

Radiative heat transfer in arrays of parallel cylinders  

Microsoft Academic Search

A theoretical and experimental study of radiative heat transfer in arrays of parallel cylinders is presented. Attention is primarily directed toward two geometries common in the nuclear industry: square arrays of cylinders on a square pitch and hexagonal arrays of cylinders on an equilateral triangular pitch. Configuration factors for cylinders on square and equilateral triangular pitches are derived using Hottel's

R. L. Cox

1976-01-01

106

Software development on the High-Speed Systolic Array Processor (HISSAP): Lessons learned. Final report, Mar 88-Mar 91  

SciTech Connect

This report documents the lessons learned in programming the Naval Ocean System Center's (NOSC's) High-Speed Systolic Array Processor (HISSAP) testbed. The procedures used for code generation, along with the programming utilities provided in the software development environment, are discussed with regard to their impact on the efficient implementation of algorithms on a parallel processing system such as HISSAP. This information is intended for considerations pertaining to software-development environments in future Navy parallel processing systems. Many of HISSAP's software-development utilities played key roles in the implementation of two computationally intensive algorithms: the Multiple-Signal Classification algorithm (MUSIC) and a four-channel, narrowband, finite-impulse response (FIR) filter. The introduction of utilities not included with the HISSAP tools would undoubtedly have increased the speed and efficiency of software development.

Tirpak, F.M.

1991-06-01

107

Parallel processors and nonlinear structural dynamics algorithms and software  

NASA Technical Reports Server (NTRS)

The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine, is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.

Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.

1989-01-01

108

Operation of an adaptive processor using a photorefractive parallel integrator  

NASA Astrophysics Data System (ADS)

A parallel optical technique for use in the adaptive processing of radar signals is described. We have demonstrated such a technique in an acousto-optic adaptive signal processing system designed to null the jamming of radar signals. This system features a time integrating correlator using a new type of photorefractive spatial light modulator exhibiting high levels of contrast, resolution, uniformity and sensitivity. Using this system we have demonstrated nulling of both narrow and wide band RF signals. The operation of the photorefractive integrating device, its performance and that of the overall system are discussed.

Vachss, Frederick; Hong, John; Keefer, Chris; Malowicki, John

1992-12-01

109

Efficient exploitation of parallelism in general purpose processor-based systems for image and video processing applications  

NASA Astrophysics Data System (ADS)

This paper proposes a classification of the parallelisms in general-purpose processor based systems in three main categories. One category is the intra-processor parallelism that includes multimedia instructions and superscalar and VLIW architectures. The former takes advantage of data parallelism. The latter benefit from instruction level parallelism. Another category is the inter-processor parallelism. We consider the parallelism between processors inside shared memory symmetric multiprocessor systems and in distributed memory clusters of workstations. Finally, in the last category, main features of the system level parallelism are studied including the input/output operations, the memory hierarchy and the exploitation of external processing. The potential gain is studied for each type of parallelism available in general-purpose processor based systems from a theoretical point of view as well as for existing image and video applications. The results in this paper showed that the exploitation of the different levels of parallelism available in PC workstations can lead to considerable gains in speed when optimizing a multimedia application. Finally the results of this work can be used to influence the design of new multimedia systems and media processors.

Debes, Eric; Moschetti, Fulvio

2002-01-01

110

Othello Solver based on a soft-core MIMD processor array  

Microsoft Academic Search

This report presents an Othello Solver based on a 32-bit original soft-core Multiple Instruction stream, Multiple Data stream (MIMD) processor array targeting a single field programmable gate array (FPGA), Cyclone II (EP2C70D896C6N), on a DE2 Development and Education Board (Altera Corp.). The solver can execute a move-checking operation, a disc flipping operation, a move selection operation, an evaluation operation, and

T. Mabuchi; T. Watanabe; R. Moriwaki; Y. Aoyama; A. Gundjalam; Y. Yamaji; H. Nakada; M. Watanabe

2010-01-01

111

Implementation of context independent code on a new array processor: The Super-65  

NASA Technical Reports Server (NTRS)

The feasibility of rewriting standard uniprocessor programs into code which contains no context-dependent branches is explored. Context independent code (CIC) would contain no branches that might require different processing elements to branch different ways. In order to investigate the possibilities and restrictions of CIC, several programs were recoded into CIC and a four-element array processor was built. This processor (the Super-65) consisted of three 6502 microprocessors and the Apple II microcomputer. The results obtained were somewhat dependent upon the specific architecture of the Super-65 but within bounds, the throughput of the array processor was found to increase linearly with the number of processing elements (PEs). The slope of throughput versus PEs is highly dependent on the program and varied from 0.33 to 1.00 for the sample programs.

Colbert, R. O.; Bowhill, S. A.

1981-01-01
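
As an illustration of the context-independent code idea in the Super-65 record above, the following minimal sketch replaces a data-dependent branch with arithmetic selection, so that every processing element would execute the same instruction sequence. It is not taken from the report; the function names and the clamp example are hypothetical.

```python
def clamp_branching(x, limit):
    # Context-dependent version: different data can take different branches.
    if x > limit:
        return limit
    return x


def clamp_branch_free(x, limit):
    # Context-independent version: the comparison yields 0 or 1, and the two
    # candidate results are blended arithmetically without any branch.
    over = int(x > limit)
    return over * limit + (1 - over) * x


values = [3, 9, 5, 12]
results = [clamp_branch_free(v, 8) for v in values]
```

Both functions return the same values; only the second avoids a branch whose direction depends on the data held by each processing element.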

112

Algorithm-Based Error Detection Of A Cholesky Factor Updating Systolic Array Using Cordic Processors  

NASA Astrophysics Data System (ADS)

Lincoln Laboratory has developed an architecture for a folded linear systolic array using fixed-point CORDIC processors, applicable to adaptive nulling for a radar sidelobe canceler. The algorithm implemented uses triangularization by Givens rotations to solve a least-squares problem in the voltage domain. In this paper, the implementation of an inexpensive algorithm-based error-detection scheme is proposed for this systolic array. Column average checksum encoding is intended to detect most errors caused by the failure of any single arithmetic unit. It retains or almost retains the 100% processor utilization of Lincoln Laboratory's novel design. For the case of 64 degrees of freedom, the increase in time complexity is only 3%. The increase in hardware is mainly two adders and two comparators per CORDIC processor. We believe that the small increase in cost will be amply offset by the improvement in system performance brought about by this error detection.

Chou, S. I.; Rader, Charles M.

1989-12-01
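
The triangularization by Givens rotations mentioned in the record above can be sketched as follows. This is a floating-point illustration under simple assumptions, not the fixed-point CORDIC implementation or the checksum scheme from the paper; the helper names are hypothetical.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0  # nothing to rotate
    return a / r, b / r


def apply_givens(c, s, row_i, row_j):
    # Rotate two matrix rows; the leading entry of row_j becomes zero.
    new_i = [c * x + s * y for x, y in zip(row_i, row_j)]
    new_j = [-s * x + c * y for x, y in zip(row_i, row_j)]
    return new_i, new_j


row0 = [3.0, 1.0]
row1 = [4.0, 2.0]
c, s = givens(row0[0], row1[0])
row0, row1 = apply_givens(c, s, row0, row1)
```

Repeating this step for each sub-diagonal entry turns a matrix into upper-triangular form, which is the operation a systolic array of rotation cells pipelines.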

113

Interconnection arrangement of routers of processor boards in array of cabinets supporting secure physical partition  

DOEpatents

A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure includes routers in service or compute processor boards distributed in an array of cabinets connected in series on each board and to respective routers in neighboring row cabinet boards with the routers in series connection coupled to routers in series connection in respective neighboring column cabinet boards. The array can include disconnect cabinets or respective routers in all boards in each cabinet connected in a toroid. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.

Tomkins, James L. (Albuquerque, NM); Camp, William J. (Albuquerque, NM)

2007-07-17

114

A unified approach to VLSI layout automation and algorithm mapping on processor arrays  

NASA Technical Reports Server (NTRS)

Development of software tools for designing supercomputing systems is highly complex and cost ineffective. To tackle this, a special-purpose PAcube silicon compiler which integrates different design levels from cell to processor arrays has been proposed. As a part of this, we present in this paper a novel methodology which unifies the problems of Layout Automation and Algorithm Mapping.

Venkateswaran, N.; Pattabiraman, S.; Srinivasan, Vinoo N.

1993-01-01

115

Parallel collective resonances in arrays of gold nanorods.  

PubMed

In this work we discuss the excitation of parallel collective resonances in arrays of gold nanoparticles. Parallel collective resonances result from the coupling of the nanoparticles' localized surface plasmons with diffraction orders traveling in the direction parallel to the polarization vector. While they provide field enhancement and delocalization as the standard collective resonances, our results suggest that parallel resonances could exhibit greater tolerance to index asymmetry in the environment surrounding the arrays. The near- and far-field properties of these resonances are analyzed, both experimentally and numerically. PMID:24645987

Vitrey, Alan; Aigouy, Lionel; Prieto, Patricia; García-Martín, José Miguel; González, María U

2014-01-01

116

High-Level Modeling and FPGA Prototyping of Produced Order Parallel Queue Processor Core  

Microsoft Academic Search

Emerging high-level hardware description and synthesis technologies in conjunction with field programmable gate arrays (FPGAs) have significantly lowered the threshold for hardware development. Opportunities exist to integrate these technologies into a tool for exploring and evaluating microarchitectural designs especially for newly proposed architectures. This paper presents a prototyping of a new processor core based on Queue architecture as starting point

Ben A. Abderazek; Tsutomu Yoshinaga; Masahiro Sowa

2006-01-01

117

Run-time recognition of task parallelism within the P++ parallel array class library  

SciTech Connect

This paper explores the use of a run-time system to recognize task parallelism with a C++ array class library. Run-time systems currently support data parallelism in P++, FORTRAN 90 D, and High Performance FORTRAN. But data parallelism is insufficient for many applications, including adaptive mesh refinement. Without access to both data and task parallelism such applications exhibit several orders of magnitude more message passing and poor performance. In this work, a C++ array class library is used to implement deferred evaluation and run-time dependence for task parallelism recognition, to obtain task parallelism through a data flow interpretation of data parallel array statements. Performance results show that the analysis and optimizations are both efficient and practical, allowing us to consider more substantial optimizations.

Parsons, R.; Quinlan, D.

1993-11-01

118

Performance Analysis of Parallel Right-Looking Sparse LU Factorization on Two Dimensional Grids of Processors  

Microsoft Academic Search

We investigate performance characteristics for the LU factorization of large matrices with various sparsity patterns. We consider supernodal right-looking parallel factorization on a two dimensional grid of processors, making use of static pivoting. We develop a performance model and we validate it using the implementation in SuperLU_DIST, the real matrices and the IBM Power3 machine at NERSC. We use this

Laura Grigori; Xiaoye S. Li

2004-01-01

119

Construction of a parallel processor for simulating manipulators and other mechanical systems  

NASA Technical Reports Server (NTRS)

This report summarizes the results of NASA Contract NAS5-30905, awarded under phase 2 of the SBIR Program, for a demonstration of the feasibility of a new high-speed parallel simulation processor, called the Real-Time Accelerator (RTA). The principal goals were met, and EAI is now proceeding with phase 3: development of a commercial product. This product is scheduled for commercial introduction in the second quarter of 1992.

Hannauer, George

1991-01-01

120

Parallel pipeline networking and signal processing with field-programmable gate arrays (FPGAs) and VCSEL-MSM smart pixels  

NASA Astrophysics Data System (ADS)

We present a networking and signal processing architecture called Transpar-TR (Translucent Smart Pixel Array-Token-Ring) that utilizes smart pixel technology to perform 2D parallel optical data transfer between digital processing nodes. Transpar-TR moves data through the network in the form of 3D packets (2D spatial and 1D time). By utilizing many spatial parallel channels, Transpar-TR can achieve high throughput, low latency communication between nodes, even with each channel operating at moderate data rates. The 2D array of optical channels is created by an array of smart pixels, each with an optical input and optical output. Each smart pixel consists of two sections, an optical network interface and ALU-based processor with local memory. The optical network interface is responsible for transmitting and receiving optical data packets using a slotted token ring network protocol. The smart pixel array operates as a single-instruction multiple-data processor when processing data. The Transpar-TR network, consisting of networked smart pixel arrays, can perform pipelined parallel processing very efficiently on 2D data structures such as images and video. This paper discusses the Transpar-TR implementation in which each node is the printed circuit board integration of a VCSEL-MSM chip, a transimpedance receiver array chip and an FPGA chip.

Kuznia, C. B.; Sawchuk, Alexander A.; Zhang, Liping; Hoanca, Bogdan; Hong, Sunkwang; Min, Chris; Pansatiankul, Dhawat E.; Alpaslan, Zahir Y.

2000-05-01

121

A 1,000 Frames/s Programmable Vision Chip with Variable Resolution and Row-Pixel-Mixed Parallel Image Processors  

PubMed Central

A programmable vision chip with variable resolution and row-pixel-mixed parallel image processors is presented. The chip consists of a CMOS sensor array, with row-parallel 6-bit Algorithmic ADCs, row-parallel gray-scale image processors, pixel-parallel SIMD Processing Element (PE) array, and instruction controller. The resolution of the image in the chip is variable: high resolution for a focused area and low resolution for general view. It implements gray-scale and binary mathematical morphology algorithms in series to carry out low-level and mid-level image processing and sends out features of the image for various applications. It can perform image processing at over 1,000 frames/s (fps). A prototype chip with 64 × 64 pixels resolution and 6-bit gray-scale image is fabricated in a 0.18 μm standard CMOS process. The chip area is 1.5 mm × 3.5 mm. Each pixel is 9.5 μm × 9.5 μm and each processing element is 23 μm × 29 μm. The experimental results demonstrate that the chip can perform low-level and mid-level image processing and that it can be applied in real-time vision applications, such as high-speed target tracking. PMID:22454565

Lin, Qingyu; Miao, Wei; Zhang, Wancheng; Fu, Qiuyu; Wu, Nanjian

2009-01-01

122

Evaluation of soft-core processors on a Xilinx Virtex-5 field programmable gate array.  

SciTech Connect

Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable field programmable gate array (FPGA)-based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA-based soft-core processors for use in future NBA systems: the MicroBlaze (uB), the open-source Leon3, and the licensed Leon3. Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration.

Learn, Mark Walter

2011-04-01

123

Application of an array processor to the analysis of magnetic data for the Doublet III tokamak  

SciTech Connect

Discussed herein is a fast computational technique employing the Floating Point Systems AP-190L array processor to analyze magnetic data for the Doublet III tokamak, a fusion research device. Interpretation of the experimental data requires the repeated solution of a free-boundary nonlinear partial differential equation, which describes the magnetohydrodynamic (MHD) equilibrium of the plasma. For this particular application, we have found that the array processor is only 1.4 and 3.5 times slower than the CDC-7600 and CRAY computers, respectively. The overhead on the host DEC-10 computer was kept to a minimum by chaining the complete Poisson solver and free-boundary algorithm into one single-load module using the vector function chainer (VFC). A simple time-sharing scheme for using the MHD code is also discussed.

Wang, T.S.; Saito, M.T.

1980-08-01

124

Parallel Access of Out-Of-Core Dense Extendible Arrays  

SciTech Connect

Datasets used in scientific and engineering applications are often modeled as dense multi-dimensional arrays. For very large datasets, the corresponding array models are typically stored out-of-core as array files. The array elements are mapped onto linear consecutive locations that correspond to the linear ordering of the multi-dimensional indices. Two conventional mappings used are the row-major order and the column-major order of multi-dimensional arrays. Such conventional mappings of dense array files highly limit the performance of applications and the extendibility of the dataset. Firstly, an array file that is organized in, say, row-major order causes applications that subsequently access the data in column-major order to have abysmal performance. Secondly, any subsequent expansion of the array file is limited to only one dimension. Expansions of such out-of-core conventional arrays along arbitrary dimensions require storage reorganization that can be very expensive. We present a solution for storing out-of-core dense extendible arrays that resolves the two limitations. The method uses a mapping function F*(), together with information maintained in axial vectors, to compute the linear address of an extendible array element when passed its k-dimensional index. We also give the inverse function, F-1*(), for deriving the k-dimensional index when given the linear address. We show how the mapping function, in combination with MPI-IO and a parallel file system, allows for the growth of the extendible array without reorganization and with no significant performance degradation of applications accessing elements in any desired order. We give methods for reading and writing sub-arrays into and out of parallel applications that run on a cluster of workstations. The axial-vectors are replicated and maintained in each node that accesses sub-array elements.

Otoo, Ekow J; Rotem, Doron

2007-07-26
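
The two conventional mappings named in the record above, row-major and column-major order, can be sketched as below. This shows only the standard address arithmetic; it is not the paper's extendible mapping function F*(), and the function names are hypothetical.

```python
def row_major_address(index, shape):
    # Last dimension varies fastest (C-style layout).
    addr = 0
    for i, n in zip(index, shape):
        addr = addr * n + i
    return addr


def column_major_address(index, shape):
    # First dimension varies fastest (Fortran-style layout).
    addr = 0
    for i, n in zip(reversed(index), reversed(shape)):
        addr = addr * n + i
    return addr


shape = (3, 4)                            # 3 rows, 4 columns
rm = row_major_address((1, 2), shape)     # 1 * 4 + 2
cm = column_major_address((1, 2), shape)  # 2 * 3 + 1
```

Because both formulas depend on the full shape, growing any dimension other than the slowest-varying one changes the address of existing elements, which is the reorganization cost the abstract describes.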

125

A Coarse-Grained Array based Baseband Processor for 100Mbps+ Software Defined Radio  

Microsoft Academic Search

The Software-Defined Radio (SDR) concept aims to enable cost-effective multi-mode baseband solutions for wireless terminals. However, the growing complexity of new communication standards applying, e.g., multi-antenna transmission techniques, together with the reduced energy budget, is challenging SDR architectures. Coarse-Grained Array (CGA) processors are strong candidates to undertake both high performance and low power. The design of a candidate

Bruno Bougard; Bjorn De Sutter; Sebastien Rabou; David Novo; Osman Allam; Steven Dupont; Liesbet Van Der Perre

2008-01-01

126

Realization of a neuronal hardware with digital signal processor and programmable gate arrays  

NASA Astrophysics Data System (ADS)

In this paper we describe how the processing speed of a radial basis neural network can be increased by the use of field programmable gate arrays (FPGAs). The calculation of the very time-consuming exponential function is performed by an optimized CORDIC processor. We determine the number of FPGAs required and compare the processing speed of the FPGA and DSP implementations for an application in speech recognition.

Meyer-Baese, Anke; Meyer-Baese, Uwe; Scheich, Henning

1995-04-01

127

Technology development and circuit design for a parallel laser programmable floating point application specific processor  

NASA Astrophysics Data System (ADS)

The laser programmable floating point application specific processor (LPASP) is a new approach to the rapid development of custom VLSI chips. The LPASP is a generic application specific processor that can be programmed to perform a specific function. The effort of this thesis is to develop and test the double precision floating point adder and the laser programmable read-only memory (LPROM) that are macrocells within the LPASP. In addition, the applicability of an LPASP parallel processing system is analyzed. The double precision floating point adder is an adder/subtractor macrocell designed to comply with the IEEE double precision floating point standard. An 84-pin chip of the adder was fabricated using 2 micron feature sizes. The fastest processing time was measured at 120 nanoseconds over 23 worst case test vectors. The adder uses the optimized carry multiplexed (OCM) adder that was developed at AFIT. The OCM adder is a new adder architecture that uses four parallel carry paths to attain a performance time on the order of the cube root of M with a gate count of O(n). The redundant logic associated with the parallel propagation banks is eliminated in the OCM adder so that the largest bit-slice of the adder contains only eight 2-to-1 multiplexer gates. A 57-bit adder was fabricated using 2 micron feature sizes. The processing time for the adder is 31 nsec.

Scriber, Michael W.

1989-12-01

128

Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors  

SciTech Connect

An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering more than 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.

Aaby, Brandon G [ORNL; Perumalla, Kalyan S [ORNL; Seal, Sudip K [ORNL

2010-01-01

129

Constant Time Algorithms for the Transitive Closure and Some Related Graph Problems on Processor Arrays with Reconfigurable Bus Systems  

Microsoft Academic Search

The transitive closure problem in O(1) time is solved by a new method that is far different from the conventional solution method. On processor arrays with reconfigurable bus systems, two O(1) time algorithms are proposed for computing the transitive closure of an undirected graph. One is designed on a three-dimensional n*n*n processor array with a reconfigurable bus system, and the

Biing-feng Wang; Gen-huey Chen

1990-01-01
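
For readers unfamiliar with the problem in the record above, the sketch below computes the transitive closure of a small undirected graph using the conventional sequential Warshall algorithm, which takes O(n^3) time. It is a baseline for comparison only, not the O(1)-time reconfigurable-bus method the paper proposes; the example graph is made up.

```python
def transitive_closure(adj):
    """Warshall's algorithm on a boolean adjacency matrix."""
    n = len(adj)
    reach = [row[:] for row in adj]  # copy so the input is left unchanged
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


# Hypothetical 4-vertex undirected graph: path 0 - 1 - 2, vertex 3 isolated.
n = 4
adj = [[False] * n for _ in range(n)]
for u, v in [(0, 1), (1, 2)]:
    adj[u][v] = adj[v][u] = True

closure = transitive_closure(adj)
```

In the closure, vertices 0 and 2 are connected through vertex 1, while vertex 3 remains unreachable from the others.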

130

Parallel arrays of Josephson junctions for submillimeter local oscillators  

NASA Technical Reports Server (NTRS)

In this paper we discuss the influence of the DC biasing circuit on operation of parallel biased quasioptical Josephson junction oscillator arrays. Because of nonuniform distribution of the DC biasing current along the length of the bias lines, there is a nonuniform distribution of magnetic flux in superconducting loops connecting every two junctions of the array. These DC self-field effects determine the state of the array. We present analysis and time-domain numerical simulations of these states for four biasing configurations. We find conditions for the in-phase states with maximum power output. We compare arrays with small and large inductances and determine the low inductance limit for nearly-in-phase array operation. We show how arrays can be steered in H-plane using the externally applied DC magnetic field.

Pance, Aleksandar; Wengler, Michael J.

1992-01-01

131

Block iterative restoration of astronomical images with the massively parallel processor  

NASA Technical Reports Server (NTRS)

A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 121 x 121 linear systems. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.

Heap, Sara R.; Lindler, Don J.

1987-01-01
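
The figures quoted in the record above can be checked with a short calculation. The sketch assumes the 4900 block systems are solved one after another with no other overhead, which the abstract does not state; the number of iterations is also not given, so only the time for a single pass over all blocks is estimated.

```python
image_side = 500
unknowns = image_side * image_side  # size of the direct linear system

num_blocks = 4900                   # block systems, each 121 x 121
seconds_per_block = 0.06            # approximate solve time quoted for the MPP

# Assumption: blocks are solved sequentially, one pass over the image.
seconds_per_pass = num_blocks * seconds_per_block
```

Under these assumptions one pass takes roughly 294 seconds, or just under five minutes.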

132

Estimating water flow through a hillslope using the massively parallel processor  

NASA Technical Reports Server (NTRS)

A new two-dimensional model of water flow in a hillslope has been implemented on the Massively Parallel Processor at the Goddard Space Flight Center. Flow in the soil both in the saturated and unsaturated zones, evaporation and overland flow are all modelled, and the rainfall rates are allowed to vary spatially. Previous models of this type had always been very limited computationally. This model takes less than a minute to model all the components of the hillslope water flow for a day. The model can now be used in sensitivity studies to specify which measurements should be taken and how accurate they should be to describe such flows for environmental studies.

Devaney, Judy E.; Camillo, P. J.; Gurney, R. J.

1988-01-01

133

Animated computer graphics models of space and earth sciences data generated via the massively parallel processor  

NASA Technical Reports Server (NTRS)

The capability was developed of rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets via the implementation of computer graphics modeling techniques on the Massively Parallel Processor (MPP) by employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.

Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David

1987-01-01

134

PULSAR Users' Guide Parallel Ultra-Light Systolic Array Runtime  

E-print Network

PULSAR Users' Guide: Parallel Ultra-Light Systolic Array Runtime, Version 2.0, November 2014. LAPACK ... Table of contents excerpt: ... for which PULSAR is Suitable; 1.2 Computers for which PULSAR is Suitable; 1.3 Software Components that PULSAR Requires

Tennessee, University of

135

Scattering by an Arbitrary Array of Parallel Wires  

Microsoft Academic Search

Equations are developed for the scattering pattern of an arbitrary array of parallel wires. The wires are assumed to be infinitely long, perfectly conducting, and very small in diameter in comparison with the wavelength. The incident wave is assumed to be TM with respect to the wire axis, but it may have normal or oblique incidence on the wires. The

J. H. Richmond

1965-01-01

136

Parallel optical nanolithography using nanoscale bowtie aperture array  

E-print Network

... arrays are used to focus a laser beam into multiple nanoscale light spots for parallel nano ... References excerpt: ..., "Matching the resolution of electron beam lithography by scanning near-field photolithography," Nano Lett. 4(8), 1381-1384 (2004). 2. M. M. Alkaisi, R. J. Blaikie, S. J. McNab, R. Cheung, and D. R. S. Cumming, "Sub-diffraction

Xu, Xianfan

137

GENOTYPING ON CUSTOM ARRAYS USING A PARALLEL DATA PIPELINE  

E-print Network

GENOTYPING ON CUSTOM ARRAYS USING A PARALLEL DATA PIPELINE, by Brian Albere, B.S., University of New ... Table of contents excerpt: ... Index; 3.7 Filters, Initial Genotype Mapping, and Dip Preparation; 3.8 Dip Test; 3.9 Final Genotype Mapping; 3 ...

New Hampshire, University of

138

Basic data-base operations on the Butterfly Parallel Processor: experiment results. Memorandum report, January-December 1987  

SciTech Connect

The next phase in speeding up data-base queries will be through the use of highly parallel computers. This paper will discuss the basic data-base operations (select, project, natural join, and scalar aggregates) on a shared-memory multiple instruction stream, multiple data stream (MIMD) computer and the problems associated with implementing them. Some problems associated with getting maximum parallelization are improper data division and hot spots. Improper data division results when the number of tasks does not divide evenly among the processors. Hot spots or contentions occur due to locking if accesses are made to the same segment of a RAMFile and also if attempts are made to get data from the same remote processor at the same time. These algorithms have been implemented on the Butterfly Parallel Processor, and the results of our experiments are described in detail.

Rosenau, T.J.; Jajodia, S.

1988-03-04
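
The "improper data division" problem described in the record above can be illustrated with a small sketch. It assumes equal-cost tasks assigned as evenly as possible; the function name and the numbers are hypothetical and do not come from the report.

```python
def divide_tasks(num_tasks, num_processors):
    """Assign equal-cost tasks as evenly as possible; return tasks per processor."""
    base, extra = divmod(num_tasks, num_processors)
    # The first `extra` processors each take one additional task.
    return [base + 1 if p < extra else base for p in range(num_processors)]


loads = divide_tasks(10, 4)

# Task slots left idle while the most heavily loaded processors finish.
idle_slots = sum(max(loads) - load for load in loads)
```

With 10 tasks on 4 processors, two processors receive 3 tasks and two receive 2, so two task slots sit idle during the final round.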

139

Parallel Algorithms for DNA Probe Placement on Small Oligonucleotide Arrays  

E-print Network

Oligonucleotide arrays are used in a wide range of genomic analyses, such as gene expression profiling, comparative genomic hybridization, chromatin immunoprecipitation, SNP detection, etc. During fabrication, the sites of an oligonucleotide array are selectively exposed to light in order to activate oligonucleotides for further synthesis. Optical effects can cause unwanted illumination at masked sites that are adjacent to the sites intentionally exposed to light. This results in synthesis of unforeseen sequences in masked sites and compromises interpretation of experimental data. To reduce such uncertainty, one can exploit freedom in how probes are assigned to array sites. The border length minimization problem (BLMP) seeks a placement of probes that minimizes the sum of border lengths in all masks. In this paper, we propose two parallel algorithms for the BLMP. The proposed parallel algorithms have the local-search paradigm at their core, and are especially developed for the BLMP. The results reported show ...

Trinca, Dragos

2011-01-01

140

An Analog Processor for Image Compression  

NASA Technical Reports Server (NTRS)

This paper describes a novel analog Vector Array Processor (VAP) that was designed for use in real-time and ultra-low power image compression applications. This custom CMOS processor is based architecturally on the Vector Quantization (VQ) algorithm in image coding, and the hardware implementation fully exploits the inherent parallelism built into the VQ algorithm.

Tawel, R.

1992-01-01
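
The parallelism inherent in Vector Quantization, mentioned in the record above, comes from the encoding step: the distance from an input vector to every codeword can be computed independently. The sketch below is a plain software illustration of that step, not the analog hardware from the paper; the codebook values are made up.

```python
def nearest_codeword(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Each distance is independent of the others, so hardware can
    # evaluate them all at once; here they are computed in a loop.
    distances = [dist2(vector, cw) for cw in codebook]
    return min(range(len(codebook)), key=distances.__getitem__)


codebook = [(0, 0), (10, 10), (0, 10)]  # hypothetical 2-D codebook
index = nearest_codeword((1, 9), codebook)
```

Only the winning index is transmitted or stored, which is where the compression comes from.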

141

Naval Research Laboratory flex processor for radar signal processing  

NASA Astrophysics Data System (ADS)

This paper describes a programmable radar signal processor architecture developed at the Naval Research Laboratory (NRL). The design incorporates T.I. TMS320C30 programmable digital signal processor devices, Xilinx programmable gate arrays, TRW FFT devices, and a parallel array of Inmos Transputer microprocessors. The architecture is extremely flexible and is applicable to a wide variety of applications.

Alter, James J.; Evins, James B.; Letellier, J. P.

1991-12-01

142

Performance Evaluation and Modeling Techniques for Parallel Processors. Ph.D. Thesis  

NASA Technical Reports Server (NTRS)

In practice, the performance evaluation of supercomputers is still substantially driven by single-point estimates of metrics (e.g., MFLOPS) obtained by running characteristic benchmarks or workloads. With the rapid increase in the use of time-shared multiprogramming in these systems, such measurements are clearly inadequate. This is because multiprogramming and system overhead, as well as other degradations in performance due to time-varying characteristics of workloads, are not taken into account. In multiprogrammed environments, multiple jobs and users can dramatically increase the amount of system overhead and degrade the performance of the machine. Performance techniques, such as benchmarking, which characterize performance on a dedicated machine ignore this major component of true computer performance. Due to the complexity of analysis, there has been little work done in analyzing, modeling, and predicting the performance of applications in multiprogrammed environments. This is especially true for parallel processors, where the costs and benefits of multi-user workloads are exacerbated. While some may claim that the issue of multiprogramming is not a viable one in the supercomputer market, experience shows otherwise. Even in recent massively parallel machines, multiprogramming is a key component. It has even been claimed that a partial cause of the demise of the CM2 was the fact that it did not efficiently support time-sharing. In the same paper, Gordon Bell postulates that multicomputers will evolve to multiprocessors in order to support efficient multiprogramming. Therefore, it is clear that parallel processors of the future will be required to offer the user a time-shared environment with reasonable response times for the applications. In this type of environment, the most important performance metric is the completion or response time of a given application. However, there are few evaluation efforts addressing this issue.

Dimpsey, Robert Tod

1992-01-01

143

Mobile and replicated alignment of arrays in data-parallel programs  

NASA Technical Reports Server (NTRS)

When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.

Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert

1993-01-01
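
The two-stage mapping described in the record above (alignment to a template, then distribution of the template over processors) can be illustrated with a small sketch. The offset-only alignment, the block distribution, and the function names are assumptions made for this example; it counts how many iterations of the statement A(i) = B(i+1) need an operand from another processor.

```python
def owner(index, offset, block_size):
    """Processor holding an array element.

    Stage 1 (alignment): element `index` maps to template cell index + offset.
    Stage 2 (distribution): template cells are dealt out in contiguous blocks.
    """
    return (index + offset) // block_size

def residual_communication(n, offset_a, offset_b, block_size):
    """Count iterations of A(i) = B(i+1), i = 0..n-2, whose two operands
    live on different processors."""
    return sum(
        owner(i, offset_a, block_size) != owner(i + 1, offset_b, block_size)
        for i in range(n - 1)
    )

# Identical alignment: operands straddle every block boundary.
naive = residual_communication(16, 0, 0, 4)
# Shifting A by one template cell puts A(i) and B(i+1) on the same cell.
shifted = residual_communication(16, 1, 0, 4)
```

In this toy case a single static offset removes all communication; the mobile alignments discussed in the record address loops where no single offset works for every iteration.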

144

Parallel nanoimaging and nanolithography using a heated microcantilever array.  

PubMed

We report parallel topographic imaging and nanolithography using heated microcantilever arrays integrated into a commercial atomic force microscope (AFM). The array has five AFM cantilevers, each of which has an internal resistive heater. The temperatures of the cantilever heaters can be monitored and controlled independently and in parallel. We perform parallel AFM imaging of a region of size 550 μm × 90 μm, where the cantilever heat flow signals provide a measure of the nanometer-scale substrate topography. At a cantilever scan speed of 1134 μm/s, we acquire a 3.1 million-pixel image in 62 s with noise-limited vertical resolution of 0.6 nm and pixels of size 351 nm × 45 nm. At a scan speed of 4030 μm/s, we acquire a 26.4 million-pixel image in 124 s with vertical resolution of 5.4 nm and pixels of size 44 nm × 43 nm. Finally, we demonstrate parallel nanolithography with the cantilever array, including iterations of measure-write-measure nanofabrication, with each cantilever operating independently. PMID:24334342

Somnath, Suhas; Kim, Hoe Joon; Hu, Huan; King, William P

2014-01-10

145

Feasibility of using the Massively Parallel Processor for large eddy simulations and other Computational Fluid Dynamics applications  

NASA Technical Reports Server (NTRS)

The results of an investigation into the feasibility of using the MPP for direct and large eddy simulations of the Navier-Stokes equations are presented. A major part of this study was devoted to the implementation of two of the standard numerical algorithms for CFD. These implementations were not run on the Massively Parallel Processor (MPP) since the machine delivered to NASA Goddard does not have sufficient capacity. Instead, a detailed implementation plan was designed, and from it were derived estimates of the time and space requirements of the algorithms on a suitably configured MPP. In addition, other issues related to the practical implementation of these algorithms on an MPP-like architecture were considered; namely, adaptive grid generation, zonal boundary conditions, the table lookup problem, and the software interface. Performance estimates show that the architectural components of the MPP, the Staging Memory and the Array Unit, appear to be well suited to the numerical algorithms of CFD. This, combined with the prospect of building a faster and larger MPP-like machine, holds the promise of achieving the sustained gigaflop rates that are required for the numerical simulations in CFD.

Bruno, John

1984-01-01

146

On-board landmark navigation and attitude reference parallel processor system  

NASA Technical Reports Server (NTRS)

An approach to autonomous navigation and attitude reference for earth observing spacecraft is described along with the landmark identification technique based on a sequential similarity detection algorithm (SSDA). Laboratory experiments undertaken to determine if better than one pixel accuracy in registration can be achieved consistent with onboard processor timing and capacity constraints are included. The SSDA is implemented using a multi-microprocessor system including synchronization logic and chip library. The data is processed in parallel stages, effectively reducing the time to match the small known image within a larger image as seen by the onboard image system. Shared memory is incorporated in the system to help communicate intermediate results among microprocessors. The functions include finding mean values and summation of absolute differences over the image search area. The hardware is a low power, compact unit suitable to onboard application with the flexibility to provide for different parameters depending upon the environment.

Gilbert, L. E.; Mahajan, D. T.

1978-01-01
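
The sequential similarity detection idea in the record above can be sketched as follows: absolute differences are accumulated over a candidate window, and the window is abandoned as soon as the running sum reaches a threshold. This is a serial illustration under stated assumptions (integer grey levels in nested lists, no mean normalization, and a threshold that tightens to the best error found so far); it is not the multi-microprocessor implementation the record describes.

```python
def ssda_match(image, template, threshold):
    """Return ((row, col), error) for the best window, or (None, threshold)
    if every window reaches the threshold before it is fully summed."""
    th, tw = len(template), len(template[0])
    best_pos, best_err = None, threshold
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            err = 0
            for i in range(th):
                for j in range(tw):
                    err += abs(image[r + i][c + j] - template[i][j])
                if err >= best_err:
                    break  # abandon this window early
            else:
                # Window fully summed and strictly better than the best so far.
                best_pos, best_err = (r, c), err
    return best_pos, best_err

image = [[1, 2, 3, 4],
         [5, 6, 9, 8],
         [9, 1, 7, 6],
         [3, 4, 5, 6]]
template = [[9, 8],
            [7, 6]]
position, error = ssda_match(image, template, threshold=10)
```

Most windows are rejected after their first row, which is where the time saving over a full sum of absolute differences comes from.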

147

High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects  

DOEpatents

As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance has been shown to scale to ≈100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.

Deri, Robert J. (Pleasanton, CA); DeGroot, Anthony J. (Castro Valley, CA); Haigh, Ronald E. (Arvada, CO)

2002-01-01

148

TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. YY, ZZZ 2006 1 Performance Models for Network Processor Design  

E-print Network

Tilman Wolf, Member, IEEE, Mark A. Franklin, Fellow, IEEE Abstract-- To provide and their associated components. An increasingly central component in router design is a chip-multiprocessor (CMP

Shenoy, Prashant

149

Microchannel cross load array with dense parallel input  

DOEpatents

An architecture or layout for microchannel arrays using T or Cross (+) loading for electrophoresis or other injection and separation chemistries that are performed in microfluidic configurations. This architecture enables a very dense layout of arrays of functionally identical shaped channels and it also solves the problem of simultaneously enabling efficient parallel shapes and biasing of the input wells, waste wells, and bias wells at the input end of the separation columns. One T load architecture uses circular holes with common rows, but not columns, which allows the flow paths for each channel to be identical in shape, using multiple mirror image pieces. Another T load architecture enables the access hole array to be formed on a biaxial, collinear grid suitable for EDM micromachining (square holes), with common rows and columns.

Swierkowski, Stefan P.

2004-04-06

150

Parallel vacuum arc discharge with microhollow array dielectric and anode  

SciTech Connect

An electrode configuration with microhollow array dielectric and anode was developed to obtain parallel vacuum arc discharge. Compared with the conventional electrodes, more than 10 parallel microhollow discharges were ignited for the new configuration, which increased the discharge area significantly and made the cathode erode more uniformly. The number of vacuum discharge channels could be increased effectively by decreasing the distances between holes or increasing the arc current. Experimental results revealed that plasmas ejected from the adjacent hollow and the relatively high arc voltage were two key factors leading to the parallel discharge. The characteristics of plasmas in the microhollow were investigated as well. The spectral line intensity and electron density of plasmas in the microhollow increased markedly with the decrease of the microhollow diameter.

Feng, Jinghua; Zhou, Lin; Fu, Yuecheng; Zhang, Jianhua; Xu, Rongkun; Chen, Faxin; Li, Linbo; Meng, Shijian, E-mail: mengshijian04@126.com [Institute of Nuclear Physics and Chemistry, China Academy of Engineering Physics, Mianyang 621900 (China)

2014-07-15

151

Bayesian image reconstruction for emission tomography incorporating Good's roughness prior on massively parallel processors.  

PubMed Central

Since the introduction by Shepp and Vardi [Shepp, L. A. & Vardi, Y. (1982) IEEE Trans. Med. Imaging 1, 113-121] of the expectation-maximization algorithm for the generation of maximum-likelihood images in emission tomography, a number of investigators have applied the maximum-likelihood method to imaging problems. Though this approach is promising, it is now well known that the unconstrained maximum-likelihood approach has two major drawbacks: (i) the algorithm is computationally demanding, resulting in reconstruction times that are not acceptable for routine clinical application, and (ii) the unconstrained maximum-likelihood estimator has a fundamental noise artifact that worsens as the iterative algorithm climbs the likelihood hill. In this paper the computation issue is addressed by proposing an implementation on the class of massively parallel single-instruction, multiple-data architectures. By restructuring the superposition integrals required for the expectation-maximization algorithm as the solutions of partial differential equations, the local data passage required for efficient computation on this class of machines is satisfied. For dealing with the "noise artifact" a Markov random field prior determined by Good's rotationally invariant roughness penalty is incorporated. These methods are demonstrated on the single-instruction multiple-data class of parallel processors, with the computation times compared with those on conventional and hypercube architectures. PMID:2014243

Miller, M I; Roysam, B

1991-01-01
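
For readers unfamiliar with the expectation-maximization iteration mentioned in the record above, the following sketch shows one update of the standard unconstrained maximum-likelihood form on a tiny problem. The system matrix, counts, and function name are illustrative assumptions; the sketch includes neither the partial-differential-equation restructuring nor Good's roughness prior that the paper adds.

```python
def mlem_step(system, counts, image):
    """One expectation-maximization update for emission tomography.

    system[i][j]: probability that an emission in pixel j is recorded in bin i.
    counts[i]:    measured counts in detector bin i.
    image[j]:     current intensity estimate for pixel j.
    """
    n_bins, n_pix = len(system), len(image)
    # Forward projection of the current estimate into detector space.
    forward = [sum(system[i][j] * image[j] for j in range(n_pix))
               for i in range(n_bins)]
    # Ratio of measured to predicted counts (0 where nothing is predicted).
    ratio = [counts[i] / forward[i] if forward[i] > 0 else 0.0
             for i in range(n_bins)]
    # Sensitivity of each pixel: total detection probability over all bins.
    sens = [sum(system[i][j] for i in range(n_bins)) for j in range(n_pix)]
    # Back-project the ratios and rescale the current estimate.
    return [image[j] / sens[j] * sum(system[i][j] * ratio[i] for i in range(n_bins))
            for j in range(n_pix)]

system = [[0.8, 0.2],
          [0.2, 0.8]]
counts = [10.0, 20.0]
estimate = mlem_step(system, counts, [1.0, 1.0])
```

Each pixel update needs only the bin ratios, so the per-pixel work is independent, which is the kind of structure that maps onto single-instruction, multiple-data machines.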

152

Integration Architecture of Content Addressable Memory and Massive-Parallel Memory-Embedded SIMD Matrix for Versatile Multimedia Processor  

NASA Astrophysics Data System (ADS)

This paper presents an integration architecture of content addressable memory (CAM) and a massive-parallel memory-embedded SIMD matrix for constructing a versatile multimedia processor. The massive-parallel memory-embedded SIMD matrix has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. The SIMD matrix architecture is verified to be a better way of processing the repeated arithmetic operation types in multimedia applications. The proposed architecture additionally exploits CAM technology and therefore enables fast pipelined table-lookup coding operations. Since both arithmetic and table-lookup operations execute extremely fast, the proposed architecture can consequently realize efficient and versatile multimedia data processing. Evaluation results of the proposed CAM-enhanced massive-parallel SIMD matrix processor for the frequently used JPEG image-compression application show that the necessary clock cycle number can be reduced by 86% in comparison to a conventional mobile DSP architecture. The determined performance in Mpixel/mm² is better by factors of 3.3 and 4.4 than that of a CAM-less massive-parallel memory-embedded SIMD matrix processor and a conventional mobile DSP, respectively.

Kumaki, Takeshi; Ishizaki, Masakatsu; Koide, Tetsushi; Mattausch, Hans Jürgen; Kuroda, Yasuto; Gyohten, Takayuki; Noda, Hideyuki; Dosaka, Katsumi; Arimoto, Kazutami; Saito, Kazunori

153

Locally adaptive image sensing with the 64x64 cell MIPA4k mixed-mode image processor array  

Microsoft Academic Search

This paper presents an implementation of locally adaptive image sensing with the MIPA4k image processor array. The implemented adaptive sensor circuitry allows the extraction of image detail in the presence of large dynamic intra-scene lighting variations. The implemented circuit method is based on the use of nonlinear resistive network filtering during image integration for controlling the pixel integration times locally,

Jonne Poikonen; Mika Laiho; Ari Paasio

2009-01-01

154

Temperature modeling and emulation of an ASIC temperature monitor system for Tightly-Coupled Processor Arrays (TCPAs)  

NASA Astrophysics Data System (ADS)

This contribution provides an approach for emulating the behaviour of an ASIC temperature monitoring system (TMon) during run-time for a tightly-coupled processor array (TCPA) of a heterogeneous invasive multi-tile architecture to be used for FPGA prototyping. It is based on a thermal RC modeling approach. Different usage scenarios of the TCPA are also analyzed and compared.

Glocker, E.; Boppu, S.; Chen, Q.; Schlichtmann, U.; Teich, J.; Schmitt-Landsiedel, D.

2014-11-01

155

[4] J.W. Lamont, R. H. Iveson, "Array Processor Applications in Power System Planning and Operation",  

E-print Network

. 1977 [11] Power System Dynamic Analysis Phase I, EPRI Final Report, EL-484, July 1977. [12] D.W. Olive[4] J.W. Lamont, R. H. Iveson, "Array Processor Applications in Power System Planning and Operation", Proc. 7th PSCC, pp. 710-717, 1981. [5] R. Pritchard, C. Pottle, "High Speed Power Flows Using

Catholic University of Chile (Universidad Católica de Chile)

156

RAL-TR-1998-060 Co-Array Fortran for parallel programming  

E-print Network

Co-Array Fortran for parallel programming, by R. W. Numrich and J. K. Reid. Abstract: Co-Array Fortran, formerly known as F--, is a small extension of Fortran 95 for parallel

Mihajlovic, Milan D.

157

Xetal-II: A 107 GOPS, 600 mW Massively Parallel Processor for Video Scene Analysis  

Microsoft Academic Search

Xetal-II is a single-instruction multiple-data (SIMD) processor with 320 processing elements. It delivers a peak performance of 107 GOPS on 16-bit data while dissipating 600 mW. A 10 Mbit on-chip memory is provided which can store up to four VGA frames, allowing efficient implementation of frame-iterative algorithms. A massively parallel interconnect provides an internal bandwidth of more than 1.3 Tbit/s

Anteneh A. Abbo; Richard P. Kleihorst; Vishal Choudhary; Leo Sevat; Paul Wielage; Sebastien Mouy; Bart Vermeulen; Marc Heijligers

2008-01-01

158

XETAL-II: A 107 GOPS, 600mW Massively-Parallel Processor for Video Scene Analysis  

Microsoft Academic Search

Xetal-II is a SIMD processor with 320 processing elements delivering a peak performance of 107 GOPS on 16b data while dissipating 600mW. A 10Mb on-chip memory can store up to 4 VGA frames, allowing efficient implementation of frame-iterative algorithms. A massively parallel interconnect provides an internal bandwidth of more than 1.3Tb/s to sustain the peak performance. The 74mm² IC is fabricated

A. Abbo; R. Kleihorst; V. Choudhary; L. Sevat; P. Wielage; S. Mouy; M. Heijligers

2007-01-01

159

Investigations on the usefulness of the Massively Parallel Processor for study of electronic properties of atomic and condensed matter systems  

NASA Technical Reports Server (NTRS)

The usefulness of the Massively Parallel Processor (MPP) for investigation of electronic structures and hyperfine properties of atomic and condensed matter systems was explored. The major effort was directed towards the preparation of algorithms for parallelization of the computational procedure being used on serial computers for electronic structure calculations in condensed matter systems. Detailed descriptions of investigations and results are reported, including MPP adaptation of self-consistent charge extended Hueckel (SCCEH) procedure, MPP adaptation of the first-principles Hartree-Fock cluster procedure for electronic structures of large molecules and solid state systems, and MPP adaptation of the many-body procedure for atomic systems.

Das, T. P.

1988-01-01

160

Computers: Massively parallel processors. (Latest citations from INSPEC the database for Physics, Electronics, and Computing). Published Search  

SciTech Connect

The bibliography contains citations concerning a concept in computers called Massively Parallel Processing. The processing power of a computer may be increased by using numerous processors in parallel and feeding data through a number of different computational paths at the same time. The citations explore these computers and their practical uses, and include case studies, specific problems solved, theory, and future possibilities and needs. Applications of neural network modeling, pattern recognition, image processing, local area routing, and genetic sequence comparison are discussed. (Contains 250 citations and includes a subject term index and title list.)

Not Available

1993-10-01

161

Computers: Massively parallel processors. (Latest citations from the INSPEC: Information Services for the Physics and Engineering Communities database). Published Search  

SciTech Connect

The bibliography contains citations concerning a concept in computers called Massively Parallel Processing. The processing power of a computer may be increased by using numerous processors in parallel and feeding data through a number of different computational paths at the same time. The citations explore these computers and their practical uses, and include case studies, specific problems solved, theory, and future possibilities and needs. Applications of neural network modeling, pattern recognition, image processing, local area routing, and genetic sequence comparison are discussed. (Contains 250 citations and includes a subject term index and title list.)

Not Available

1993-06-01

162

Real-Time Adaptive Lossless Hyperspectral Image Compression using CCSDS on Parallel GPGPU and Multicore Processor Systems  

NASA Technical Reports Server (NTRS)

The proposed CCSDS (Consultative Committee for Space Data Systems) Lossless Hyperspectral Image Compression Algorithm was designed to facilitate a fast hardware implementation. This paper analyses that algorithm with regard to available parallelism and describes fast parallel implementations in software for GPGPU and Multicore CPU architectures. We show that careful software implementation, using hardware acceleration in the form of GPGPUs or even just multicore processors, can exceed the performance of existing hardware and software implementations by up to 11x and break the real-time barrier for the first time for a typical test application.

Hopson, Ben; Benkrid, Khaled; Keymeulen, Didier; Aranki, Nazeeh; Klimesh, Matt; Kiely, Aaron

2012-01-01

163

Nanocavity crossbar arrays for parallel electrochemical sensing on a chip  

PubMed Central

Summary We introduce a novel device for the mapping of redox-active compounds at high spatial resolution based on a crossbar electrode architecture. The sensor array is formed by two sets of 16 parallel band electrodes that are arranged perpendicular to each other on the wafer surface. At each intersection, the crossing bars are separated by a ca. 65 nm high nanocavity, which is stabilized by the surrounding passivation layer. During operation, perpendicular bar electrodes are biased to potentials above and below the redox potential of species under investigation, thus, enabling repeated subsequent reactions at the two electrodes. By this means, a redox cycling current is formed across the gap that can be measured externally. As the nanocavity devices feature a very high current amplification in redox cycling mode, individual sensing spots can be addressed in parallel, enabling high-throughput electrochemical imaging. This paper introduces the design of the device, discusses the fabrication process and demonstrates its capabilities in sequential and parallel data acquisition mode by using a hexacyanoferrate probe. PMID:25161846

Kätelhön, Enno; Mayer, Dirk; Banzet, Marko; Offenhäusser, Andreas

2014-01-01

164

Proc. 17th Conf. on Advanced Research in VLSI, September 15-17, 1997. © IEEE CS Kestrel: Design of an 8-bit SIMD parallel processor  

E-print Network

of an 8-bit SIMD parallel processor David M. Dahle Jeffrey D. Hirschberg Kevin Karplus Hansjorg Keller Eric of applications. This work was supported in part by NSF grant MIP-9423985 and its REU supplement. Keller

Hughey, Richard

165

Mechanically verified hardware implementing an 8-bit parallel IO Byzantine agreement processor  

NASA Technical Reports Server (NTRS)

Consider a network of four processors that use the Oral Messages (Byzantine Generals) Algorithm of Pease, Shostak, and Lamport to achieve agreement in the presence of faults. Bevier and Young have published a functional description of a single processor that, when interconnected appropriately with three identical others, implements this network under the assumption that the four processors step in synchrony. By formalizing the original work of Pease et al., Bevier and Young mechanically proved that such a network achieves fault tolerance. We develop, formalize, and discuss a hardware design that has been mechanically proven to implement their processor. In particular, we formally define mapping functions from the abstract state space of the Bevier-Young processor to a concrete state space of a hardware module and state a theorem that expresses the claim that the hardware correctly implements the processor. We briefly discuss the Brock-Hunt Formal Hardware Description Language, which permits designs both to be proved correct with the Boyer-Moore theorem prover and to be expressed in a commercially supported hardware description language for additional electrical analysis and layout. We briefly describe our implementation.

Moore, J. Strother

1992-01-01
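
The four-processor agreement scheme named in the record above can be illustrated with a textbook-style sketch of one round of the Oral Messages algorithm, OM(1), for one commander and three lieutenants. The function names, the string values, and the "retreat" default on a tie are assumptions made for the example; the sketch says nothing about the verified hardware design itself.

```python
from collections import Counter

def majority(values, default="retreat"):
    """Most common value, or the default when the top two counts tie."""
    top = Counter(values).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return default
    return top[0][0]

def om1(commander_sends, relay):
    """One round of Oral Messages for a commander and its lieutenants.

    commander_sends[i]: value the commander sends to lieutenant i.
    relay(i, j, v):     value lieutenant i tells lieutenant j it received
                        (a loyal lieutenant passes v on unchanged).
    Returns the decision reached by each lieutenant.
    """
    n = len(commander_sends)
    decisions = []
    for j in range(n):
        heard = [commander_sends[j]]
        heard += [relay(i, j, commander_sends[i]) for i in range(n) if i != j]
        decisions.append(majority(heard))
    return decisions

def loyal(i, j, v):
    return v

def lieutenant_2_lies(i, j, v):
    return "retreat" if i == 2 else v

# Loyal commander, faulty lieutenant 2: lieutenants 0 and 1 still obey.
case_faulty_lieutenant = om1(["attack"] * 3, lieutenant_2_lies)
# Faulty commander sends mixed orders: loyal lieutenants still agree.
case_faulty_commander = om1(["attack", "retreat", "attack"], loyal)
```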

166

Experimental results for a photonic time reversal processor for the adaptive control of an ultra wideband phased array antenna  

NASA Astrophysics Data System (ADS)

This paper describes a new concept for a photonic implementation of a time reversed RF antenna array beamforming system. The process does not require analog to digital conversion to implement and is therefore particularly suited for high bandwidth applications. Significantly, propagation distortion due to atmospheric effects, clutter, etc. is automatically accounted for with the time reversal process. The approach utilizes the reflection of an initial interrogation signal from an extended target to precisely time match the radiating elements of the array so as to re-radiate signals precisely back to the target's location. The backscattered signal(s) from the desired location is captured by each antenna and used to modulate a pulsed laser. An electrooptic switch acts as a time gate to eliminate any unwanted signals, such as those reflected from other targets whose range is different from that of the desired location, resulting in a spatial null at that location. A chromatic dispersion processor is used to extract the exact array parameters of the received signal location. Hence, other than an approximate knowledge of the steering direction needed only to approximately establish the time gating, no knowledge of the target position is required, and hence no knowledge of the array element time delay is required. Target motion and/or array element jitter is automatically accounted for. Presented here are experimental results that demonstrate the ability of a photonic processor to perform the time-reversal operation on ultra-short electronic pulses.

Zmuda, Henry; Fanto, Michael; McEwen, Thomas

2008-04-01

167

Clocking and circuit design for a parallel I/O on a first-generation CELL processor  

Microsoft Academic Search

A parallel I/O is integrated on a first-generation CELL processor in 90nm SOI CMOS. A clock-tracking architecture suppresses reference jitter to achieve 6.4Gbit/s/link operation at 21.6mW/Gbit/s. SOI effects on analog circuits, in particular high-speed receivers, are addressed to achieve a receiver sensitivity of ±12mV at 6.4Gbit/s with BER <10^-14 measured using 7b PRBS data.

Ken Chang; Sudhakar Pamarti; Kambiz Kaviani; Elad Alon; Xudong Shi; Jie Shen; Gary Yip; Chris Madden; Ralf Schmitt; Chuck Yuan; Fari Assaderaghi; M. Horowitz

2005-01-01

168

A parallel FPGA implementation for real-time 2D pixel clustering for the ATLAS Fast Tracker Processor  

NASA Astrophysics Data System (ADS)

The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing, and the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. This flexibility makes the implementation suitable for a variety of demanding image processing applications. The implementation is robust against bit errors in the input data stream and drops all data that cannot be identified. In the unlikely event of missing control words, the implementation will ensure stable data processing by inserting the missing control words in the data stream. The 2D pixel clustering implementation is developed and tested in both single flow and parallel versions. The first parallel version with 16 parallel cluster identification engines is presented. The input data from the RODs are received through S-Links and the processing units that follow the clustering implementation also require a single data stream; therefore, data parallelizing (demultiplexing) and serializing (multiplexing) modules are introduced in order to accommodate the parallelized version and restore the data stream afterwards.
The results of the first hardware tests of the single flow implementation on the custom FTK input mezzanine (IM) board are presented. We report on the integration of 16 parallel engines in the same FPGA and the resulting performance. The parallel 2D-clustering implementation has sufficient processing power to meet the specification for the Pixel layers of ATLAS, for up to 80 overlapping pp collisions that correspond to the maximum LHC luminosity planned until 2022.

Sotiropoulou, C. L.; Gkaitatzis, S.; Annovi, A.; Beretta, M.; Kordas, K.; Nikolaidis, S.; Petridou, C.; Volpi, G.

2014-10-01
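
As a software picture of the two purposes of clustering stated in the record above (data reduction and centroid determination), the sketch below groups pixel hits into clusters and reports one centroid per cluster. It assumes 8-connected neighbourhoods and an unweighted centroid, and it uses a plain flood fill rather than the moving-window technique of the FPGA implementation, whose details the abstract does not give.

```python
def cluster_hits(hits):
    """Group pixel hits into 8-connected clusters and return their centroids.

    hits: iterable of (column, row) integer coordinates.
    Returns a sorted list of (mean_column, mean_row) tuples, one per cluster.
    """
    remaining = set(hits)
    centroids = []
    while remaining:
        stack = [remaining.pop()]  # seed a new cluster
        cluster = []
        while stack:
            col, row = stack.pop()
            cluster.append((col, row))
            # Pull in every not-yet-visited hit that touches this pixel.
            for dc in (-1, 0, 1):
                for dr in (-1, 0, 1):
                    neighbour = (col + dc, row + dr)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        stack.append(neighbour)
        n = len(cluster)
        centroids.append((sum(c for c, _ in cluster) / n,
                          sum(r for _, r in cluster) / n))
    return sorted(centroids)

hits = [(0, 0), (1, 0), (1, 1), (5, 5), (6, 6)]
centroids = cluster_hits(hits)
```

Five hits reduce to two centroids here; independent clusters like these are what separate engines could process in parallel.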

169

Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank  

NASA Astrophysics Data System (ADS)

Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes including the well-known true time-delay and the phased array beamformers have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of a high computational complexity and frequency-dependent far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency independent RF beams at an order of magnitude lower multiplier complexity compared to FFT or FIR filter based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth, and fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize high-precision recursive filter structures necessary for real-time beamforming, at RF radio bandwidths, are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There exists native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold (B = N Fclk/2) bandwidth compared to the maximum clock frequency Fclk Hz of the digital VLSI platform under consideration.
Such an increase in bandwidth is achieved without use of polyphase signal processing or time-interleaved ADC methods. That is, all digital processors operate at the same Fclk clock frequency without phasing, while wideband operation is achieved by sub-sampling of narrower sub-bands at the RF channelizer outputs.
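The bandwidth relation B = N Fclk/2 and the sub-sampling idea can be illustrated with a small numeric sketch. It assumes that each of the N channelizer outputs occupies one Nyquist zone of width Fclk/2 and is sampled directly at Fclk; the function names, the 400 MHz clock, and the test tone are illustrative choices, not values from the abstract.

```python
def total_bandwidth(n_subbands, f_clk):
    """Aggregate RF bandwidth B = N * Fclk / 2 covered by N sub-bands."""
    return n_subbands * f_clk / 2.0


def subband_index(f_rf, f_clk):
    """Zero-based Nyquist zone (sub-band of width Fclk/2) that contains f_rf."""
    return int(f_rf // (f_clk / 2.0))


def aliased_frequency(f_rf, f_clk):
    """Apparent frequency in [0, Fclk/2] of a tone at f_rf after sampling at f_clk."""
    f = f_rf % f_clk
    return f if f <= f_clk / 2.0 else f_clk - f


f_clk = 400e6   # assumed digital clock / sample rate in Hz
n = 4           # assumed number of channelizer sub-bands
tone = 730e6    # assumed RF tone, well above f_clk / 2

print("covered bandwidth (Hz):", total_bandwidth(n, f_clk))
print("tone falls in sub-band:", subband_index(tone, f_clk))
print("tone appears at (Hz):  ", aliased_frequency(tone, f_clk))
```

Under these assumptions every sub-band is folded into the same 0 to Fclk/2 range, so each digital processor can run at Fclk even though the aggregate input spans N times that range.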

Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.

2014-05-01

170

Scalable Unix commands for parallel processors : a high-performance implementation.  

SciTech Connect

We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.

Ong, E.; Lusk, E.; Gropp, W.

2001-06-22

171

On Parallelization of High-Speed Processors for Elliptic Curve Cryptography  

Microsoft Academic Search

This paper discusses parallelization of elliptic curve cryptography hardware accelerators using elliptic curves over binary fields F(2^m). Elliptic curve point multiplication, which is the operation used in every elliptic curve cryptosystem, is hierarchical in nature, and parallelism can be utilized in different hierarchy levels as shown in many publications. However, a comprehensive analysis on the effects of parallelization has not

Kimmo U. Järvinen; Jorma Skyttä

2008-01-01

172

Evaluation of the Leon3 soft-core processor within a Xilinx radiation-hardened field-programmable gate array.  

SciTech Connect

The purpose of this document is to summarize the work done to evaluate the performance of the Leon3 soft-core processor in a radiation environment while instantiated in a radiation-hardened static random-access memory based field-programmable gate array. This evaluation will look at the differences between two soft-core processors: the open-source Leon3 core and the fault-tolerant Leon3 core. Radiation testing of these two cores was conducted at the Texas A&M University Cyclotron facility and Lawrence Berkeley National Laboratory. The results of these tests are included within the report along with designs intended to improve the mitigation of the open-source Leon3. The test setup used for evaluating both versions of the Leon3 is also included within this document.

Learn, Mark Walter

2012-01-01

173

A Parallel FPGA Implementation for Real-Time 2D Pixel Clustering for the ATLAS Fast TracKer Processor  

E-print Network

The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing, the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently thus exploiting more FPGA resources. ...
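The two purposes of clustering described above (data reduction and centroid finding) can be sketched in software. The example below is a generic flood-fill clustering of 8-connected hit pixels with an unweighted centroid; it is an assumed stand-in for illustration and does not reproduce the moving-window FPGA logic of the FTK implementation. The hit coordinates are made up.

```python
def cluster_hits(hits):
    """Group hit pixels given as (column, row) into 8-connected clusters."""
    remaining = set(hits)
    clusters = []
    while remaining:
        seed = remaining.pop()
        stack = [seed]
        cluster = [seed]
        while stack:
            col, row = stack.pop()
            # Visit the 3 x 3 neighbourhood around the current pixel.
            for d_col in (-1, 0, 1):
                for d_row in (-1, 0, 1):
                    neighbour = (col + d_col, row + d_row)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        stack.append(neighbour)
                        cluster.append(neighbour)
        clusters.append(sorted(cluster))
    return sorted(clusters)


def centroid(cluster):
    """Unweighted mean position of the pixels in one cluster."""
    n = len(cluster)
    return (sum(c for c, _ in cluster) / n, sum(r for _, r in cluster) / n)


hits = [(1, 1), (1, 2), (2, 2), (7, 5), (8, 5)]  # hypothetical hit list
clusters = cluster_hits(hits)
for cl in clusters:
    print(cl, "->", centroid(cl))
```

Five hits are reduced to two centroids here, which is the data-reduction effect the abstract refers to; independent clusters could in principle be handled by separate workers, mirroring the multi-core parallelization mentioned above.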

Sotiropoulou, C-L; The ATLAS collaboration; Annovi, A; Beretta, M; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

2014-01-01

174

Parallel plate lens with metal hole array for terahertz wave band  

NASA Astrophysics Data System (ADS)

Optical devices for terahertz wave band from 0.1 to 10 THz are rapidly expanding and require better designs. This paper proposes and designs a parallel plate lens with metal hole array for the terahertz wave band. The fast wave effect is due to the parallel plate. For this lens, the parallel plate spacing and hole array dimensions control the phase velocity and the focusing effect. It is not necessary to control the phase through the lens shape, which is flat, itself. The periodic analysis model extracted from the full model confirms the phase control by the metal hole array dimensions. The periodic model can be used for efficient iterative design. The full wave analysis results are also obtained by ANSYS HFSS and the focusing effect is confirmed. Phase control using both the parallel plate and the hole array enhances the focusing effect over the focusing effect controlled only by the metal hole array dimensions.

Suzuki, Takehito; Yonamine, Hiroki; Konno, Takuya; Young, John C.; Takano, Keisuke; Hangyo, Masanori

2014-05-01

175

A C++-embedded Domain-Specific Language for Programming the MORA Soft Processor Array  

E-print Network

Vanderbauwhede, W.; Margala, M.; Chalamalasetti, S.R.; Purohit, S. TBP Proc. ASAP 2010: 21st IEEE International Conference on Application-specific Systems, Architectures and Processors, Rennes, France, July 2010

Vanderbauwhede, W.

176

Exploiting Fine-grain Thread Level Parallelism on the MIT Multi-ALU Processor  

Microsoft Academic Search

Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level with a grain-size of a single instruction or by partitioning applications into coarse threads with grain-sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between

Stephen W. Keckler; William J. Dally; Daniel Maskit; Nicholas P. Carter; Andrew Chang; Whay Sing Lee

1998-01-01

177

Multimode power processor  

DOEpatents

In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources. 31 figs.

O'Sullivan, G.A.; O'Sullivan, J.A.

1999-07-27

178

Multimode power processor  

DOEpatents

In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources.

O'Sullivan, George A. (Pottersville, NJ); O'Sullivan, Joseph A. (St. Louis, MO)

1999-01-01

179

A parallel algorithm for solving complex multibody problems with stream processors  

Microsoft Academic Search

This paper describes a numerical method for the parallel solution of the differential measure inclusion problem posed by mechanical multibody systems containing bilateral and unilateral frictional constraints. The method proposed has been implemented as a set of parallel algorithms leveraging NVIDIA's Compute Unified Device Architecture (CUDA) library support for multi-core stream computing. This allows the proposed solution to run on

Alessandro Tasora; Dan Negrut

2009-01-01

180

Acoustooptic linear algebra processors - Architectures, algorithms, and applications  

NASA Technical Reports Server (NTRS)

Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.

Casasent, D.

1984-01-01

181

Sparsely Faceted Arrays: A Mechanism Supporting Parallel Allocation, Communication, and Garbage Collection  

E-print Network

Conventional parallel computer architectures do not provide support for non-uniformly distributed objects. In this thesis, I introduce sparsely faceted arrays (SFAs), a new low-level mechanism for naming regions of memory, ...

Brown, Jeremy Hanford

2002-06-01

182

Sparsely faceted arrays : a mechanism supporting the parallel allocation, communication, and garbage collection  

E-print Network

Conventional parallel computer architectures do not provide support for non-uniformly distributed objects. In this thesis, I introduce sparsely faceted arrays (SFAs), a new low-level mechanism for naming regions of memory, ...

Brown, Jeremy Hanford, 1972-

2002-01-01

183

Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays  

SciTech Connect

At Jefferson Lab, we have been evaluating soft core processors running an EPICS IOC over μClinux on our custom hardware. A soft core processor is a flexible CPU architecture that is configured in the FPGA as opposed to a hard core processor which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital control hardware within a single FPGA. By eliminating the general purpose computer IOC, the designer is no longer tied to a specific platform, e.g. PC, VME, or VXI, to serve as the intermediary between the high level controls and the field hardware. This paper will discuss the design and development process as well as specific applications for JLab's next generation low-level RF controls and Machine Protection Systems.

Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

2005-09-20

184

Cellular array processor with individual cell-level data-dependent cell control and multiport input memory  

SciTech Connect

In a processor array of the type including a plurality of individual processing cells, the combination therewith of a processing cell structure for inclusion in the array. The processing cell comprising: memory means having multiple input ports each for receiving separate input data from the controller and a plurality of output ports, an arithmetic logic unit (ALU) having a plurality of input ports each separate one coupled to an output port of the memory means, with the arithmetic logic unit having an output port, first register means coupled to the output port of the ALU for determining the status of the cell and for providing status data, second register means coupled to the first register means and the controller and operative to receive instructions from the controller and status data from the first register means to store therein a code indicative of an operating condition for the cell.

Morton, S.G.

1990-03-06

185

The Panda Array I/O library on the Galley Parallel File System  

E-print Network

1 The Panda Array I/O library on the Galley Parallel File System Joel T. Thomas Dartmouth Computer Joel.T.Thomas@dartmouth.edu June 5, 1996 Abstract The Panda Array I/O library, created some time, and the Panda project is an attempt to ameliorate this problem while still providing

186

General-purpose 128 x 128 SIMD processor array with integrated image sensor  

E-print Network

processing elements (APEs). While these processors are implemented using analogue circuitry, performing arithmetic and logic operations on data stored in the local memory. Each APE includes nine of data. APEs can communicate and exchange data with four nearest neigh- bours. All APEs execute identical

Dudek, Piotr

187

Development of a FPGA-based high speed FFT processor for wideband Direction of Arrival applications  

Microsoft Academic Search

A parallel and pipelined Fast Fourier Transform (FFT) processor for use in the Direction of Arrival (DOA) estimation of a wideband waveform is presented. The selected DOA algorithm follows the Coherent Signal Subspace Method (CSSM). The target device for implementation is a Xilinx Virtex-5 Field Programmable Gate Array (FPGA). The FFT processor was developed in MATLAB Simulink using the Xilinx

Mohsin Jamali; Joseph Downey; Nathan Wilikins; Christopher R. Rehm; J. Tipping

2009-01-01

188

Fast String Search on Multicore Processors: Mapping fundamental algorithms onto parallel hardware  

SciTech Connect

String searching is one of these basic algorithms. It has a host of applications, including search engines, network intrusion detection, virus scanners, spam filters, and DNA analysis, among others. The Cell processor, with its multiple cores, promises to speed up string searching substantially. In this article, we show how we mapped string searching efficiently on the Cell. We present two implementations: • The fast implementation supports a small dictionary size (approximately 100 patterns) and provides a throughput of 40 Gbps, which is 100 times faster than reference implementations on x86 architectures. • The heavy-duty implementation is slower (3.3-4.3 Gbps), but supports dictionaries with tens of thousands of strings.
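The abstract does not say which matching algorithm was mapped onto the Cell. As an assumed illustration of dictionary-based string searching, the sketch below uses the classic Aho-Corasick automaton, a common choice when many patterns must be matched in one pass over the input. It is plain sequential Python and makes no attempt to reflect Cell-specific optimizations; the pattern list and input text are made up.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick goto, failure and output tables for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognized on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every dictionary match in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
matches = search("ushers", automaton)
print(sorted(matches))
```

The input is scanned once regardless of dictionary size; the tables grow with the dictionary, which is one plausible reason a small dictionary and a large one would call for different implementations.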

Scarpazza, Daniele P.; Villa, Oreste; Petrini, Fabrizio

2008-04-01

189

Low-power, real-time digital video stabilization using the HyperX parallel processor  

NASA Astrophysics Data System (ADS)

Coherent Logix has implemented a digital video stabilization algorithm for use in soldier systems and small unmanned air / ground vehicles that focuses on significantly reducing the size, weight, and power as compared to current implementations. The stabilization application was implemented on the HyperX architecture using a dataflow programming methodology and the ANSI C programming language. The initial implementation is capable of stabilizing an 800 x 600, 30 fps, full color video stream with a 53 ms frame latency using a single 100 DSP core HyperX hx3100™ processor running at less than 3 W power draw. By comparison, an Intel Core2 Duo processor running the same base algorithm on a 320 x 240, 15 fps stream consumes on the order of 18 W. The HyperX implementation is an overall 100x improvement in performance (processing bandwidth increase times power improvement) over the GPP based platform. In addition, the implementation only requires a minimal number of components to interface directly to the imaging sensor and helmet mounted display, or the same computing architecture can be used to generate software defined radio waveforms for communications links. In this application, the global motion due to the camera is measured using a feature based algorithm (11 x 11 Difference of Gaussian filter and Features from Accelerated Segment Test) and model fitting (Random Sample Consensus). Features are matched in consecutive frames and a control system determines the affine transform to apply to the captured frame that will remove or dampen the camera / platform motion on a frame-by-frame basis.

Hunt, Martin A.; Tong, Lin; Bindloss, Keith; Zhong, Shang; Lim, Steve; Schmid, Benjamin J.; Tidwell, J. D.; Willson, Paul D.

2011-06-01

190

Obtaining identical results with double precision global accuracy on different numbers of processors in parallel particle Monte Carlo simulations  

SciTech Connect

We describe and compare different approaches for achieving numerical reproducibility in photon Monte Carlo simulations. Reproducibility is desirable for code verification, testing, and debugging. Parallelism creates a unique problem for achieving reproducibility in Monte Carlo simulations because it changes the order in which values are summed. This is a numerical problem because double precision arithmetic is not associative. Parallel Monte Carlo, both domain replicated and decomposed simulations, will run their particles in a different order during different runs of the same simulation because of the non-reproducibility of communication between processors. In addition, runs of the same simulation using different domain decompositions will also result in particles being simulated in a different order. In [1], a way of eliminating non-associative accumulations using integer tallies was described. This approach successfully achieves reproducibility at the cost of lost accuracy by rounding double precision numbers to fewer significant digits. This integer approach, and other extended and reduced precision reproducibility techniques, are described and compared in this work. Increased precision alone is not enough to ensure reproducibility of photon Monte Carlo simulations. Non-arbitrary precision approaches require a varying degree of rounding to achieve reproducibility. For the problems investigated in this work double precision global accuracy was achievable by using 100 bits of precision or greater on all unordered sums which were subsequently rounded to double precision at the end of every time-step.
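A short sketch makes the non-associativity problem and the integer-tally idea concrete. The fixed-point scale of 2**40 and the sample contributions are assumptions chosen for illustration; they are not the parameters used in the paper or in reference [1].

```python
SCALE = 2 ** 40  # assumed fixed-point scale: tallies keep 40 fractional bits


def to_tally(x):
    """Round a double to an integer count of 2**-40 units (drops low-order bits)."""
    return round(x * SCALE)


def from_tally(t):
    """Convert an integer tally back to a double."""
    return t / SCALE


# Double precision addition is not associative: grouping changes the result.
float_left = (0.1 + 0.2) + 0.3
float_right = 0.1 + (0.2 + 0.3)
print("floating-point groupings agree:", float_left == float_right)

# The same contributions arriving in two different orders, as could happen
# when particles are processed in a different order on different runs.
contributions = [0.1, 0.2, 0.3, 1.0e8, -1.0e8, 0.7]
order_a = contributions
order_b = list(reversed(contributions))

# Integer addition is exact, so the tally cannot depend on arrival order.
int_a = sum(to_tally(v) for v in order_a)
int_b = sum(to_tally(v) for v in order_b)
print("integer tallies agree:", int_a == int_b)
print("tallied total:", from_tally(int_a))
```

The price is the rounding inside `to_tally`: anything smaller than one tally unit is lost, which corresponds to the accuracy trade-off the abstract describes for the integer approach.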

Cleveland, Mathew A., E-mail: cleveland7@llnl.gov; Brunner, Thomas A.; Gentile, Nicholas A.; Keasler, Jeffrey A.

2013-10-15

191

High-speed, automatic controller design considerations for integrating array processor, multi-microprocessor, and host computer system architectures  

NASA Technical Reports Server (NTRS)

Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, online graphics, and file management. This paper discusses five global design considerations which are useful to integrate array processor, multimicroprocessor, and host computer system architectures into versatile, high-speed controllers. Such controllers are capable of very high control throughput, and can maintain constant interaction with the nonreal-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration is briefly discussed. Although this system has been designed for use as the controller for real-time rotorcraft dynamics and control studies in a wind tunnel environment, the controller architecture can generally be applied to a wide range of automatic control applications.

Jacklin, S. A.; Leyland, J. A.; Warmbrodt, W.

1985-01-01

192

A fast adaptive convex hull algorithm on two-dimensional processor arrays with a reconfigurable BUS system  

NASA Technical Reports Server (NTRS)

A bus system that can change dynamically to suit computational needs is referred to as reconfigurable. We present a fast adaptive convex hull algorithm on a two-dimensional processor array with a reconfigurable bus system (2-D PARBS, for short). Specifically, we show that computing the convex hull of a planar set of n points takes O(log n/log m) time on a 2-D PARBS of size mn x n with 3 <= m <= n. Our result implies that the convex hull of n points in the plane can be computed in O(1) time on a 2-D PARBS of size n^1.5 x n.
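The step from the general bound to the O(1) claim follows from choosing m = sqrt(n): the array size mn x n becomes n^1.5 x n and log n / log m equals 2. The snippet below only checks this arithmetic for a few perfect-square values of n; the function name is illustrative and nothing here implements the PARBS algorithm itself.

```python
import math


def parbs_time_factor(n, m):
    """The log n / log m term from the stated O(log n / log m) bound."""
    return math.log(n) / math.log(m)


for n in (16, 256, 4096, 65536):
    m = math.isqrt(n)   # m = sqrt(n); satisfies 3 <= m <= n for these n
    rows = m * n        # array has m*n rows and n columns, i.e. n**1.5 x n
    print(n, m, rows, round(parbs_time_factor(n, m), 6))
```

Because the factor stays at 2 however large n becomes, the running time is constant for this choice of m.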

Olariu, S.; Schwing, J.; Zhang, J.

1991-01-01

193

Trajectory optimization for real-time guidance. I - Time-varying LQR on a parallel processor  

NASA Technical Reports Server (NTRS)

A key algorithmic element of a real-time trajectory optimization hardware/software implementation, the quadratic program (QP) solver element, is presented. The purpose of the effort is to make nonlinear trajectory optimization fast enough to provide real-time commands during guidance of a vehicle such as an aeromaneuvering orbiter. Many methods of nonlinear programming require the solution of a QP at each iteration. In the trajectory optimization case the QP has a special dynamic programming structure, a LQR-like structure. QP algorithm speed is increased by taking advantage of this special structure and by parallel implementation.
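The "LQR-like structure" mentioned above is what allows a backward Riccati sweep to replace a general-purpose QP solve. The sketch below shows that sweep for a scalar time-varying system; it is a sequential textbook recursion under assumed notation (a, b, q, r, q_final and the numbers are hypothetical) and is not the parallel algorithm of the paper.

```python
def tv_lqr_scalar(a, b, q, r, q_final):
    """Backward Riccati sweep for x[k+1] = a[k]*x[k] + b[k]*u[k].

    Minimizes sum_k (q[k]*x[k]**2 + r[k]*u[k]**2) + q_final*x[N]**2.
    Returns feedback gains K[k] (with u[k] = -K[k]*x[k]) and cost-to-go P[k].
    """
    n_steps = len(a)
    p = [0.0] * (n_steps + 1)
    gains = [0.0] * n_steps
    p[n_steps] = q_final
    for k in range(n_steps - 1, -1, -1):
        denom = r[k] + b[k] ** 2 * p[k + 1]
        gains[k] = a[k] * b[k] * p[k + 1] / denom
        p[k] = q[k] + a[k] ** 2 * p[k + 1] - a[k] * b[k] * p[k + 1] * gains[k]
    return gains, p


def closed_loop_cost(a, b, q, r, q_final, gains, x0):
    """Simulate the feedback law forward and accumulate the quadratic cost."""
    x, cost = x0, 0.0
    for k in range(len(a)):
        u = -gains[k] * x
        cost += q[k] * x ** 2 + r[k] * u ** 2
        x = a[k] * x + b[k] * u
    return cost + q_final * x ** 2


# Hypothetical three-step problem with time-varying coefficients.
a = [1.0, 1.1, 0.9]
b = [0.5, 0.4, 0.6]
q = [1.0, 1.0, 1.0]
r = [0.1, 0.2, 0.1]
gains, p = tv_lqr_scalar(a, b, q, r, q_final=2.0)
x0 = 3.0
cost = closed_loop_cost(a, b, q, r, 2.0, gains, x0)
print(cost, p[0] * x0 ** 2)
```

The simulated cost matches P[0]*x0**2 up to rounding, which is the usual consistency check for a Riccati recursion. The sweep costs a fixed amount of work per time step instead of treating the QP as one dense problem.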

Psiaki, Mark L.; Park, Kihong

1990-01-01

194

Experimental verification of SNR and parallel imaging improvements using composite arrays.  

PubMed

Composite MRI arrays consist of triplets where two orthogonal upright loops are placed over the same imaging area as a standard surface coil. The optimal height of the upright coils is approximately half the width for the 7 cm coils used in this work. Resistive and magnetic coupling is shown to be negligible within each coil triplet. Experimental evaluation of imaging performance was carried out on a Philips 3 T Achieva scanner using an eight-coil composite array consisting of three surface coils and five upright loops, as well as an array of eight surface coils for comparison. The composite array offers lower overall coupling than the traditional array. The sensitivities of upright coils are complementary to those of the surface coils and therefore provide SNR gains in regions where surface coil sensitivity is low, and additional spatial information for improved parallel imaging performance. Near the surface of the phantom the eight-channel surface coil array provides higher overall SNR than the composite array, but this advantage disappears beyond a depth of approximately one coil diameter, where it is typically more challenging to improve SNR. Furthermore, parallel imaging performance is better with the composite array compared with the surface coil array, especially at high accelerations and in locations deep in the phantom. Composite arrays offer an attractive means of improving imaging performance and channel density without reducing the size, and therefore the loading regime, of surface coil elements. Additional advantages of composite arrays include minimal SNR loss using root-sum-of-squares combination compared with optimal, and the ability to switch from high to low channel density by merely selecting only the surface elements, unlike surface coil arrays, which require additional hardware. PMID:25388793

Maunder, Adam; Fallone, B Gino; Daneshmand, Mojgan; De Zanche, Nicola

2015-02-01

195

Parallel algorithms for the maxima problem using an N-cube processor configuration  

E-print Network

(r) end for procedure BTNRG3(r) see section 3.4 procedure 2EXTDON(r) for i = 0 to N-1 do in parallel if (D[i, 1] = 1 and D[i, 2] = 1 and and D[i, k-2] = 1) X''(i) <- X[i, l] else X''(i) <- negative infinity end if for b = r-1 to 0 do if ib... the maxima problem. The maxima problem is defined in the following statements: Let S be a set of N d-dimensional points and let x(i, s) represent the s-th coordinate of i, where 1 <= s <= d. Let i and j be points contained in S. The point i is said...

Coffman, Sarah Wilson

1989-01-01

196

Method of up-front load balancing for local memory parallel processors  

NASA Technical Reports Server (NTRS)

In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balance load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of from sixty to seventy five percent.

Baffes, Paul Thomas (inventor)

1990-01-01

197

High performance SPAD array detectors for parallel photon timing applications  

NASA Astrophysics Data System (ADS)

Over the past few years there has been a growing interest in monolithic arrays of single photon avalanche diodes (SPAD) for spatially resolved detection of faint ultrafast optical signals. SPADs implemented in planar technologies offer the typical advantages of microelectronic devices (small size, ruggedness, low voltage, low power, etc.). Furthermore, they have inherently higher photon detection efficiency than PMTs and are able to provide, besides sensitivities down to single photons, very high acquisition speeds. Although currently available silicon devices have reached remarkable performance, further improvements are needed to meet the requirements of the most demanding time-resolved techniques: it is necessary to address issues such as electrical crosstalk between adjacent pixels, high detection efficiency in the red spectral range, large area, and low dark counting rate. Moreover, to develop arrays with a high number of pixels, it has become increasingly important to develop all the TCSPC electronics with picosecond resolution, so as to create a new family of detection systems for TCSPC applications. Recent advances in our research on single-photon time-resolved arrays are presented here.

Rech, I.; Cammi, C.; Crotti, M.; Gulinatti, A.; Maccagnani, P.; Ghioni, M.; Cova, S.

2011-10-01

198

Method of up-front load balancing for local memory parallel processors  

SciTech Connect

This patent describes a parallel processing computer system including processing units and shared memory, and containing a network having identical computations to be executed at each connection therein, and the network further having a constant aggregate computational load, a method of up-front, load balancing. It comprises: generating a first signal representative of the plurality of processing units in the network, generating first signals with each such signal representing a corresponding one of the identical computations in the network, generating in response to the first signal and the first signals a second signal representing a balance load in the network, generating second signals functionally establishing a preselected plurality of partitions of the memory, generating in response to the first and second signals, a first sequence of signals functionally dividing the computational load into a plurality of process sets, generating in response to the second signal and the first sequence of signals, a second sequence of signals functionally allocating the process sets among the memory partitions, and generating in response to the second sequence of signals, a third sequence of signals functionally merging the process sets until they are equal in number to the processing units.

Baffes, P.T.

1990-04-24

199

AN ANALOGUE SIMD FOCAL-PLANE PROCESSOR ARRAY Piotr Dudek and Peter J. Hicks  

E-print Network

feature is a mesh-connected array of analogue processing elements (APEs). Each APE, associated on analogue samples of data, yet the APEs work in a software-programmable SIMD fashion. They execute a sequence of instructions issued by an external digital controller. The APEs support a fairly conventional

Dudek, Piotr

200

A Visual Environment for Designing and Simulating Execution of Processor Arrays  

Microsoft Academic Search

NOVIS, a visual environment which supports the interactive development and animated simulation of special-purpose parallel architectures, is presented. NOVIS lets users design networks at an abstract level by placing processing elements into a connected grid of arbitrary (user-selected) shape. The environment's underlying philosophy of maximal information hiding makes intimate familiarity on the part of the user with the details of

Charles D. Norton; Ephraim P. Glinert

1990-01-01

201

Massively parallel computation of lattice associative memory classifiers on multicore processors  

NASA Astrophysics Data System (ADS)

Over the past quarter century, concepts and theory derived from neural networks (NNs) have featured prominently in the literature of pattern recognition. Implementationally, classical NNs based on the linear inner product can present performance challenges due to the use of multiplication operations. In contrast, NNs having nonlinear kernels based on Lattice Associative Memories (LAM) theory tend to concentrate primarily on addition and maximum/minimum operations. More generally, the emergence of LAM-based NNs, with their superior information storage capacity, fast convergence and training due to relatively lower computational cost, as well as noise-tolerant classification has extended the capabilities of neural networks far beyond the limited applications potential of classical NNs. This paper explores theory and algorithmic approaches for the efficient computation of LAM-based neural networks, in particular lattice neural nets and dendritic lattice associative memories. Of particular interest are massively parallel architectures such as multicore CPUs and graphics processing units (GPUs). Originally developed for video gaming applications, GPUs hold the promise of high computational throughput without compromising numerical accuracy. Unfortunately, currently-available GPU architectures tend to have idiosyncratic memory hierarchies that can produce unacceptably high data movement latencies for relatively simple operations, unless careful design of theory and algorithms is employed. Advantageously, some GPUs (e.g., the Nvidia Fermi GPU) are optimized for efficient streaming computation (e.g., concurrent multiply and add operations). As a result, the linear or nonlinear inner product structures of NNs are inherently suited to multicore GPU computational capabilities. 
In this paper, the authors' recent research in lattice associative memories and their implementation on multicores is overviewed, with results that show utility for a wide variety of pattern classification applications using classical NNs or lattice-based NNs. Dataflow diagrams are presented in terms of a parameterized model of data burden and LAM partitioning.
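The remark that lattice associative memories rely on addition and maximum/minimum operations instead of multiplication can be shown in a few lines. The sketch below builds the standard auto-associative min-memory and recalls stored patterns with a max-plus product; the function names and the integer patterns are illustrative assumptions, and no GPU or multicore mapping is attempted.

```python
def build_memory(patterns):
    """Auto-associative lattice memory: W[i][j] = min over patterns of x[i] - x[j]."""
    n = len(patterns[0])
    return [[min(x[i] - x[j] for x in patterns) for j in range(n)]
            for i in range(n)]


def recall(w, x):
    """Max-plus product: y[i] = max over j of (W[i][j] + x[j])."""
    return [max(w_ij + x_j for w_ij, x_j in zip(row, x)) for row in w]


patterns = [[1, 4, 2, 7], [3, 0, 5, 2], [6, 6, 1, 3]]  # hypothetical stored patterns
w = build_memory(patterns)
recalled = [recall(w, x) for x in patterns]
for x, y in zip(patterns, recalled):
    print(x, "->", y)
```

Each stored pattern is recalled exactly, because W[i][j] + x[j] never exceeds x[i] and the diagonal entry W[i][i] = 0 attains it. Every output component is an independent max-reduction over sums, which is the kind of regular, multiplication-free work that maps naturally onto parallel hardware.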

Ritter, Gerhard X.; Schmalz, Mark S.; Hayden, Eric T.

2011-09-01

202

Experimental demonstration of a broadband adaptive processor for phased-array antennas  

NASA Astrophysics Data System (ADS)

This paper presents experimental results demonstrating adaptive beam forming and jammer nulling for phased-array antenna applications using the Broadband Efficient Adaptive Method for True-time-delay Array Processing (BEAMTAP) algorithm. The BEAMTAP algorithm has the advantage of mapping efficiently into an opto-electronic architecture that minimizes the required number of tapped-delay lines and simultaneously allows for the signals to be processed coherently, assuming that phase stabilization has been achieved. The architecture also utilizes a unique polarization-angle, read-write multiplexing system that allows for 45 dB of total jammer suppression at the output. Successful narrowband and broadband adaptive beam forming and jammer nulling results are provided in the worst-case scenario of co-site interference where both the jamming signal's angle of incidence and spectral content overlap with that of the signal of interest.

Kriehn, Gregory R.; Wagner, Kelvin

2004-10-01

203

Achieving supercomputer performance for neural net simulation with an array of digital signal processors  

SciTech Connect

Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.

Muller, U.A.; Baumle, B.; Kohler, P.; Gunzinger, A.; Guggenbuhl, W. [Swiss Federal Inst. of Technology, Zurich (Switzerland)]

1992-10-01

204

Computers: Massively parallel processors. (Latest citations from the INSPEC: Information Services for the Physics and Engineering Communities data base). Published Search  

SciTech Connect

The bibliography contains citations concerning a concept in computers called Massively Parallel Processing. The processing power of a computer may be increased by using numerous processors in parallel and feeding data through a number of different computational paths at the same time. The citations explore these computers and their practical uses, and include case studies, specific problems solved, theory, and future possibilities and needs. Applications of neural network modeling, pattern recognition, image processing, local area routing, and genetic sequence comparison are discussed. (Contains 250 citations and includes a subject term index and title list.)

Not Available

1992-08-01

205

Parallel Solutions for Dynamic Focussing of Large Acoustic Arrays Karen P. Watkins  

E-print Network

Parallel Solutions for Dynamic Focussing of Large Acoustic Arrays Karen P. Watkins CS 392C: Methods measurements, ultrasonic medical imaging, and underwater acoustics [1]. 1.1 Interpolation Beamforming ``in focus'', much like focussing a camera. To achieve highest quality images, the coefficients must

Browne, James C.

206

Ultrafast laser parallel microprocessing using high uniformity binary Dammann grating generated beam array  

NASA Astrophysics Data System (ADS)

Ultrafast laser parallel processing using diffractive multi-beam patterns generated by a spatial light modulator (SLM) has demonstrated a great increase in processing throughput and efficiency. Applications ranging from surface thin film patterning to internal 3D refractive index modification have been recently reported with the parallel processing technology. Periodic and symmetrical geometry design (e.g. N × M beam array) of the multi-beam pattern must be avoided to guarantee the required high uniformity in these applications, which, however, limits the processing flexibility. In this paper, Dammann gratings are used to create diffractive 1 × 5 and 5 × 5 beam arrays for the parallel processing. The 0-th order, observed to be slightly stronger than the other higher orders, can be adjusted by superimposing a Fresnel zone lens (FZL) and tuning the degree of defocusing at the processing plane. The uniformity (represented by the variation of the machined hole diameter) is measured to be <4% after the adjustment. Additionally, parallel surface patterning of indium tin oxide (ITO) thin film with periodic array structures was demonstrated using the Dammann grating generated beam array without requiring the complicated geometry separation and the time-consuming positioning.

Kuang, Zheng; Perrie, Walter; Liu, Dun; Edwardson, Stuart P.; Jiang, Yao; Fearon, Eamonn; Watkins, Ken G.; Dearden, Geoff

2013-05-01

207

High-performance FFT implementation on the BOPS ManArray parallel DSP  

E-print Network

High-performance FFT implementation on the BOPS ManArray parallel DSP Nikos P. pjjab and Gerald University, Durham, NC 27708 ABSTRACT We present a high performance implementation of the FFT algorithm to an FFT algorithm we use a factorization of the DFT matrix in Kronecker products, permutation and diagonal

Pitsianis, Nikos P.

208

Parallel Beam Approximation for Calculation of Detection Efficiency of Crystals in PET Detector Arrays.  

PubMed

In this work we propose a parallel beam approximation for the computation of the detection efficiency of crystals in a PET detector array. In this approximation the detection efficiency of a crystal is estimated using the distance between source and the crystal and the pre-calculated detection cross section of the crystal in a crystal array which is calculated for a uniform parallel beam of gammas. The pre-calculated detection cross sections for a few representative incident angles and gamma energies can be used to create a look-up table to be used in simulation studies or practical implementation of scatter or random correction algorithms. Utilizing the symmetries of the square crystal array, the pre-calculated look-up tables can be relatively small. The detection cross sections can be measured experimentally, calculated analytically or simulated using a Monte Carlo (MC) approach. In this work we used a MC simulation that takes into account the energy windowing, Compton scattering and factors in the "block effect". The parallel beam approximation was validated by a separate MC simulation using point sources located at different positions around a crystal array. Experimentally measured detection efficiencies were compared with Monte Carlo simulated detection efficiencies. Results suggest that the parallel beam approximation provides an efficient and accurate way to compute the crystal detection efficiency, which can be used for estimation of random and scatter coincidences for PET data corrections. PMID:25400292
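A minimal sketch of how such a look-up table might be used follows. It assumes an isotropic point source, so that the fraction of emitted gammas detected is roughly the pre-calculated cross section divided by the area of a sphere of radius equal to the source-to-crystal distance. The table values, names, and the single-angle parameterization are placeholders for illustration and are not taken from the paper.

```python
import math
import numpy as np

# Hypothetical look-up table for one gamma energy: incident angle (degrees)
# versus detection cross section (mm^2). Placeholder values, not measured data.
TABLE_ANGLES_DEG = np.array([0.0, 15.0, 30.0, 45.0])
TABLE_SIGMA_MM2 = np.array([10.0, 9.0, 7.5, 6.0])

def detection_efficiency(distance_mm, angle_deg):
    """Estimate efficiency as sigma(angle) / (4 * pi * d^2).

    The angle is folded with abs() as a stand-in for the array symmetries;
    the cross section is linearly interpolated between table entries.
    """
    sigma = np.interp(abs(angle_deg), TABLE_ANGLES_DEG, TABLE_SIGMA_MM2)
    return float(sigma / (4.0 * math.pi * distance_mm ** 2))

eff_near = detection_efficiency(100.0, 0.0)
eff_far = detection_efficiency(200.0, 0.0)
```

In this sketch, doubling the distance quarters the estimated efficiency, and all of the crystal-specific physics stays in the table.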

Komarov, Sergey; Song, Tae Yong; Wu, Heyu; Tai, Yuan-Chuan

2011-10-01

209

A design space evaluation of grid processor architectures  

Microsoft Academic Search

In this paper, we survey the design space of a new class of architectures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes. A GPA consists of an array of ALUs, each

Ramadass Nagarajan; Karthikeyan Sankaralingam; Doug Burger; Stephen W. Keckler

2001-01-01

210

Efficient Algorithms for Parallel Excitation and Parallel Imaging with Large Arrays  

E-print Network

in reconstructions. 2.3 Parallel Excitation The field strength of current clinical scanners is advancing to 3 Tesla or even 7 Tesla, which can tremendously improve the imaging quality. However, many high field related problems remain unsolved, for example...

Feng, Shuo

2013-08-12

211

Multithreading and Parallel Microprocessors  

E-print Network

Multithreading and Parallel Microprocessors Stephen Jenks Electrical Engineering and Computer Scalable Parallel and Distributed Systems Lab 4 Outline Parallelism in Microprocessors Multicore Processor Parallelism Parallel Programming for Shared Memory OpenMP POSIX Threads Java Threads Parallel Microprocessor

Shinozuka, Masanobu

212

Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit  

SciTech Connect

Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.

Daily, Jeffrey A.; Lewis, Robert R.

2011-11-30

213

Sequence information signal processor  

NASA Technical Reports Server (NTRS)

An electronic circuit is used to compare two sequences, such as genetic sequences, to determine which alignment of the sequences produces the greatest similarity. The circuit includes a linear array of series-connected processors, each of which stores a single element from one of the sequences and compares that element with each successive element in the other sequence. For each comparison, the processor generates a scoring parameter that indicates which segment ending at those two elements produces the greatest degree of similarity between the sequences. The processor uses the scoring parameter to generate a similar scoring parameter for a comparison between the stored element and the next successive element from the other sequence. The processor also delivers the scoring parameter to the next processor in the array for use in generating a similar scoring parameter for another pair of elements. The electronic circuit determines which processor and alignment of the sequences produce the scoring parameter with the highest value.
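The scoring scheme described above resembles a local-alignment dynamic program, which can be sketched serially as follows. The abstract does not state the scoring rules, so the Smith-Waterman-style recurrence and the match, mismatch, and gap values here are assumptions, and the function name is hypothetical.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b.

    Each character of `a` plays the role of the element stored in one
    processor; the characters of `b` stream past it one at a time.
    """
    prev = [0] * (len(b) + 1)  # scores produced by the previous "processor"
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A score never drops below zero, so a segment can restart anywhere.
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best

identical = local_alignment_score("ACGT", "ACGT")
unrelated = local_alignment_score("AAAA", "TTTT")
embedded = local_alignment_score("GGACGTCC", "TTACGTAA")
```

In a linear processor array, the outer loop would be spread across processors, with each one passing its scores to its neighbor as the second sequence streams through.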

Peterson, John C. (Inventor); Chow, Edward T. (Inventor); Waterman, Michael S. (Inventor); Hunkapillar, Timothy J. (Inventor)

1999-01-01

214

Parallel programming models for a multi-processor SoC platform applied to high-speed traffic management  

Microsoft Academic Search

In this paper, we describe the MultiFlex multi-processor SoC programming environment, with focus on two programming models: a distributed system object component (DSOC) message passing model, and a symmetrical multi-processing (SMP) model using shared memory. The MultiFlex tools map these models onto the StepNP multi-processor SoC platform, while making use of hardware accelerators for message passing and task scheduling. We

Pierre G. Paulin; Chuck Pilkington; Michel Langevin; Essaid Bensoudane; Gabriela Nicolescu

2004-01-01

215

A compact gamma camera with scintillation array and parallel-hole collimator  

Microsoft Academic Search

A new compact gamma camera for small object imaging has been developed. It consists of a pixelized NaI(Tl) scintillator array coupled to a position sensitive photomultiplier tube (Hamamatsu R2486) with a parallel-hole lead collimator. The compact camera has better spatial resolution than an Anger camera. The average value of intrinsic spatial resolutions is 2.3 mm (FWHM). The overall spatial resolution (FWHM)

Jie ZHU; Hongguang MA; Wenyan MA; Hui ZENG; Zhaomin WANG; Zizhong XU

2008-01-01

216

Simulation of three-dimensional laminar flow and heat transfer in an array of parallel microchannels  

E-print Network

SIMULATION OF THREE-DIMENSIONAL LAMINAR FLOW AND HEAT TRANSFER IN AN ARRAY OF PARALLEL MICROCHANNELS A Thesis by JUSTIN DALE MLCAK Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Approved by: Chair of Committee, N.K. Anand Committee Members, J.C. Han...

Mlcak, Justin Dale

2009-05-15

217

Large scale parallel computing simulations of wire array Z-pinches  

NASA Astrophysics Data System (ADS)

Until recently simulations of wire array Z-pinches have been undertaken in a piece-wise fashion, modelling either only part of the array volume, or modelling different aspects of the array behaviour separately. Recent simulations of a single wire in the array suggest that the short wavelength modulations of the ablating plasma observed in experiments are the result of a modified m=0 like instability. In order to simulate the growth of magneto-Rayleigh-Taylor instabilities during the implosion phase, a separate calculation is usually performed in which estimates for the structure of the modulated ablation are used to provide the initial seed perturbation for the implosion. Improvements to the parallel computing architecture of the Gorgon 3D resistive MHD code, however, mean that it is now possible to run with large enough computational grids to encompass the entire volume of the array whilst retaining sufficient resolution to model the spontaneous development of the modulated ablation structure from microscopic noise. Thus we can model the evolution of the wire array from the point of initial plasma formation, right through the implosion, without imposing any predetermined perturbation or structure. A detailed comparison of synthetic diagnostic images with data from MAGPIE experiments is used to test this method. Preliminary data from similar simulations of Z experiments are also presented.

Chittenden, Jeremy; Niasse, Nicolas; Ciardi, Andrea

2008-11-01

218

Microplate-compatible biamperometry array for parallel 48-channel amperometric or coulometric measurements.  

PubMed

We report a new reusable electrochemical array for parallel biamperometric measurements that has been designed for use with standard microplates. The 48-channel array uses half of the available 96 wells and has 48 pairs of Pt wire electrodes. Applications to the quantitation of a variety of oxidizable species, including acetaminophen, ascorbic acid, hydroquinone, trolox, and uric acid, are demonstrated in assays that use potassium ferricyanide as an oxidant to produce a mixture of ferri- and ferrocyanide. Hydrogen peroxide quantitation is also demonstrated, based on an assay in which ferrocyanide is oxidized, again to produce a mixture of ferri- and ferrocyanide. Detection limits (signal-to-noise ratio (S/N) = 3) in these assays range from 1 µM (acetaminophen, R² = 0.994) to 8 µM (ascorbic acid, R² = 0.967), and linearity was observed to analyte concentrations of at least 100 µM. We also demonstrate the application of the biamperometric array to enzymatic assays, using the glucose oxidase reaction as an example; following a 20 min enzyme reaction time, a detection limit of 0.1 mM glucose was obtained. These results indicate that applications to other oxidase-based assays are feasible in this high-throughput format. The new electrochemical array employs standard, inexpensive microplates, and the biamperometric measurements are simple, precise, and rapid, requiring only 2 min for 48 parallel measurements. PMID:18341302

Mann, Thomas S; O'Hagan, Liam; Ertl, Peter; Sparkes, Douglas I; Mikkelsen, Susan R

2008-04-15

219

Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures  

NASA Astrophysics Data System (ADS)

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.

Olson, Richard F.

2013-05-01

220

Image fiber optic space-CDMA parallel transmission experiment using 8 x 8 VCSEL/PD arrays.  

PubMed

We experimentally demonstrate space-code-division multiple access (space-CDMA) based two-dimensional (2-D) parallel optical interconnections by using image fibers and 8 x 8 vertical-cavity surface-emitting laser (VCSEL)/photo diode (PD) arrays. Two spatially encoded four-bit (2 x 2) parallel optical signals were emitted from 2-D VCSEL arrays and transmitted through image fibers. The encoded signals were multiplexed by an image-fiber coupler and detected by a 2-D PD array on the receiver side. The receiver recovered the intended parallel signal by decoding the signal. The transmission speed was 64 Mbps/ch (total throughput: 512 Mbps). Bit-error-rate (BER) measurement with a laterally misaligned PD array showed the array had a misalignment tolerance of 25 µm for a BER performance of 10⁻⁹. PMID:12440546

Nakamura, Moriya; Kitayama, Ken-ichi; Igasaki, Yasunori; Shamoto, Naoki; Kaneda, Keiji

2002-11-10

221

Compiler for an array and vector processing language  

SciTech Connect

A compiler for a Pascal-based language Actus is described. The language is suitable for the expression of the type of parallelism offered by both array and vector processors. The implementation described is for the Cray-1 computer. An objective of the implementation has been to construct an optimizing compiler which can be readily adapted for a range of array and vector processors. As a result the machine-dependent sections of the compiler have been clearly identified. 9 references.

Perrott, R.H.; Crookes, D.; Milligan, P.; Purdy, W.R.M.

1985-05-01

222

Parallel and series fed microstrip array with high efficiency and low cross polarization  

NASA Technical Reports Server (NTRS)

A microstrip array antenna for vertically polarized fan beam (approximately 2 deg x 50 deg) for C-band SAR applications with a physical area of 1.7 m by 0.17 m comprises two rows of patch elements and employs a parallel feed to left- and right-half sections of the rows. Each section is divided into two segments that are fed in parallel with the elements in each segment fed in series through matched transmission lines for high efficiency. The inboard section has half the number of patch elements of the outboard section, and the outboard sections, which have tapered distribution with identical transmission line sections, terminated with half wavelength long open-circuit stubs so that the remaining energy is reflected and radiated in phase. The elements of the two inboard segments of the two left- and right-half sections are provided with tapered transmission lines from element to element for uniform power distribution over the central third of the entire array antenna. The two rows of array elements are excited at opposite patch feed locations with opposite (180 deg difference) phases for reduced cross-polarization.

Huang, John (inventor)

1995-01-01

223

High-performance SPAD array detectors for parallel photon timing applications  

NASA Astrophysics Data System (ADS)

Over the past few years there has been a growing interest in monolithic arrays of single photon avalanche diodes (SPAD) for spatially resolved detection of faint ultrafast optical signals. SPADs implemented in planar technologies offer the typical advantages of microelectronic devices (small size, ruggedness, low voltage, low power, etc.). Furthermore, they have inherently higher photon detection efficiency than PMTs and are able to provide, besides sensitivities down to single photons, very high acquisition speeds. To make SPAD arrays increasingly competitive in time-resolved applications, it is necessary to address problems such as electrical crosstalk between adjacent pixels; moreover, all the single-photon timing electronics with picosecond resolution have to be developed. In this paper we present a new instrument suitable for single-photon imaging applications and made up of 32 time-resolved parallel channels. The 32x1 pixel array that includes SPAD detectors represents the system core, and an embedded data elaboration unit performs on-board data processing for single-photon counting applications. Photon-timing information is exported through a custom parallel cable that can be connected to an external multichannel TCSPC system.

Rech, I.; Cuccato, A.; Antonioli, S.; Cammi, C.; Gulinatti, A.; Ghioni, M.

2012-02-01

224

High Density Single-Molecule-Bead Arrays for Parallel Single Molecule Force Spectroscopy  

PubMed Central

The assembly of a highly-parallel force spectroscopy tool requires careful placement of single-molecule targets on the substrate and the deliberate manipulation of a multitude of force probes. Since the probe must approach the target biomolecule for covalent attachment, while avoiding irreversible adhesion to the substrate, the use of the polymer microsphere as force probes to create the tethered bead array poses a problem. Therefore, the interactions between the force probe and the surface must be repulsive at very short distances (< 5 nm) and attractive at long distances. To achieve this balance, the chemistry of the substrate, force probe, and solution must be tailored to control the probe-surface interactions. In addition to an appropriately designed chemistry, it is necessary to control the surface density of the target molecule in order to ensure that only one molecule is interrogated by a single force probe. We used gold-thiol chemistry to control both the substrate’s surface chemistry and the spacing of the studied molecules, through a competitive binding of the thiol-terminated DNA and an inert thiol forming a blocking layer. For our single molecule array, we modeled the forces between the probe and the substrate using DLVO theory and measured their magnitude and direction with colloidal probe microscopy. The practicality of each system was tested using a probe binding assay to evaluate the proportion of the beads remaining adhered to the surface after application of force. We have translated the results specific for our system to general guiding principles for preparation of tethered bead arrays and demonstrated the ability of this system to produce a high yield of active force spectroscopy probes in a microwell substrate. This study outlines the characteristics of the chemistry needed to create such a force spectroscopy array. PMID:22548234

Barrett, Michael J.; Oliver, Piercen M.; Cheng, Peng; Cetin, Deniz; Vezenov, Dmitri

2012-01-01

225

A parallel implementation of MP3 decoding algorithm on Reconfigurable Computing systems  

Microsoft Academic Search

This paper describes a reconfigurable computing system, which consists of a general-purpose ARM processor and reconfigurable cells array (RCA). A novel mapping mechanism which makes data-parallelism instructions operate on RCA has been proposed to map and implement MP3 audio decoding algorithm containing intrinsic data-parallelism operations. The communication interface between ARM processor and RCA is implemented efficiently using the standard ARM

Chongyong Yin; Shouyi Yin; Shaojun Wei

2008-01-01

226

Real-time processor for staring receivers  

NASA Technical Reports Server (NTRS)

The design, fabrication, and testing of a state-of-the-art, high-throughput on-focal plane IR-image signal processor is described. The processing functions performed are frame differencing and thresholding. The final focal plane array will consist of a 128 x 128-pixel platinum-silicide detector bump-mounted to an on-chip CCD multiplexer. The processor is in a 128-channel parallel-pipeline format. Each channel consists of a pixel regenerator (charge differencer), 128-pixel frame store CCD memory, pixel differencer, second pixel regenerator, thresholder (analog comparator), and digital latch. Four parallel analog outputs and four parallel digital outputs are included. The digital outputs provide a bit map of the image. All analog clock signals (128 KHz, 256 KHz, and 5 MHz) are generated by on-chip TTL-input clock drivers. TTL clock driver inputs are generated off-chip. The technology is low-temperature surface and buried channel CCD/CMOS/indium bump. The design goal was 8-bit resolution at 77 K and 1000 frames/s. Applications include point- or extended-target motion detection with thresholding. Design trade-offs and enhancements (such as on-chip detector gain compensation and a simple window processor) are discussed.
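The two processing functions named above, frame differencing and thresholding, can be sketched in software as follows. This is only an illustration of the operations, not the analog CCD implementation: the use of an absolute difference, the integer pixel type, and the function name are assumptions.

```python
import numpy as np

def motion_bitmap(prev_frame, frame, threshold):
    """Return a boolean bit map marking pixels whose change exceeds threshold.

    Frames are widened to a signed type first so that unsigned subtraction
    cannot wrap around.
    """
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    return diff > threshold

# Tiny hypothetical 2 x 2 frames standing in for the 128 x 128 focal plane.
prev_frame = np.array([[10, 10], [10, 10]], dtype=np.uint8)
frame = np.array([[10, 40], [0, 12]], dtype=np.uint8)
bitmap = motion_bitmap(prev_frame, frame, threshold=8)
```

Each pixel is processed independently of its neighbors, which is why the operation suits a channel-per-column parallel pipeline.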

Hanzal, Brian; Peczalski, Andrzej; Schwanebeck, James; Sanderson, Richard; Fossum, Eric

1992-01-01

227

The Imagine Stream Processor  

Microsoft Academic Search

The Imagine Stream Processor is a single-chip programmable media processor with 48 parallel ALUs. At 400 MHz, this translates to a peak arithmetic rate of 16 GFLOPS on single-precision data and 32 GOPS on 16-bit fixed-point data. The scalability of Imagine's programming model and architecture enable it to achieve such high arithmetic rates. Imagine executes applications that

Ujval J. Kapasi; William J. Dally; Brucek Khailany; John D. Owens; Scott Rixner

2002-01-01

228

Computation and parallel implementation for early vision  

NASA Technical Reports Server (NTRS)

The problem of early vision is to transform one or more retinal illuminance images (pixel arrays) to image representations built out of such primitive visual features as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation on SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.

Gualtieri, J. Anthony

1990-01-01

229

SNAP: Parallel Processing Applied to AI  

Microsoft Academic Search

It is argued that a viable solution for building future intelligent systems is to design special-purpose parallel computer architectures. The applications are restricted to those using semantic networks for knowledge representation. Reasoning on these networks is achieved with a marker-passing model of processing. The Semantic Network Array Processor (SNAP), a marker-passing parallel computer dedicated for natural-language and other knowledge-processing applications,

Dan I. Moldovan; Wing Lee; Changhwa Lin; Minhwa Chung

1992-01-01

230

A parallel hybrid merge-select sorting scheme for K-best LSD MIMO decoder on a dynamically reconfigurable processor  

E-print Network

A parallel hybrid merge-select sorting scheme for K-best LSD MIMO decoder on a dynamically detection (LSD) multi-input multi-output (MIMO) decoder based on a recently developed novel Reconfigurable and mapped onto our proposed platform. We discuss the targeted K-best LSD algorithm as well as the sorting

Arslan, Tughrul

231

Modeling of the phase lag causing fluidelastic instability in a parallel triangular tube array  

NASA Astrophysics Data System (ADS)

Fluidelastic instability is considered a critical flow induced vibration mechanism in tube and shell heat exchangers. It is believed that a finite time lag between tube vibration and fluid response is essential to predict the phenomenon. However, the physical nature of this time lag is not fully understood. This paper presents a fundamental study of this time delay using a parallel triangular tube array with a pitch ratio of 1.54. A computational fluid dynamics (CFD) model was developed and validated experimentally in an attempt to investigate the interaction between tube vibrations and flow perturbations at lower reduced velocities Ur=1-6 and Reynolds numbers Re=2000-12 000. The numerical predictions of the phase lag are in reasonable agreement with the experimental measurements for the range of reduced velocities Ug/fd=6-7. It was found that there are two propagation mechanisms; the first is associated with the acoustic wave propagation at low reduced velocities, Ur<2, and the second mechanism for higher reduced velocities is associated with the vorticity shedding and convection. An empirical model of the two mechanisms is developed and the phase lag predictions are in reasonable agreement with the experimental and numerical measurements. The developed phase lag model is then coupled with the semi-analytical model of Lever and Weaver to predict the fluidelastic stability threshold. Improved predictions of the stability boundaries for the parallel triangular array were achieved. In addition, the present study has explained why fluidelastic instability does not occur below some threshold reduced velocity.

Khalifa, Ahmed; Weaver, David; Ziada, Samir

2013-11-01

232

A fast parallel imaging rotary phased array head coil with improved sensitivity profile deep in the center of the brain.  

PubMed

A new class of a receive-only 2T 4-element rotary phased array head coil has been proposed for MRI brain imaging applications. Coil elements of the rotary phased array head coil have "paddle-like" structures consisting of a pair of main conductors located on opposite sides, inserted equi-angularly around and over the head. Using such a unique design, the proposed rotary head coil can improve the sensitivity deep at the centre of the brain and produces highly homogeneous brain images. The rotary phased array head coil is numerically modeled using a hybrid MoM/FEM method and a prototype was constructed accordingly. In vivo MR brain imaging using the prototype rotary phased array head coil has been undertaken and the acquired brain images show high homogeneity as anticipated. In addition, parallel imaging, VD-GRAPPA, is used in conjunction with the rotary phased array head coil to enable rapid imaging. PMID:18001999

Weber, Ewald; Li, Bing Keong; Liu, Feng; Crozier, Stuart

2007-01-01

233

Analysis of radiation by linear arrays of parallel horizontal wire antennas over imperfect ground (reflection coefficient method)  

Microsoft Academic Search

This is a program suitable for analysis problems involving lossless parallel horizontal thin wires located over an imperfectly conducting horizontal ground plane. The program is equipped to treat arbitrarily spaced arrays of wires that can be of different lengths and radii. However, it is assumed the wires are all centerfed and unloaded

T. Sarkar

1976-01-01

234

12-channel parallel optical-fiber transmission using a low-drive current 1.3-μm LED array and a p-i-n PD array  

Microsoft Academic Search

Twelve-channel 14-Mb/s/channel 1-km parallel optical-fiber transmission using a 1×12 low-drive-current 1.3-μm light-emitting diode (LED) linear array and an InGaAs p-i-n photodiode linear array, with the LED drive current as low as 12 mAp-p/channel, is discussed. No receiver sensitivity degradation has been observed under simultaneous 12-channel operation. The skew was less than 6 ns after transmission through a 1-km-long 12-channel optical-fiber

Kazuhisa Kaede; Toshio Uji; Takeshi Nagahori; Tetsuyuki Suzaki; Toshitaka Torikai; Junji Hayashi; Isao Watanabe; Masataka Itoh; Hiroshi Honmou; Minoru Shikada

1990-01-01

235

Self-configuration of the massively defective cellular array  

SciTech Connect

The rapid advancement in VLSI technology is making it feasible to consider the construction of a parallel computer that is comprised of a large number of processors previously considered impractical due to their complexity. One promising class of such architecture is a VLSI processor array that interconnects a very large number of simple processing cells on a single chip or wafer. When a huge number of devices are built on the large chip, however, it will be very difficult to make the chips without many defects. With a fixed interconnection pattern between processors, the whole processor array may not be usable when defects appear on the processor array. Furthermore, the architecture with a fixed interconnection pattern is limited in the range of computations that can be supported efficiently. By providing reconfiguration mechanisms, a VLSI processor array can be designed such that it can be reconfigured for fault-tolerance and specialization for various computations. This thesis studies self-configuration of cells on the massively defective cellular array, and proposes a massively fault-tolerant cellular array that is an array of identical cells with connections only to immediate neighbors, where the cells and the connections may be defective with high probabilities. The cell can function as a processing element, as a memory, or as a switching element that connects to other cells.

Lee, M.S.

1986-01-01

236

A dynamically reconfigurable asynchronous processor  

Microsoft Academic Search

The main design requirements for high-throughput mobile applications are energy efficiency and programmability. This paper presents a novel dynamically reconfigurable processor that targets these requirements. Our processor consists of a heterogeneous array of coarse grain asynchronous cells. The architecture maintains most of the benefits of custom asynchronous design, while also providing programmability via conventional high-level languages. Results show that our

K. A. Fawaz; T. Arslan; S. Khawam; M. Muir; I. Nousias; I. Lindsay; A. Erdogan

2010-01-01

237

Efficient schemes for parallel communication  

Microsoft Academic Search

A fundamental problem in the theory of parallel computation is to find an efficient interconnection pattern between N processors that minimizes the number of lines entering or leaving each processor while enabling fast communication between the processors. A family of Balanced communication schemes for connecting N processors with only a constant number of lines entering or leaving each processor is

Eli Upfal

1982-01-01

238

Peripheral processors for high-speed simulation. [helicopter cockpit simulator  

NASA Technical Reports Server (NTRS)

This paper describes some of the results of a study directed to the specification and procurement of a new cockpit simulator for an advanced class of helicopters. A part of the study was the definition of a challenging benchmark problem, and detailed analyses of it were made to assess the suitability of a variety of simulation techniques. The analyses showed that a particularly cost-effective approach to the attainment of adequate speed for this extremely demanding application is to employ a large minicomputer acting as host and controller for a special-purpose digital peripheral processor. Various realizations of such peripheral processors, all employing state-of-the-art electronic circuitry and a high degree of parallelism and pipelining, are available or under development. The types of peripheral processors - array processors, simulation-oriented processors, and arrays of processing elements - are analyzed and compared. They are particularly promising approaches which should be suitable for high-speed simulations of all kinds, the cockpit simulator being a case in point.

Karplus, W. J.

1977-01-01

239

Voxel based parallel post processor for void nucleation and growth analysis of atomistic simulations of material fracture.  

PubMed

Molecular dynamics (MD) simulations are used in the study of void nucleation and growth in crystals that are subjected to tensile deformation. These simulations are run for typically several hundred thousand time steps depending on the problem. We output the atom positions at a required frequency for post processing to determine the void nucleation, growth and coalescence due to tensile deformation. The simulation volume is broken up into voxels of size equal to the unit cell size of crystal. In this paper, we present the algorithm to identify the empty unit cells (voids), their connections (void size) and dynamic changes (growth and coalescence of voids) for MD simulations of large atomic systems (multi-million atoms). We discuss the parallel algorithms that were implemented and discuss their relative applicability in terms of their speedup and scalability. We also present the results on scalability of our algorithm when it is incorporated into MD software LAMMPS. PMID:24793054

Hemani, H; Warrier, M; Sakthivel, N; Chaturvedi, S

2014-05-01
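
The voxel-based void analysis described in the record above (binning atoms into unit-cell-sized voxels, flagging empty voxels, and grouping connected empty voxels into voids) can be sketched roughly as follows. This is a serial illustration only, not the authors' parallel implementation: the function name `find_voids`, the face-adjacency rule, and the non-periodic box are all assumptions made for the example.

```python
from collections import deque
from itertools import product


def find_voids(atoms, box, cell):
    """Return a list of voids, each a set of empty voxel indices (hypothetical helper).

    atoms: iterable of (x, y, z) positions inside the box
    box:   (Lx, Ly, Lz) simulation box lengths
    cell:  voxel edge length (the abstract uses the crystal unit cell size)
    """
    dims = tuple(max(1, int(round(length / cell))) for length in box)

    # Mark every voxel that contains at least one atom.
    occupied = set()
    for pos in atoms:
        occupied.add(tuple(min(int(p // cell), n - 1) for p, n in zip(pos, dims)))

    empty = set(product(*(range(n) for n in dims))) - occupied

    # Group face-adjacent empty voxels into connected voids (breadth-first search).
    # Assumption: no periodic boundaries, so voxels on opposite faces are not linked.
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    voids = []
    while empty:
        seed = empty.pop()
        void, queue = {seed}, deque([seed])
        while queue:
            i, j, k = queue.popleft()
            for di, dj, dk in steps:
                nb = (i + di, j + dj, k + dk)
                if nb in empty:
                    empty.remove(nb)
                    void.add(nb)
                    queue.append(nb)
        voids.append(void)
    return voids
```

Running this on successive atom-position snapshots and comparing void counts and sizes between frames would give a crude picture of nucleation, growth, and coalescence; a production version would also need periodic boundaries and domain decomposition.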

240

The Superthreaded Processor Architecture  

Microsoft Academic Search

The common single-threaded execution model limits processors to exploiting only the relatively small amount of instruction-level parallelism available in application programs. The superthreaded processor, on the other hand, is a concurrent multithreaded architecture (CMA) that can exploit the multiple granularities of parallelism available in general-purpose application programs. Unlike other CMAs that rely primarily on hardware for run-time dependence detection

Jenn-yuan Tsai; Jian Huang; Christoffer Amlo; David J. Lilja; Pen-chung Yew

1999-01-01

241

Cytophobic surface modification of microfluidic arrays for in situ parallel peptide synthesis and cell adhesion assays  

PubMed Central

A combination of PEG-based surface passivation techniques and spatially addressable SPPS (solid phase peptide synthesis) was used to demonstrate a highly specific cell-peptide adhesion assay on a microfluidic platform. The surface of a silicon-glass microchip was modified to form a mixed self-assembled monolayer that presented PEG moieties interspersed with reactive amino terminals. The PEG provided biomolecular inertness and the reactive amino groups were used for consequent peptide synthesis. The cytophobicity of the surface was characterized by on-chip fluorescent binding assays and was found to be resistant to non-specific attachment of cells and proteins. An integrated system for parallel peptide synthesis on this reactive amino surface was developed, using photogenerated acid chemistry and digital microlithography. A constant synthesis efficiency of >98% was observed for up to 7mer peptides. To demonstrate specific cell adhesion on these synthetic peptide arrays, variations of a 7mer cell binding peptide that binds to murine B lymphoma cells were synthesized. Sequence specific binding was observed on incubation with fluorescently labeled, intact murine B lymphoma cells and key residues for binding were identified by deletional analysis. PMID:17605465

Mandal, Suparna; Rouillard, Jean Marie; Srivannavit, Onnop; Gulari, Erdogan

2008-01-01

242

Acoustic insertion loss due to two dimensional periodic arrays of circular cylinders parallel to a nearby surface.  

PubMed

The acoustical performances of regular arrays of cylindrical elements, with their axes aligned and parallel to a ground plane, have been investigated through predictions and laboratory experiments. Semi-analytical predictions based on multiple scattering theory and numerical simulations based on a boundary element formulation have been made. Measurements have been made in an anechoic chamber using arrays of (a) cylindrical acoustically-rigid scatterers (PVC pipes) and (b) thin elastic shells. Insertion loss (IL) spectra due to the arrays have been measured without and with ground planes for several receiver heights. Data and predictions have been compared. The minima in the excess attenuation spectrum i.e., attenuation maxima due to the ground alone resulting from destructive interference between direct and ground-reflected sound waves, tend to have an adverse influence on the band gaps (BG) related to a periodic array in the free field when these two effects coincide. On the other hand, the presence of rigid ground may result in an IL for an array near the ground similar to or, in the case of the first BG, greater than that resulting from a double array, equivalent to the original array plus its ground plane mirror image, in the free field. PMID:22225030

Krynkin, Anton; Umnova, Olga; Sánchez-Pérez, Juan Vicente; Chong, Alvin Yung Boon; Taherzadeh, Shahram; Attenborough, Keith

2011-12-01

243

Acoustic insertion loss due to two dimensional periodic arrays of circular cylinders parallel to a nearby surface  

E-print Network

The acoustical performances of regular arrays of cylindrical elements, with their axes aligned and parallel to a ground plane, have been investigated through predictions and laboratory experiments. Semi-analytical predictions based on multiple scattering theory and numerical simulations based on a boundary element formulation have been made. Measurements have been made in an anechoic chamber using arrays of (a) cylindrical acoustically-rigid scatterers (PVC pipes) and (b) thin elastic shells. Insertion loss (IL) spectra due to the arrays have been measured without and with ground planes for several receiver heights. Data and predictions have been compared. The minima in the excess attenuation spectrum i.e., attenuation maxima due to the ground alone resulting from destructive interference between direct and ground-reflected sound waves, tend to have an adverse influence on the band gaps (BG) related to a periodic array in the free field when these two effects coincide. On the other hand, the presence of rigid ground may result in an IL for an array near the ground similar to or, in the case of the first BG, greater than that resulting from a double array, equivalent to the original array plus its ground plane mirror image, in the free field.

Anton Krynkin; Olga Umnova; Juan Vicente Sanchez-Perez; Alvin Y. B. Chong; Shahram Taherzadeh; Keith Attenborough

2012-07-03

244

Atmospheric plasma jet array in parallel electric and gas flow fields for three-dimensional surface treatment  

NASA Astrophysics Data System (ADS)

This letter reports on electrical and optical characteristics of a ten-channel atmospheric pressure glow discharge jet array in parallel electric and gas flow fields. Challenged with complex three-dimensional substrates including surgical tissue forceps and sloped plastic plate of up to 15°, the jet array is shown to achieve excellent jet-to-jet uniformity both in time and in space. Its spatial uniformity is four times better than a comparable single jet when both are used to treat a 15° sloped substrate. These benefits are likely from an effective self-adjustment mechanism among individual jets facilitated by individualized ballast and spatial redistribution of surface charges.

Cao, Z.; Walsh, J. L.; Kong, M. G.

2009-01-01

245

Parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers  

SciTech Connect

In this paper we investigate the feasibility of a massively parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers (VCSELs) to measure surface profiles of displacement, distance, velocity, and liquid flow rate. The concept of the system is demonstrated using a prototype to measure the velocity at different radial points on a rotating disk, and the velocity profile of diluted milk in a custom built diverging-converging planar flow channel. It is envisaged that a scaled up version of the parallel self-mixing imaging system will enable real-time surface profiling, vibrometry, and flowmetry.

Tucker, John R.; Baque, Johnathon L.; Lim, Yah Leng; Zvyagin, Andrei V.; Rakic, Aleksandar D

2007-09-01

246

Simultaneous multithreading: a platform for next-generation processors  

Microsoft Academic Search

Simultaneous multithreading is a processor design which consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both

Susan J. Eggers; Joel S. Emer; H. M. Levy; Jack L. Lo; Rebecca Stamm; Dean M. Tullsen

1997-01-01

247

Large-scale parallel surface functionalization of goblet-type whispering gallery mode microcavity arrays for biosensing applications.  

PubMed

A novel surface functionalization technique is presented for large-scale selective molecule deposition onto whispering gallery mode microgoblet cavities. The parallel technique allows damage-free individual functionalization of the cavities, arranged on-chip in densely packaged arrays. As the stamp pad a glass slide is utilized, bearing phospholipids with different functional head groups. Coated microcavities are characterized and demonstrated as biosensors. PMID:24990526

Bog, Uwe; Brinkmann, Falko; Kalt, Heinz; Koos, Christian; Mappes, Timo; Hirtz, Michael; Fuchs, Harald; Köber, Sebastian

2014-10-15

248

Parallel detection of harmful algae using reverse transcription polymerase chain reaction labeling coupled with membrane-based DNA array.  

PubMed

Harmful algal blooms (HABs) are a global problem that can cause economic losses to the aquaculture industry and pose a potential threat to human health. More attention must be paid to the development of effective detection methods for the causative microalgae. Traditional microscopic examination has many disadvantages: it is inefficient and inaccurate, requires specialized identification skills, and is especially ill-suited to the parallel analysis of several morphologically similar microalgae to species level at one time. This study aimed at exploring the feasibility of using a membrane-based DNA array for parallel detection of several microalgae by selecting five microalgae, including Heterosigma akashiwo, Chaetoceros debilis, Skeletonema costatum, Prorocentrum donghaiense, and Nitzschia closterium, as test species. Five species-specific (taxonomic) probes were designed from variable regions of the large subunit ribosomal DNA (LSU rDNA) by visualizing the alignment of LSU rDNA of related species. The specificity of the probes was confirmed by dot blot hybridization. The membrane-based DNA array was prepared by spotting the tailed taxonomic probes onto a positively charged nylon membrane. Digoxigenin (Dig) labeling of target molecules was performed by multiple PCR/RT-PCR using an RNA/DNA mixture of five microalgae as template. The Dig-labeled amplification products were hybridized with the membrane-based DNA array to produce a visible hybridization signal indicating the presence of target algae. Detection sensitivity comparison showed that RT-PCR labeling (RPL) coupled with hybridization was tenfold more sensitive than DNA-PCR labeling coupled with hybridization. Finally, the effectiveness of RPL coupled with membrane-based DNA array was validated by testing with simulated and natural water samples, respectively. All of these results indicated that RPL coupled with a membrane-based DNA array is specific, simple, and sensitive for parallel detection of microalgae, which shows promise for monitoring natural samples in the future. PMID:24338073

Zhang, Chunyun; Chen, Guofu; Ma, Chaoshuai; Wang, Yuanyuan; Zhang, Baoyu; Wang, Guangce

2014-03-01

249

Parallel multispot smFRET analysis using an 8-pixel SPAD array  

NASA Astrophysics Data System (ADS)

Single-molecule Förster resonance energy transfer (smFRET) is a powerful tool for extracting distance information between two fluorophores (a donor and acceptor dye) on a nanometer scale. This method is commonly used to monitor binding interactions or intra- and intermolecular conformations in biomolecules freely diffusing through a focal volume or immobilized on a surface. The diffusing geometry has the advantage to not interfere with the molecules and to give access to fast time scales. However, separating photon bursts from individual molecules requires low sample concentrations. This results in long acquisition time (several minutes to an hour) to obtain sufficient statistics. It also prevents studying dynamic phenomena happening on time scales larger than the burst duration and smaller than the acquisition time. Parallelization of acquisition overcomes this limit by increasing the acquisition rate using the same low concentrations required for individual molecule burst identification. In this work we present a new two-color smFRET approach using multispot excitation and detection. The donor excitation pattern is composed of 4 spots arranged in a linear pattern. The fluorescent emission of donor and acceptor dyes is then collected and refocused on two separate areas of a custom 8-pixel SPAD array. We report smFRET measurements performed on various DNA samples synthesized with various distances between the donor and acceptor fluorophores. We demonstrate that our approach provides identical FRET efficiency values to a conventional single-spot acquisition approach, but with a reduced acquisition time. Our work thus opens the way to high-throughput smFRET analysis on freely diffusing molecules.

Ingargiola, A.; Colyer, R. A.; Kim, D.; Panzeri, F.; Lin, R.; Gulinatti, A.; Rech, I.; Ghioni, M.; Weiss, S.; Michalet, X.

2012-02-01
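
As a rough illustration of the quantity the record above reports, the sketch below computes an uncorrected FRET efficiency (proximity ratio) from the donor and acceptor photon counts of a single burst, and converts it to a donor-acceptor distance with the standard Förster relation E = 1 / (1 + (r/R0)^6). Neither formula is quoted in the abstract; the function names, the absence of background, leakage, and detection-efficiency corrections, and the example Förster radius are assumptions made for illustration.

```python
def fret_efficiency(donor_counts, acceptor_counts):
    """Uncorrected FRET efficiency (proximity ratio) of one photon burst."""
    total = donor_counts + acceptor_counts
    if total <= 0:
        raise ValueError("burst contains no photons")
    return acceptor_counts / total


def distance_from_efficiency(efficiency, r0_nm):
    """Donor-acceptor distance in nm from E = 1 / (1 + (r / R0)**6)."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return r0_nm * (1.0 / efficiency - 1.0) ** (1.0 / 6.0)
```

In a multispot setup, bursts from every spot could be passed through the same functions and pooled into one efficiency histogram, which is where the shorter acquisition time would come from.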

250

Parallel multispot smFRET analysis using an 8-pixel SPAD array  

PubMed Central

Single-molecule Förster resonance energy transfer (smFRET) is a powerful tool for extracting distance information between two fluorophores (a donor and acceptor dye) on a nanometer scale. This method is commonly used to monitor binding interactions or intra- and intermolecular conformations in biomolecules freely diffusing through a focal volume or immobilized on a surface. The diffusing geometry has the advantage to not interfere with the molecules and to give access to fast time scales. However, separating photon bursts from individual molecules requires low sample concentrations. This results in long acquisition time (several minutes to an hour) to obtain sufficient statistics. It also prevents studying dynamic phenomena happening on time scales larger than the burst duration and smaller than the acquisition time. Parallelization of acquisition overcomes this limit by increasing the acquisition rate using the same low concentrations required for individual molecule burst identification. In this work we present a new two-color smFRET approach using multispot excitation and detection. The donor excitation pattern is composed of 4 spots arranged in a linear pattern. The fluorescent emission of donor and acceptor dyes is then collected and refocused on two separate areas of a custom 8-pixel SPAD array. We report smFRET measurements performed on various DNA samples synthesized with various distances between the donor and acceptor fluorophores. We demonstrate that our approach provides identical FRET efficiency values to a conventional single-spot acquisition approach, but with a reduced acquisition time. Our work thus opens the way to high-throughput smFRET analysis on freely diffusing molecules. PMID:24382989

Ingargiola, A.; Colyer, R. A.; Kim, D.; Panzeri, F.; Lin, R.; Gulinatti, A.; Rech, I.; Ghioni, M.; Weiss, S.; Michalet, X.

2012-01-01

251

The Milstar Advanced Processor  

NASA Astrophysics Data System (ADS)

The Milstar Advanced Processor (MAP) is a 'drop-in' replacement for its predecessor which preserves existing interfaces with other Milstar satellite processors and minimizes the impact of such upgrading to already-developed application software. In addition to flight software development, and hardware development that involves the application of VHSIC technology to the electrical design, the MAP project is developing two sophisticated and similar test environments. High density RAM and ROM are employed by the MAP memory array. Attention is given to the fine-pitch VHSIC design techniques and lead designs used, as well as the role of TQM and concurrent engineering in the development of the MAP manufacturing process.

Tjia, Khiem-Hian; Heely, Stephen D.; Morphet, John P.; Wirick, Kevin S.

252

Quadrature transmit array design using single-feed circularly polarized patch antenna for parallel transmission in MR imaging.  

PubMed

Quadrature coils are often desired in MR applications because they can improve MR sensitivity and also reduce excitation power. In this work, we propose, for the first time, a quadrature array design strategy for parallel transmission at 298 MHz using single-feed circularly polarized (CP) patch antenna technique. Each array element is a nearly square ring microstrip antenna and is fed at a point on the diagonal of the antenna to generate quadrature magnetic fields. Compared with conventional quadrature coils, the single-feed structure is much simpler and more compact, making the quadrature coil array design practical. Numerical simulations demonstrate that the decoupling between elements is better than -35 dB for all the elements and the RF fields are homogeneous with deep penetration and quadrature behavior in the area of interest. Bloch equation simulation is also performed to simulate the excitation procedure by using an 8-element quadrature planar patch array to demonstrate its feasibility in parallel transmission at the ultrahigh field of 7 Tesla. PMID:24649430

Pang, Yong; Yu, Baiying; Vigneron, Daniel B; Zhang, Xiaoliang

2014-02-01

253

Quadrature transmit array design using single-feed circularly polarized patch antenna for parallel transmission in MR imaging  

PubMed Central

Quadrature coils are often desired in MR applications because they can improve MR sensitivity and also reduce excitation power. In this work, we propose, for the first time, a quadrature array design strategy for parallel transmission at 298 MHz using single-feed circularly polarized (CP) patch antenna technique. Each array element is a nearly square ring microstrip antenna and is fed at a point on the diagonal of the antenna to generate quadrature magnetic fields. Compared with conventional quadrature coils, the single-feed structure is much simpler and more compact, making the quadrature coil array design practical. Numerical simulations demonstrate that the decoupling between elements is better than –35 dB for all the elements and the RF fields are homogeneous with deep penetration and quadrature behavior in the area of interest. Bloch equation simulation is also performed to simulate the excitation procedure by using an 8-element quadrature planar patch array to demonstrate its feasibility in parallel transmission at the ultrahigh field of 7 Tesla. PMID:24649430

Pang, Yong; Yu, Baiying; Vigneron, Daniel B.

2014-01-01

254

Adaptive Parallelism and Piranha  

Microsoft Academic Search

Under "adaptive parallelism," the set of processors executing a parallel program may grow or shrink as the program runs. Potential gains include the capacity to run a parallel program on the idle workstations in a conventional LAN (processors join the computation when they become idle, and withdraw when their owners need them) and to manage the nodes of a dedicated multiprocessor efficiently. Experience to date

Nicholas Carriero; Eric Freeman; David Gelernter; David Kaminsky

1995-01-01

255

An associative processor for air traffic control  

Microsoft Academic Search

In recent years associative memories have been receiving an increasing amount of attention. At the same time multiprocessor and parallel processing systems have been under study to solve very large problems. An associative processor is one form of a parallel processor that seems able to provide a cost effective solution to many problems such as the air traffic control (ATC)

Kenneth James Thurber

1971-01-01

256

FAULT-TOLERANT PARALLEL ALGORITHMS FOR ADAPTIVE MATCHED-FIELD PROCESSING ON DISTRIBUTED ARRAY SYSTEMS  

E-print Network

propagation model instead of a simple plane-wave acoustic propagation model for the ocean [20]. Continuous network, processor, and sensor elements, and degradation in beam power pattern. Such real-time sonar algorithms and better understanding of signal and ocean environment models have resulted in the development

George, Alan D.

257

Inner-product array processor for retrieval of stored images represented by bipolar binary (+1,-1) pixels using partial input trinary pixels represented by (+1,-1)  

NASA Technical Reports Server (NTRS)

An inner-product array processor is provided with thresholding of the inner product during each iteration to make more significant the inner product employed in estimating a vector to be used as the input vector for the next iteration. While stored vectors and estimated vectors are represented in bipolar binary (1,-1), only those elements of an initial partial input vector that are believed to be common with those of a stored vector are represented in bipolar binary; the remaining elements of a partial input vector are set to 0. This mode of representation, in which the known elements of a partial input vector are in bipolar binary form and the remaining elements are set equal to 0, is referred to as trinary representation. The initial inner products corresponding to the partial input vector will then be equal to the number of known elements. Inner-product thresholding is applied to accelerate convergence and to avoid convergence to a negative input product.

Liu, Hua-Kuang (Inventor); Awwal, Abdul A. S. (Inventor); Karim, Mohammad A. (Inventor)

1993-01-01
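
The retrieval loop outlined in the record above (trinary partial input, inner products against each stored vector, thresholding, and re-estimation) can be sketched along the following lines. This is a software illustration under assumptions, not the patented processor: the function name `retrieve`, the choice to zero any inner product below the threshold, and the weighted-sum-then-sign update rule are all hypothetical details filled in for the example.

```python
def retrieve(stored, partial, threshold, max_iter=10):
    """Recall a stored bipolar vector from a trinary partial input (hypothetical helper).

    stored:    list of bipolar (+1/-1) vectors of equal length
    partial:   trinary vector; known elements are +1/-1, unknown elements are 0
    threshold: inner products below this value are set to 0 before weighting
               (assumed positive, so negative inner products are discarded)
    """
    estimate = list(partial)
    for _ in range(max_iter):
        # Inner product of the current estimate with every stored vector.
        products = [sum(s * e for s, e in zip(vec, estimate)) for vec in stored]
        # Thresholding keeps only the significant matches.
        weights = [p if p >= threshold else 0 for p in products]
        # Weighted sum of stored vectors, then hard-limit back to bipolar.
        summed = [sum(w * vec[i] for w, vec in zip(weights, stored))
                  for i in range(len(estimate))]
        new_estimate = [1 if v > 0 else -1 if v < 0 else e
                        for v, e in zip(summed, estimate)]
        if new_estimate == estimate:
            break  # converged: the estimate no longer changes
        estimate = new_estimate
    return estimate
```

With a partial input whose known elements all agree with one stored vector, the first inner product for that vector equals the number of known elements, as the abstract notes, so a threshold just below that count isolates the matching vector in a single pass.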

258

The Indirect Binary n-Cube Microprocessor Array  

Microsoft Academic Search

This paper explores the possibility of using a large-scale array of microprocessors as a computational facility for the execution of massive numerical computations with a high degree of parallelism. By microprocessor we mean a processor realized on one or a few semiconductor chips that include arithmetic and logical facilities and some memory. The current state of LSI technology makes this

Marshall C. Pease III

1977-01-01

259

Microfluidic formation of single cell array for parallel analysis of Ca2+ release-activated Ca2+ (CRAC) channel activation and inhibition  

Microsoft Academic Search

High-throughput single cell analysis is required for understanding and predicting the complex stochastic responses of individual cells in changing environments. We have designed a microfluidic device consisting of parallel, independent channels with cell-docking structures for the formation of an array of individual cells. The microfluidic cell array was used to quantify single cell responses and the distribution of response patterns

Tao Xu; Cheuk-Wing Li; Xinsheng Yao; Guoping Cai; Mengsu Yang

2010-01-01

260

Parallel Alternating-Direction Access Machine  

Microsoft Academic Search

This paper presents a theoretical study of a model of parallel computations called Parallel Alternating-Direction Access Machine (PADAM). PADAM is an abstraction of the multiprocessor computers ADENA/ADENART and a prototype architecture USC/OMP. The main feature of PADAM is the organization of access to the global memory: (1) the memory modules are arranged as a 2-dimensional array, (2) each processor is assigned to a row

Bogdan S. Chlebus; Artur Czumaj; Leszek Gasieniec; Miroslaw Kowaluk; Wojciech Plandowski

1996-01-01

261

FTRAID: A Fat-tree Based Parallel Storage Architecture for Very Large Disk Array  

Microsoft Academic Search

Traditional disk arrays have a centralized architecture, with a single controller through which all requests flow. Such a controller is a single point of failure, and its performance limits the maximum number of disks to which the array can scale. Fat-trees are well-adopted as the topologies of interconnection networks because of many nice properties they have. We propose a novel

Zhikun Wang; Ke Zhou; Dan Feng; Lingfang Zeng; Junping Liu

2007-01-01

262

Hardware multiplier processor  

DOEpatents

A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.

Pierce, Paul E. (Albuquerque, NM)

1986-01-01

263

3D optical interconnect mesh network for on-board parallel multiprocessor system based on EOPCB  

NASA Astrophysics Data System (ADS)

A three-dimensional (3-D) 4×4×4 optical interconnect Mesh network scheme for parallel multiprocessor system based on polymer light waveguide electro-optical printed circuit board (EOPCB) is proposed in this paper. The Mesh topological structures of light waveguide interconnects for processor element chip-to-chip on a board, and board-to-board on backplane are constructed. The system consists of 64 processor element chips interconnected in a 3-D Mesh network configuration. Every processor board comprises 4×4 processor element chips with Mesh interconnection. Board-to-board Mesh interconnects are established on a backplane through light waveguide Mesh interconnect topological structure. An additional optical layer with light waveguide structure is used in conventional PCB to construct EOPCB. Vertical cavity surface emitting laser (VCSEL) array is used as optical transmitter array. PIN photodiode array is used as optical receiver array. A MT-compatible direct coupling method is presented to couple light beam between optical transmitter/receiver with light waveguide layer. The optical signals from a processor element chip on a board can transmit to another processor element chip on another board through light waveguide interconnection in the backplane. So 3-D optical interconnection Mesh network for parallel multiprocessor system can be realized by EOPCB.

Luo, Fengguang; Cao, Mingcui; Zhou, Xinjun; Xu, Jun; Luo, Zhixiang; Yuan, Jing; Zong, Liangjia; Feng, Yonghua; Chen, Chao; Zhang, Conghui

2007-11-01

264

Biological Information Signal Processor  

NASA Technical Reports Server (NTRS)

Biological Information Signal Processor (BISP) is computing system analyzing data on deoxyribonucleic acid (DNA) sequences for molecular genetic analysis. Includes coprocessors, specialized microprocessors complementing present and future computers by performing rapidly most-time-consuming DNA-sequence-analyzing functions, establishing relationships (alignments) between both global sequences and defining patterns in multiple sequences. Also includes state-of-art software and data-base systems on both conventional and parallel computer systems to augment analytical abilities of developmental coprocessors.

Chow, Edward T.; Peterson, John C.; Yoo, Michael M.

1993-01-01

265

Tiled Multicore Processors  

NASA Astrophysics Data System (ADS)

For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x-9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand.

Taylor, Michael B.; Lee, Walter; Miller, Jason E.; Wentzlaff, David; Bratt, Ian; Greenwald, Ben; Hoffmann, Henry; Johnson, Paul R.; Kim, Jason S.; Psota, James; Saraf, Arvind; Shnidman, Nathan; Strumpen, Volker; Frank, Matthew I.; Amarasinghe, Saman; Agarwal, Anant

266

ATAC: A Manycore Processor with On-Chip Optical Network  

E-print Network

Ever since industry has turned to parallelism instead of frequency scaling to improve processor performance, multicore processors have continued to scale to larger and larger numbers of cores. Some believe that multicores ...

Liu, Jifeng

2009-05-05

267

A novel polymeric microelectrode array for highly parallel, long-term neuronal culture and stimulation  

E-print Network

Cell-based high-throughput screening is emerging as a disruptive technology in drug discovery; however, massively parallel electrical assaying of neurons and cardiomyocytes has until now been prohibitively expensive. To ...

Talei Franzesi, Giovanni

2008-01-01

268

Comparative Analysis on the Performance of a Short String of Series-Connected and Parallel-Connected Photovoltaic Array Under Partial Shading  

NASA Astrophysics Data System (ADS)

The output power from a photovoltaic (PV) array decreases and the array exhibits multiple peaks when it is subjected to partial shading (PS). The power loss in the PV array varies with the array configuration, physical location and the shading pattern. This paper compares the relative performance of a PV array consisting of a short string of three PV modules for two different configurations. The mismatch loss, shading loss, fill factor and the power loss due to failure to track the global maximum power point, for a series string with bypass diodes and a short parallel string, are analysed using a MATLAB/Simulink model. The performance of the system is investigated for three different conditions of solar insolation for the same shading pattern. Results indicate that, during PS, the power loss due to shading is considerably greater in a series string than in a parallel string with the same number of modules.

Vijayalekshmy, S.; Rama Iyer, S.; Beevi, Bisharathu

2014-07-01

269

Template-directed atomically precise self-organization of perfectly ordered parallel cerium silicide nanowire arrays on Si(110)-16 × 2 surfaces  

PubMed Central

The perfectly ordered parallel arrays of periodic Ce silicide nanowires can self-organize with atomic precision on single-domain Si(110)-16 × 2 surfaces. The growth evolution of self-ordered parallel Ce silicide nanowire arrays is investigated over a broad range of Ce coverages on single-domain Si(110)-16 × 2 surfaces by scanning tunneling microscopy (STM). Three different types of well-ordered parallel arrays, consisting of uniformly spaced and atomically identical Ce silicide nanowires, are self-organized through the heteroepitaxial growth of Ce silicides on a long-range grating-like 16 × 2 reconstruction at the deposition of various Ce coverages. Each atomically precise Ce silicide nanowire consists of a bundle of chains and rows with different atomic structures. The atomic-resolution dual-polarity STM images reveal that the interchain coupling leads to the formation of the registry-aligned chain bundles within individual Ce silicide nanowires. The nanowire width and the interchain coupling can be adjusted systematically by varying the Ce coverage on a Si(110) surface. This natural template-directed self-organization of perfectly regular parallel nanowire arrays allows for the precise control of the feature size and positions within ±0.2 nm over a large area. Thus, it is a promising route to produce parallel nanowire arrays in a straightforward, low-cost, high-throughput process. PMID:24188092

2013-01-01

270

Photorefractive processing for large adaptive phased arrays  

NASA Astrophysics Data System (ADS)

An adaptive null-steering phased-array optical processor that utilizes a photorefractive crystal to time integrate the adaptive weights and null out correlated jammers is described. This is a beam-steering processor in which the temporal waveform of the desired signal is known but the look direction is not. The processor computes the angle(s) of arrival of the desired signal and steers the array to look in that direction while rotating the nulls of the antenna pattern toward any narrow-band jammers that may be present. We have experimentally demonstrated a simplified version of this adaptive phased-array-radar processor that nulls out the narrow-band jammers by using feedback-correlation detection. In this processor it is assumed that we know a priori only that the signal is broadband and the jammers are narrow band. These are examples of a class of optical processors that use the angular selectivity of volume holograms to form the nulls and look directions in an adaptive phased-array-radar pattern and thereby to harness the computational abilities of three-dimensional parallelism in the volume of photorefractive crystals. The development of this processing in a volume holographic system has led to a new algorithm for phased-array-radar processing that uses fewer tapped-delay lines than does the classic time-domain beam former. The optical implementation of the new algorithm has the further advantage of using a single photorefractive crystal to implement as many as a million adaptive weights, allowing the radar system to scale to large size with no increase in processing hardware.

Weverka, Robert T.; Wagner, Kelvin; Sarto, Anthony

1996-03-01

271

Development and characterization of hollow microprobe array as a potential tool for versatile and massively parallel manipulation of single cells.  

PubMed

Parallel manipulation of single cells is important for reconstructing in vivo cellular microenvironments and studying cell functions. To manipulate single cells and reconstruct their environments, development of a versatile manipulation tool is necessary. In this study, we developed an array of hollow probes using microelectromechanical systems fabrication technology and demonstrated the manipulation of single cells. We conducted a cell aspiration experiment with a glass pipette and modeled a cell using a standard linear solid model, which provided information for designing hollow stepped probes for minimally invasive single-cell manipulation. We etched a silicon wafer on both sides and formed through holes with stepped structures. The inner diameters of the holes were reduced by SiO2 deposition using plasma-enhanced chemical vapor deposition to trap cells on the tips. This fabrication process makes it possible to control the wall thickness, inner diameter, and outer diameter of the probes. With the fabricated probes, single cells were manipulated and placed in microwells at a single-cell level in a parallel manner. We studied the capture, release, and survival rates of cells at different suction and release pressures and found that the cell trapping rate was directly proportional to the suction pressure, whereas the release rate and viability decreased with increasing suction pressure. The proposed manipulation system makes it possible to place cells in a well array and observe the adherence, spreading, culture, and death of the cells. This system has potential as a tool for massively parallel manipulation and for three-dimensional hetero cellular assays. PMID:25749639

Nagai, Moeto; Oohara, Kiyotaka; Kato, Keita; Kawashima, Takahiro; Shibata, Takayuki

2015-04-01

272

Foreword for the Patterns for Parallel Software Design Book by Jorge Ortega Arjona  

E-print Network

The steady increases in processor speeds associated with Moore's law have improved software performance for decades, however, the exponential growth in CPU speed has stalled. Increases in software performance now stem and deliver value to users in a wide range of application domains, including high-performance scientific compu

Schmidt, Douglas C.

273

Dual flux-to-voltage response of YBa2Cu3O7-δ asymmetric parallel arrays of Josephson junctions  

NASA Astrophysics Data System (ADS)

We fabricated a parallel array of 440 YBa2Cu3O7-δ bicrystal grain boundary Josephson junctions having an inductive asymmetric loop configuration within the array. Families of current-voltage characteristics (IVCs) have been measured in the temperature range (4.7-92) K for various values of a magnetic flux applied via a control current Ictrl. For both positive and negative current biases I, current-driven chains of magnetic vortices propagate along the array, producing flux-flow current resonances on the IVCs. However, at 77 K and above, due to the system’s inductive asymmetry the flux flow is suppressed (enhanced) for negative (positive) I. Consequently, the system shows a dual flux-to-voltage response. For negative I it operates like a flux-interferometer having a rather sinusoidal V (Ictrl) response. In contrast, for positive I the device’s response V (Ictrl) remains periodic but highly non-sinusoidal due to the interplay between multiple flux-flow modes. Below 60 K such a dual behaviour is far less pronounced as a result of flux-flow modes being suppressed due to a decrease of the dissipation coefficient with temperature.

Chesca, Boris; John, Daniel; Mellor, Christopher J.

2014-05-01

274

Variation in bandwidths among solutions to shaped beam synthesis problems concerning linear arrays of parallel dipoles  

Microsoft Academic Search

The problem of synthesizing a linear array generating a shaped beam pattern with M filled nulls has 2^M alternative solutions. In this study we examined their bandwidths as regards compliance with pattern quality or input impedance requirements in the presence and absence of a backing ground plane. Placing a ground plane behind the antenna almost doubles sidelobe level bandwidth.

J. C. Bregains; F. Ares

2005-01-01

275

Implementation and Assessment of Advanced Analog Vector-Matrix Processor  

NASA Technical Reports Server (NTRS)

This paper discusses the design and implementation of an analog optical vector-matrix coprocessor with a throughput of 128 Mops for a personal computer. Vector matrix calculations are inherently parallel, providing a promising domain for the use of optical calculators. However, to date, digital optical systems have proven too cumbersome to replace electronics, and analog processors have not demonstrated sufficient accuracy in large scale systems. The goal of the work described in this paper is to demonstrate a viable optical coprocessor for linear operations. The analog optical processor presented has been integrated with a personal computer to provide full functionality and is the first demonstration of an optical linear algebra processor with a throughput greater than 100 Mops. The optical vector matrix processor consists of a laser diode source, an acoustooptical modulator array to input the vector information, a liquid crystal spatial light modulator to input the matrix information, an avalanche photodiode array to read out the result vector of the vector matrix multiplication, as well as transport optics and the electronics necessary to drive the optical modulators and interface to the computer. The intent of this research is to provide a low cost, highly energy efficient coprocessor for linear operations. Measurements of the analog accuracy of the processor performing 128 Mops are presented along with an assessment of the implications for future systems. A range of noise sources, including cross-talk, source amplitude fluctuations, shot noise at the detector, and non-linearities of the optoelectronic components are measured and compared to determine the most significant source of error. The possibilities for reducing these sources of error are discussed. Also, the total error is compared with that expected from a statistical analysis of the individual components and their relation to the vector-matrix operation. 
The sufficiency of the measured accuracy of the processor is compared with that required for a range of typical problems. Calculations resolving alloy concentrations from spectral plume data of rocket engines are implemented on the optical processor, demonstrating its sufficiency for this problem. We also show how this technology can be easily extended to a 100 x 100 10 MHz (200 Gops) processor.

Gary, Charles K.; Bualat, Maria G.; Lum, Henry, Jr. (Technical Monitor)

1994-01-01
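
Note on the throughput figure in the record above: the following is a minimal sketch of the arithmetic behind the proposed 100 x 100, 10 MHz extension. It assumes that one full vector-matrix product completes per clock cycle and that each matrix element contributes two operations (one multiply and one add). The abstract does not state this counting convention, and the function name is hypothetical.

```python
def vector_matrix_ops_per_second(rows, cols, clock_hz, ops_per_element=2):
    """Operations per second, assuming one rows x cols vector-matrix
    product per clock cycle (assumption, not stated in the abstract)."""
    return rows * cols * ops_per_element * clock_hz


# Proposed extension: 100 x 100 matrix at a 10 MHz update rate.
proposed = vector_matrix_ops_per_second(100, 100, 10_000_000)
print(proposed / 1e9, "giga-operations per second")
```

Under this counting the proposed processor comes to 200 giga-operations per second, which matches the figure quoted in the abstract.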

276

Optimal expression evaluation for data parallel architectures  

NASA Technical Reports Server (NTRS)

A data parallel machine represents an array or other composite data structure by allocating one processor (at least conceptually) per data item. A pointwise operation can be performed between two such arrays in unit time, provided their corresponding elements are allocated in the same processors. If the arrays are not aligned in this fashion, the cost of moving one or both of them is part of the cost of the operation. The choice of where to perform the operation then affects this cost. If an expression with several operands is to be evaluated, there may be many choices of where to perform the intermediate operations. An efficient algorithm is given to find the minimum-cost way to evaluate an expression, for several different data parallel architectures. This algorithm applies to any architecture in which the metric describing the cost of moving an array is robust. This encompasses most of the common data parallel communication architectures, including meshes of arbitrary dimension and hypercubes. Remarks are made on several variations of the problem, some of which are solved and some of which remain open.

Gilbert, John R.; Schreiber, Robert

1990-01-01
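
Note on the record above: the placement problem it describes can be illustrated with a small brute-force sketch. This is not the efficient algorithm of the paper. It assumes a one-dimensional mesh in which moving an array from position y to position x costs |y - x|, treats each pointwise operation as free once both operands sit at the same position, and uses a hypothetical encoding in which a leaf is an integer home position and an internal node is a pair (left, right).

```python
def min_evaluation_cost(expr, positions):
    """Minimum total movement cost to evaluate expr on a 1-D mesh.

    expr: an int (home position of an operand array) or a tuple
          (left, right) standing for a pointwise binary operation.
    positions: the candidate positions, e.g. range(5).
    """
    def located(node):
        # located(node)[x] = cheapest cost to have node's value at x.
        if isinstance(node, int):
            return {x: abs(node - x) for x in positions}
        left, right = located(node[0]), located(node[1])
        # Perform the operation at y, where both operands are aligned.
        computed = {y: left[y] + right[y] for y in positions}
        # Optionally move the result from y to x afterwards.
        return {x: min(computed[y] + abs(y - x) for y in positions)
                for x in positions}

    return min(located(expr).values())


# (a op b) op c with a at position 0, b and c at position 4.
print(min_evaluation_cost(((0, 4), 4), range(5)))
```

The sketch enumerates every position for every intermediate result, so it only conveys how the choice of where to perform each operation enters the cost.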

277

Communication efficient parallel algorithms for nonnumerical computations  

SciTech Connect

The broad goal of this research is to develop a set of paradigms for mapping data-dependent symbolic computations on realistic models of parallel architectures. Within this goal, the thesis represents the initial effort to achieve efficient parallel solutions for a number of non-numerical problems on networks of processors. The specific contributions of the thesis are new parallel algorithms, exhibiting linear speedup on architectures consisting of fixed numbers of processors (i.e., bounded models). The following problems have been considered in the thesis: (1) Determine the minimum spanning tree (MST), and identify the bridges and articulation points (APs) of an undirected weighted graph represented by an n x n adjacency matrix. (2) The pattern matching problem: Given two strings of characters, of lengths m and n ({number sign}m) respectively, mark all positions in the second string where there appears an instance of the first string. (3) Sort n elements. For each problem, the author uses a processor-network consisting of p processors. The network model used in the solution of the first set of problems is the linear array; while that used in the solutions of the second and third problems is a butterfly-connected system. The solutions on the butterfly-connected system apply also on a pipelined hypercube. The performances of the solutions are summarized.

Doshi, K.A.

1988-01-01
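
Note on problem (2) in the record above: a minimal sketch of what "mark all positions" means, together with one simple way the candidate positions could be divided among p workers. The block partition shown here is an assumption made for illustration and is simulated sequentially; it is not the butterfly-network algorithm of the thesis, and both function names are hypothetical.

```python
def mark_matches(pattern, text):
    """Return a list of booleans; entry i is True when an instance of
    pattern starts at position i of text."""
    m, n = len(pattern), len(text)
    marks = [False] * n
    if m == 0 or m > n:
        return marks
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            marks[i] = True
    return marks


def mark_matches_partitioned(pattern, text, p):
    """Same result, with the start positions split into p contiguous
    blocks, one per (simulated) worker."""
    m, n = len(pattern), len(text)
    marks = [False] * n
    if m == 0 or m > n or p < 1:
        return marks
    starts = n - m + 1
    block = -(-starts // p)  # ceiling division
    for worker in range(p):
        lo = worker * block
        hi = min(lo + block, starts)
        # Each worker reads up to m - 1 characters past its own block.
        for i in range(lo, hi):
            if text[i:i + m] == pattern:
                marks[i] = True
    return marks


print(mark_matches("aba", "ababa"))
```

Because the blocks are disjoint, the workers never write to the same entry of the result, which is what makes this partition easy to run in parallel.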

278

Arrays  

NSDL National Science Digital Library

This interactive Flash applet helps students develop the concept of equal groups as a foundation for multiplication and division. The applet displays an array of dots, some of which are covered by a card. Students use the visible number of rows and columns to determine the total number of dots. Clicking on the card reveals the full array, and a voice announces the total.

2011-01-01

279

Parallel array InAs nanowire transistors for mechanically bendable, ultrahigh frequency electronics.  

PubMed

The radio frequency response of InAs nanowire array transistors on mechanically flexible substrates is characterized. For the first time, GHz device operation of nanowire arrays is demonstrated, despite the relatively long channel lengths of ~1.5 μm used in this work. Specifically, the transistors exhibit an impressive maximum frequency of oscillation, f(max) ~ 1.8 GHz, and a cutoff frequency, f(t) ~ 1 GHz. The high-frequency response of the devices is due to the high saturation velocity of electrons in high-mobility InAs nanowires. The work presents a new platform for flexible, ultrahigh frequency devices with potential applications in high-performance digital and analog circuitry. PMID:20845916

Takahashi, Toshitake; Takei, Kuniharu; Adabi, Ehsan; Fan, Zhiyong; Niknejad, Ali M; Javey, Ali

2010-10-26

280

Electro-optical microwave signal processor for high-frequency wideband frequency channelization  

NASA Astrophysics Data System (ADS)

An electro-optic microwave signal processor for activity monitoring in an electronic warfare receiver, offering wideband operation, parallel output in real time and 100 percent probability of intercept, is presented, along with results from a prototype system. Requirements on electronic warfare receiver systems are demanding, because they have to detect and identify potential threats across a large frequency bandwidth and in the high pulse density expected of the battlefield environment. One technique for processing signals across a wide bandwidth is to use a channelizer in the receiver front-end, in order to produce a number of narrow band outputs that can be individually processed. In the presented signal processor, received microwave signals are upconverted onto an optical carrier using an electro-optic modulator and then spatially separated into a series of spots. The position and intensity of the spots are determined by the frequency and strength of the received signal(s). Finally, a photodiode array can be used for fast parallel data readout. Thus the signal processor output is fully channelized according to frequency. A prototype signal processor has been constructed, which can process microwave frequencies from 500 MHz to 8 GHz. A standard telecommunications electro-optic intensity modulator with a 3 dB bandwidth of approximately 2.5 GHz provides frequency upconversion. Readout is achieved using either a near IR camera or a 16 element linear photodiode array.

Dawber, William N.; Webster, Ken

1998-08-01

281

Parallel image-acquisition in continuous-wave electron paramagnetic resonance imaging with a surface coil array: Proof-of-concept experiments  

NASA Astrophysics Data System (ADS)

This article describes a feasibility study of parallel image-acquisition using a two-channel surface coil array in continuous-wave electron paramagnetic resonance (CW-EPR) imaging. Parallel EPR imaging was performed by multiplexing of EPR detection in the frequency domain. The parallel acquisition system consists of two surface coil resonators and radiofrequency (RF) bridges for EPR detection. To demonstrate the feasibility of this method of parallel image-acquisition with a surface coil array, three-dimensional EPR imaging was carried out using a tube phantom. Technical issues in the multiplexing method of EPR detection were also clarified. We found that degradation in the signal-to-noise ratio due to the interference of RF carriers is a key problem to be solved.

Enomoto, Ayano; Hirata, Hiroshi

2014-02-01

282

Array Privatization for Parallel Execution of Loops Department of Computer Science  

E-print Network

Before After Array Priv. MDG 1.1 5.5 Yes OCEAN 1.42 8.3 Yes TRACK 0.90 5.1 Yes TRFD 2.36 13.2 Yes Table 2 Spdup Seq TRACK 300 nl lt 5.2 40% 300 fptrak 6.0 9% 400 extend 7.0 34% MDG 1000 interf 6.0 90% 2000

Li, Zhiyuan

283

Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays  

Microsoft Academic Search

We describe a novel sequencing approach that combines non-gel-based signature sequencing with in vitro cloning of millions of templates on separate 5 μm diameter microbeads. After constructing a microbead library of DNA templates by in vitro cloning, we assembled a planar array of a million template-containing microbeads in a flow cell at a density greater than 3 × 10^6 microbeads/cm2.

Maria Johnson; John Bridgham; George Golda; David H. Lloyd; Davida Johnson; Shujun Luo; Sarah McCurdy; Michael Foy; Mark Ewan; Rithy Roth; Dave George; Sam Eletr; Glenn Albrecht; Eric Vermaas; Steven R. Williams; Keith Moon; Timothy Burcham; Michael Pallas; Robert B. DuBridge; James Kirchner; Karen Fearon; Jen-i Mao; Kevin Corcoran; Sydney Brenner

2000-01-01

284

Development of parallel architectures for sensor array-processing algorithms. Semi-Annual report  

SciTech Connect

High resolution direction of arrival (DOA) estimation has been an important area of research for a number of years. Many researchers have developed a variety of algorithms to estimate the direction of arrival. Another important aspect of the DOA estimation area is the development of high speed hardware capable of computing the DOA in real time. In this research the authors have first focussed on the development of parallel architectures for the multiple signal classification (MUSIC) and estimation of signal parameters by rotational invariance technique (ESPRIT) algorithms for narrow band sources. These algorithms are substituted with computationally efficient modules and converted to pipelined and parallel algorithms. For example, one important computation, the eigendecomposition of the covariance matrix, has been performed using Householder transformations and the QR method.

Jamali, M.M.; Kwatra, S.C.; Djoudi, A.; Sheelvant, R.; Rao, M.

1991-08-01

285

Field Programmable Gate Array Based Parallel Strapdown Algorithm Design for Strapdown Inertial Navigation Systems  

PubMed Central

A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on the single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Different from existing algorithms, the updating rates of the coning and sculling compensations are unrelated to the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows the updating rate of the coning and sculling compensation to be increased, while using more gyro incremental angle and accelerometer incremental velocity samples in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm to meet the real-time and high precision requirements of the system in a highly dynamic environment, relative to the existing implementation on the DSP platform. PMID:22164058

Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

2011-01-01

286

Data parallel algorithms  

Microsoft Academic Search

Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

W. Daniel Hillis; Guy L. Steele Jr.

1986-01-01
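
Note on the record above: the abstract does not name a specific algorithm, so the following is only an illustrative sketch of the data parallel style, using the running (prefix) sum, a problem that looks serial at first glance. The list comprehension stands in for one processor per element, with every element updated in the same step; the function name is hypothetical and the steps are simulated sequentially.

```python
def data_parallel_prefix_sum(values):
    """Inclusive prefix sum in about log2(n) data parallel steps.

    In each step every position i adds the value held `step` places
    to its left; all positions read the previous step's values.
    """
    result = list(values)
    n = len(result)
    step = 1
    while step < n:
        previous = result  # snapshot read by all "processors"
        result = [
            previous[i] + previous[i - step] if i >= step else previous[i]
            for i in range(n)
        ]
        step *= 2
    return result


print(data_parallel_prefix_sum([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```

With one processor per element the loop body runs once per doubling of `step`, so n elements need roughly log2(n) steps instead of n - 1 sequential additions.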

287

Parallel Job Scheduling and Workloads  

E-print Network

Parallel Job Scheduling and Workloads Dror Feitelson Hebrew University Parallel Jobs · A set · On multicores: probably more dynamic MPP Parallel Job Scheduling · Each job is a rectangle in processorsXtime space · Given many jobs, we must schedule them to run on available processors · This is like packing

Segall, Adrian
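
Note on the record above: the picture of each job as a rectangle in processors-by-time space can be made concrete with a minimal first-come-first-served sketch. FCFS is used here only as a simple baseline and is not taken from the slides. The sketch assumes all jobs are queued at time 0 with known runtimes; the function and job names are hypothetical.

```python
import heapq


def fcfs_schedule(jobs, total_processors):
    """Start jobs strictly in queue order on a machine with
    total_processors processors.

    jobs: list of (name, processors, runtime) tuples.
    Returns a dict mapping each job name to its start time.
    """
    running = []  # min-heap of (finish_time, processors_held)
    free = total_processors
    now = 0
    start_times = {}
    for name, procs, runtime in jobs:
        if procs > total_processors:
            raise ValueError(f"job {name} needs more processors than exist")
        # Wait for running jobs to finish until the rectangle fits.
        while free < procs:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        start_times[name] = now
        free -= procs
        heapq.heappush(running, (now + runtime, procs))
    return start_times


queue = [("A", 2, 10), ("B", 2, 5), ("C", 4, 3), ("D", 1, 1)]
print(fcfs_schedule(queue, 4))  # {'A': 0, 'B': 0, 'C': 10, 'D': 13}
```

In this example job D could have run in the idle processors freed at time 5, which shows why strict queue order packs the rectangles loosely.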

288

Gamma-ray imaging using a CdZnTe pixel array and a high-resolution, parallel-hole collimator  

Microsoft Academic Search

The poor performance of current parallel-hole collimators is an impediment to planar high-resolution gamma-ray imaging, even when high-resolution semiconductor detector arrays are available. High-resolution parallel-hole collimators are possible but have not been fabricated because conventional collimator construction techniques severely limit achievable bore size and septal thickness. We describe development and testing of a high-resolution collimator with 4096 260-μm square bores

G. A. Kastis; H. B. Barber; H. H. Barrett; S. J. Balzer; D. Lu; D. G. Marks; G. Stevenson; J. M. Woolfenden; M. Appleby; J. Tueller

2000-01-01

289

Gamma-ray imaging using a CdZnTe pixel array and a high-resolution, parallel-hole collimator  

Microsoft Academic Search

The poor performance of current parallel-hole collimators is an impediment to planar high-resolution gamma-ray imaging, even when high-resolution semiconductor detector arrays are available. High-resolution parallel-hole collimators are possible but have not been fabricated because current collimator construction techniques severely limit achievable bore size and septal thickness. We describe development and testing of a high-resolution collimator with 4096 260-μm square bores

G. A. Kastis; H. B. Barber; H. H. Barrett; S. J. Balzer; D. Lu; D. G. Marks; G. Stevenson; J. M. Woolfenden; M. Appleby; J. Tueller

1999-01-01

290

Parallel array of nanochannels grafted with polymer-brushes-stabilized Au nanoparticles for flow-through catalysis.  

PubMed

Smart systems on the nanometer scale for continuous flow-through reaction present fascinating advantages in heterogeneous catalysis, in which a parallel array of straight nanochannels offers a platform with high surface area for assembling and stabilizing metallic nanoparticles working as catalysts. Herein we demonstrate a method for finely modifying the nanoporous anodic aluminum oxide (AAO), and further integration of nanoreactors. By using atomic transfer radical polymerization (ATRP), polymer brushes were successfully grafted on the inner wall of the nanochannels of the AAO membrane, followed by exchanging counter ions with a precursor for nanoparticles (NPs), and used as the template for deposition of well-defined Au NPs. The membrane was used as a functional nanochannel for novel flow-through catalysis. High catalytic performance and instantaneous separation of products from the reaction system was achieved in reduction of 4-nitrophenol. PMID:24129356

Liu, Jianxi; Ma, Shuanhong; Wei, Qiangbing; Jia, Lei; Yu, Bo; Wang, Daoai; Zhou, Feng

2013-12-01

291

Parallel array of nanochannels grafted with polymer-brushes-stabilized Au nanoparticles for flow-through catalysis  

NASA Astrophysics Data System (ADS)

Smart systems on the nanometer scale for continuous flow-through reaction present fascinating advantages in heterogeneous catalysis, in which a parallel array of straight nanochannels offers a platform with high surface area for assembling and stabilizing metallic nanoparticles working as catalysts. Herein we demonstrate a method for finely modifying the nanoporous anodic aluminum oxide (AAO), and further integration of nanoreactors. By using atomic transfer radical polymerization (ATRP), polymer brushes were successfully grafted on the inner wall of the nanochannels of the AAO membrane, followed by exchanging counter ions with a precursor for nanoparticles (NPs), and used as the template for deposition of well-defined Au NPs. The membrane was used as a functional nanochannel for novel flow-through catalysis. High catalytic performance and instantaneous separation of products from the reaction system was achieved in reduction of 4-nitrophenol.

Liu, Jianxi; Ma, Shuanhong; Wei, Qiangbing; Jia, Lei; Yu, Bo; Wang, Daoai; Zhou, Feng

2013-11-01

292

Interactive animation of fault-tolerant parallel algorithms  

SciTech Connect

Animation of algorithms makes understanding them intuitively easier. This paper describes the software tool Raft (Robust Animator of Fault Tolerant Algorithms). The Raft system allows the user to animate a number of parallel algorithms which achieve fault tolerant execution. In particular, we use it to illustrate the key Write-All problem. It has an extensive user-interface which allows a choice of the number of processors, the number of elements in the Write-All array, and the adversary to control the processor failures. The novelty of the system is that the interface allows the user to create new on-line adversaries as the algorithm executes.

Apgar, S.W.

1992-02-01

293

Parallel multi-step nanolithography by nanoscale Cu-covered h-PDMS tip array  

NASA Astrophysics Data System (ADS)

Tip-based nanolithography provides a flexible nanolithographic technology. Tip fabrication is one of the main challenges. In this paper, we propose to combine the dry etching of photoresist and electro-chemical machining to reduce the size of the tip opening. We successfully fabricate a tip opening with a diameter of 200 nm. After lithography and lift-off, gold dot patterns with a diameter of 280 nm are demonstrated. Moreover, a home-made multi-step exposure system is built and both the successful 14- and 44-step nanolithography by a tip array are also demonstrated in the paper.

Chang, Yuan-Jen; Huang, Han-Kuan

2014-09-01

294

Orthogonal and parallel lattice plasmon resonance in core-shell SiO2/Au nanocylinder arrays.  

PubMed

Height-induced coupling behavior between the plasmonic modes and diffraction orders was studied in the core-shell SiO2/Au nanocylinder arrays (NCAs) using finite difference time domain (FDTD) simulations. New lattice plasmon modes (LPMs) are observed in the structures with high aspect ratio. Specifically, parallel coupling between the plasmonic modes and diffraction orders is obtained here, which shows different coupling behavior from orthogonal LPMs. Electromagnetic (EM) field distributions indicate that horizontal propagation of the magnetic or electric field component is responsible for the generation of these orthogonal and parallel LPMs, respectively. Radiative loss could be effectively suppressed when the height increases. This is important for the applications of fluorescence enhancement and nano laser. Further studies confirm that the LPMs associated with the superstrate diffraction orders could be well maintained even when the Au coating is imperfect. The interference from the substrate associated LPMs could be eliminated by cutting off the corresponding diffraction waves by inducing a Si3N4 substrate. This study of coupling behavior in the core-shell NCAs enables a novel route to design and optimize the LPMs for applications of bio-sensing and nano laser. PMID:25835660

Lin, Linhan; Yi, Yasha

2015-01-12

295

A dynamically reconfigurable asynchronous processor for low power applications  

Microsoft Academic Search

There is an increasing demand in high-throughput mobile applications for programmability and energy efficiency. Conventional mobile Central Processing Units (CPUs) and Very Long Instruction Word (VLIW) processors cannot meet these demands. In this paper, we present a novel dynamically reconfigurable processor that targets these requirements. The processor consists of a heterogeneous array of coarse grain asynchronous cells. The architecture maintains

Khodor Ahmad Fawaz; Tughrul Arslan; Sami Khawam; Mark Muir; Ioannis Nousias; Iain Lindsay; Ahmet T. Erdogan

2010-01-01

296

Adaptive Parallelism with Piranha  

Microsoft Academic Search

"Adaptive parallelism" refers to parallel computations on a dynamically changing set of processors: processors may join or withdraw from the computation as it proceeds. Networks of fast workstations are the most important setting for adaptive parallelism at present. Workstations at most sites are typically idle for significant fractions of the day, and those idle cycles may constitute in the aggregate a powerful computing resource. For

Nicholas Carriero; David Gelernter; David Kaminsky; Jeffery Westbrook

297

Customization of application specific heterogeneous multi-pipeline processors  

Microsoft Academic Search

In this paper we propose application specific instruction set processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the application, and

Swarnalatha Radhakrishnan; Hui Guo; Sri Parameswaran

2006-01-01

298

Magnetic arrays  

DOEpatents

Electromagnet arrays are disclosed which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness. 12 figs.

Trumper, D.L.; Kim, W.; Williams, M.E.

1997-05-20

299

Magnetic arrays  

DOEpatents

Electromagnet arrays which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness.

Trumper, David L. (Plaistow, NH); Kim, Won-jong (Cambridge, MA); Williams, Mark E. (Pelham, NH)

1997-05-20

300

A periodic array of nano-scale parallel slats for high-efficiency electroosmotic pumping.  

PubMed

It is known that the electroosmotic (EO) flow rate through a nano-scale channel is extremely small. A channel made of a periodic array of slats is proposed to effectively promote the EO pumping, and thus greatly improve the EO flow rate. The geometrically simple array is complicated enough that four length scales are involved: the vertical period 2L, lateral period 2aL, width of the slat 2cL as well as the Debye length λD. The EO pumping rate is determined by the normalized lengths: a, c, or the perforation fraction of slats ?=1-(c/a) and the dimensionless electrokinetic width K=L/λD. In a nano-scale channel, K is of order unity or less. EO pumping in both longitudinal and transverse directions (denoted as longitudinal EO pumping (LEOP) and transverse EO pumping (TEOP), respectively) is investigated by solving the Debye-Hückel approximation and viscous electro-kinetic equation. The main findings include that (i) the EO pumping rates of LEOP for small K are remarkably improved (by one order of magnitude) when we have longer slats (a?1) and a large perforation fraction of slats (? > 0.7); (ii) the EO pumping rates of TEOP for small K can also be much improved but less significantly with longer slats and a large perforation fraction of slats. Nevertheless, it must be noted that in practice K cannot be made arbitrarily small as the criterion of ?c?0 for the reference potential at the channel center puts lower bounds on K; in other words, there are geometrical limits for the use of the Poisson-Boltzmann equation. PMID:24105905

Kung, Chun-Fei; Wang, Chang-Yi; Chang, Chien-Cheng

2013-12-01

301

Massively parallel electron beam direct writing (MPEBDW) system based on micro-electro-mechanical system (MEMS)/nanocrystalline Si emitter array  

NASA Astrophysics Data System (ADS)

The characteristics of a prototype massively parallel electron beam direct writing (MPEBDW) system are demonstrated. The electron optics consist of an emitter array, a micro-electro-mechanical system (MEMS) condenser lens array, auxiliary lenses, a stigmator, three-stage deflectors to align and scan the parallel beams, and an objective lens acting as a reduction lens. The emitter array produces 10000 programmable 10 μm square beams. The electron emitter is a nanocrystalline silicon (nc-Si) ballistic electron emitter array integrated with an active matrix driver LSI for high-speed emission current control. Because the LSI also has a field curvature correction function, the system can use a large electron emitter array. In this system, beams that are incident on the outside of the paraxial region of the reduction lens can also be used through use of the optical aberration correction functions. The exposure pattern is stored in the active matrix LSI's memory. Alignment between the emitter array and the condenser lens array is performed by moving the emitter stage that slides along the x- and y-axes, and rotates around the z-theta axis. The electrons of all beams are accelerated, and pass through the anode array. The stigmator and the two-stage deflectors perform fine adjustments to the beam positions. The other deflector simultaneously scans all parallel beams to synchronize the moving target stage. Exposure is carried out by moving the target stage that holds the wafer. The reduction lens focuses all beams on the target wafer surface, and the electron optics of the column reduces the electron image to 0.1% of its original size.

Kojima, A.; Ikegami, N.; Yoshida, T.; Miyaguchi, H.; Muroyama, M.; Nishino, H.; Yoshida, S.; Sugata, M.; Ohyi, H.; Koshida, N.; Esashi, M.

2014-03-01

302

Optimal graph algorithms on a fixed-size linear array  

SciTech Connect

Parallel algorithms for computing the minimum spanning tree of a weighted undirected graph, and the bridges and articulation points of an undirected graph, on a fixed-size linear array of processors are presented. For a graph of n vertices, the algorithms operate on a linear array of ρ processors and require O(n^2/ρ) time for all ρ, 1 ≤ ρ ≤ n. In particular, using n processors the algorithms require O(n) time, which is optimal on this model. The paper describes two approaches to limit the communication requirements for solving the problems. The first is a divide-and-conquer strategy applied to Sollin's algorithm for finding the minimum spanning tree of a graph. The second uses a novel data-reduction technique that constructs an auxiliary graph with no more than 2n - 2 edges, whose bridges and articulation points are the bridges and articulation points of the original graph.

Doshi, K.A.; Varman, P.J.

1987-04-01
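
The record above builds on Sollin's algorithm (also known as Borůvka's algorithm). As a rough illustration only, the following is a minimal sequential sketch of that algorithm; it does not reproduce the paper's divide-and-conquer mapping onto a linear processor array. The function name `sollin_mst` and the `(weight, u, v)` edge format are assumptions made for this example.

```python
def sollin_mst(n, edges):
    """Sequential sketch of Sollin's (Boruvka's) minimum spanning tree algorithm.

    n     -- number of vertices, labelled 0..n-1 (hypothetical convention)
    edges -- list of (weight, u, v) tuples describing an undirected graph
    Returns (total_weight, chosen_edges).
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    components = n
    while components > 1:
        # Phase 1: every component picks its cheapest outgoing edge.
        # Comparing whole tuples breaks weight ties consistently,
        # which prevents cycles when several components merge at once.
        cheapest = {}
        for edge in edges:
            _, u, v = edge
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for root in (ru, rv):
                if root not in cheapest or edge < cheapest[root]:
                    cheapest[root] = edge
        if not cheapest:
            raise ValueError("graph is not connected")
        # Phase 2: merge components along the selected edges.
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                chosen.append((w, u, v))
                components -= 1
    return sum(w for w, _, _ in chosen), chosen
```

Each round at least halves the number of components, so there are O(log n) rounds; the per-round edge scan is the part a parallel version would distribute across processors.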

303

FFT Computation with Systolic Arrays, A New Architecture  

NASA Technical Reports Server (NTRS)

The use of the Cooley-Tukey algorithm for computing the 1-D FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.

Boriakoff, Valentin

1994-01-01
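
For readers unfamiliar with the decimation-in-frequency form of the Cooley-Tukey algorithm mentioned in the record above, here is a minimal software sketch. It is a plain recursive radix-2 version in floating point, not the systolic or fixed-point design the abstract describes. The names `fft_dif` and `dft_naive` are chosen for this example.

```python
import cmath


def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    half = n // 2
    # Butterfly stage: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    top = [x[k] + x[k + half] for k in range(half)]
    bottom = [
        (x[k] - x[k + half]) * cmath.exp(-2j * cmath.pi * k / n)
        for k in range(half)
    ]
    out = [0j] * n
    out[0::2] = fft_dif(top)     # X[0], X[2], X[4], ...
    out[1::2] = fft_dif(bottom)  # X[1], X[3], X[5], ...
    return out


def dft_naive(x):
    """Direct O(n^2) DFT, used here only to cross-check fft_dif."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]
```

In decimation in frequency the input is consumed in natural order and the outputs are split by index parity at each stage, which is why a word-serial input stream suits this variant.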

304

Parallel recognition of cancer cells using an addressable array of solid-state micropores.  

PubMed

Early stage detection and precise quantification of circulating tumor cells (CTCs) in the peripheral blood of cancer patients are important for early diagnosis. Early diagnosis improves the effectiveness of the therapy and results in better prognosis. Several techniques have been used for CTC detection but are limited by their need for dye tagging, low throughput and lack of statistical reliability at single cell level. Solid-state micropores can characterize each cell in a sample providing interesting information about cellular populations. We report a multi-channel device which utilized solid-state micropores array assembly for simultaneous measurement of cell translocation. This increased the throughput of measurement and as the cells passed the micropores, tumor cells showed distinctive current blockade pulses, when compared to leukocytes. The ionic current across each micropore channel was continuously monitored and recorded. The measurement system not only increased throughput but also provided on-chip cross-relation. The whole blood was lysed to get rid of red blood cells, so the blood dilution was not needed. The approach facilitated faster processing of blood samples with tumor cell detection efficiency of about 70%. The design provided a simple and inexpensive method for rapid and reliable detection of tumor cells without any cell staining or surface functionalization. The device can also be used for high throughput electrophysiological analysis of other cell types. PMID:25038540

Ilyas, Azhar; Asghar, Waseem; Kim, Young-tae; Iqbal, Samir M

2014-12-15

305

An SoC combining a 132dB QVGA pixel array and a 32b DSP/MCU processor for vision applications  

Microsoft Academic Search

Key elements for machine vision are the intra-scene dynamic range of the optical front-end, and a data representation that is as independent as possible from the illumination level. Furthermore, combining an optical front-end and a processor on the same chip enables a single-chip vision system to perform image acquisition, analysis and decision-making. This paper presents a system-on-chip which combines a

Pierre-François Rüedi; Pascal Heim; Stève Gyger; François Kaess; Claude Arm; Ricardo Caseiro; Jean-Luc Nagel; Silvio Todeschini

2009-01-01

306

Upset Characterization of the PowerPC405 Hard-core Processor Embedded in Virtex-II Pro Field Programmable Gate Arrays  

NASA Technical Reports Server (NTRS)

Shown in this presentation are recent results for the upset susceptibility of the various types of memory elements in the embedded PowerPC405 in the Xilinx V2P40 FPGA. For critical flight designs where configuration upsets are mitigated effectively through appropriate design triplication and configuration scrubbing, these upsets of processor elements can dominate the system error rate. Data from irradiations with both protons and heavy ions are given and compared using available models.

Swift, Gary M.; Allen, Gregory S.; Farmanesh, Farhad; George, Jeffrey; Petrick, David J.; Chayab, Fayez

2006-01-01

307

Calculating electronic tunnel currents in networks of disordered irregularly shaped nanoparticles by mapping networks to arrays of parallel nonlinear resistors  

NASA Astrophysics Data System (ADS)

We have shown both theoretically and experimentally that tunnel currents in networks of disordered irregularly shaped nanoparticles (NPs) can be calculated by considering the networks as arrays of parallel nonlinear resistors. Each resistor is described by a one-dimensional or a two-dimensional array of equal size nanoparticles in which the tunnel junction gaps between nanoparticles are assumed to be equal. The number of tunnel junctions between two contact electrodes and the tunnel junction gaps between nanoparticles are found to be functions of Coulomb blockade energies. In addition, the tunnel barriers between nanoparticles were considered to be tilted at high voltages. Furthermore, the role of thermal expansion coefficient of the tunnel junction gaps on the tunnel current is taken into account. The model calculations fit very well to the experimental data of a network of disordered gold nanoparticles, a forest of multi-wall carbon nanotubes, and a network of few-layer graphene nanoplates over a wide temperature range (5-300 K) at low and high DC bias voltages (0.001 mV-50 V). Our investigations indicate that, although electron cotunneling in networks of disordered irregularly shaped NPs may occur, non-Arrhenius behavior at low temperatures cannot be described by the cotunneling model due to size distribution in the networks and irregular shape of nanoparticles. Non-Arrhenius behavior of the samples at zero bias voltage limit was attributed to the disorder in the samples. Unlike the electron cotunneling model, we found that the crossover from Arrhenius to non-Arrhenius behavior occurs at two temperatures, one at a high temperature and the other at a low temperature.

Aghili Yajadda, Mir Massoud

2014-10-01

308

Calculating electronic tunnel currents in networks of disordered irregularly shaped nanoparticles by mapping networks to arrays of parallel nonlinear resistors  

SciTech Connect

We have shown both theoretically and experimentally that tunnel currents in networks of disordered irregularly shaped nanoparticles (NPs) can be calculated by considering the networks as arrays of parallel nonlinear resistors. Each resistor is described by a one-dimensional or a two-dimensional array of equal size nanoparticles in which the tunnel junction gaps between nanoparticles are assumed to be equal. The number of tunnel junctions between two contact electrodes and the tunnel junction gaps between nanoparticles are found to be functions of Coulomb blockade energies. In addition, the tunnel barriers between nanoparticles were considered to be tilted at high voltages. Furthermore, the role of thermal expansion coefficient of the tunnel junction gaps on the tunnel current is taken into account. The model calculations fit very well to the experimental data of a network of disordered gold nanoparticles, a forest of multi-wall carbon nanotubes, and a network of few-layer graphene nanoplates over a wide temperature range (5-300 K) at low and high DC bias voltages (0.001 mV–50 V). Our investigations indicate that, although electron cotunneling in networks of disordered irregularly shaped NPs may occur, non-Arrhenius behavior at low temperatures cannot be described by the cotunneling model due to size distribution in the networks and irregular shape of nanoparticles. Non-Arrhenius behavior of the samples at zero bias voltage limit was attributed to the disorder in the samples. Unlike the electron cotunneling model, we found that the crossover from Arrhenius to non-Arrhenius behavior occurs at two temperatures, one at a high temperature and the other at a low temperature.

Aghili Yajadda, Mir Massoud [CSIRO Manufacturing Flagship, P.O. Box 218, Lindfield NSW 2070 (Australia)

2014-10-21

309

Electrically reconfigurable logic array  

NASA Technical Reports Server (NTRS)

To compose the complicated systems using algorithmically specialized logic circuits or processors, one solution is to perform relational computations such as union, division and intersection directly on hardware. These relations can be pipelined efficiently on a network of processors having an array configuration. These processors can be designed and implemented with a few simple cells. In order to determine the state-of-the-art in Electrically Reconfigurable Logic Array (ERLA), a survey of the available programmable logic array (PLA) and the logic circuit elements used in such arrays was conducted. Based on this survey some recommendations are made for ERLA devices.

Agarwal, R. K.

1982-01-01

310

SPECT reconstruction using a backpropagation neural network implemented on a massively parallel SIMD computer  

SciTech Connect

In this paper, the feasibility of reconstructing a single photon emission computed tomography (SPECT) image via the parallel implementation of a backpropagation neural network is shown. The MasPar, MP-1 is a single instruction multiple data (SIMD) massively parallel machine. It is composed of a 128 x 128 array of 4-bit processors. The neural network is distributed on the array by dedicating a processor to each node and each interconnection of the network. An 8 x 8 SPECT image slice section is projected into eight planes. It is shown that based on the projections, the neural network can produce the original SPECT slice image exactly. Likewise, when trained on two parallel slices, separated by one slice, the neural network is able to reproduce the center, untrained image to an RMS error of 0.001928.

Kerr, J.P.; Bartlett, E.B. [Iowa State Univ., Ames, IA (United States). Biomedical Engineering Program

1992-12-31

311

Architectures for a CORDIC SVD processor  

NASA Astrophysics Data System (ADS)

Architectures for systolic array processor elements for calculating the singular value decomposition (SVD) are proposed. These special purpose VLSI structures incorporate the coordinate rotation (CORDIC) algorithms to diagonalize 2×2 submatrices of a large array. The area-time complexity of the proposed architectures is analyzed along with topics related to a prototype implementation.

Cavallaro, Joseph R.; Luk, Franklin T.

1986-03-01
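
The CORDIC technique named in the record above performs plane rotations using only shifts, adds, and a small arctangent table. The sketch below shows the basic rotation mode under simplifying assumptions: floating-point multiplications by powers of two stand in for hardware shifts, and the function name `cordic_rotate` and its parameters are chosen for this example. It is not the SVD processor architecture proposed in the paper.

```python
import math


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate the vector (x, y) by `angle` radians with CORDIC micro-rotations.

    Converges for |angle| up to roughly 1.74 rad (the sum of atan(2**-i)).
    """
    # Each micro-rotation stretches the vector by sqrt(1 + 2**(-2i));
    # the accumulated gain is divided out once at the end.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        direction = 1.0 if z >= 0 else -1.0
        step = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        x, y = x - direction * y * step, y + direction * x * step
        z -= direction * math.atan(step)  # table lookup in hardware
    return x / gain, y / gain
```

Rotating `(1, 0)` by an angle yields its cosine and sine, which is a convenient way to check the routine against `math.cos` and `math.sin`.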

312

Implementing Access to Data Distributed on Many Processors  

NASA Technical Reports Server (NTRS)

A reference architecture is defined for an object-oriented implementation of domains, arrays, and distributions written in the programming language Chapel. This technology primarily addresses domains that contain arrays that have regular index sets with the low-level implementation details being beyond the scope of this discussion. What is defined is a complete set of object-oriented operators that allows one to perform data distributions for domain arrays involving regular arithmetic index sets. What is unique is that these operators allow for the arbitrary regions of the arrays to be fragmented and distributed across multiple processors with a single point of access giving the programmer the illusion that all the elements are collocated on a single processor. Today's massively parallel High Productivity Computing Systems (HPCS) are characterized by a modular structure, with a large number of processing and memory units connected by a high-speed network. Locality of access as well as load balancing are primary concerns in these systems that are typically used for high-performance scientific computation. Data distributions address these issues by providing a range of methods for spreading large data sets across the components of a system. Over the past two decades, many languages, systems, tools, and libraries have been developed for the support of distributions. Since the performance of data parallel applications is directly influenced by the distribution strategy, users often resort to low-level programming models that allow fine-tuning of the distribution aspects affecting performance, but, at the same time, are tedious and error-prone. This technology presents a reusable design of a data-distribution framework for data parallel high-performance applications. Distributions are a means to express locality in systems composed of large numbers of processor and memory components connected by a network. 
Since distributions have a great effect on the performance of applications, it is important that the distribution strategy is flexible, so its behavior can change depending on the needs of the application. At the same time, high productivity concerns require that the user be shielded from error-prone, tedious details such as communication and synchronization.

James, Mark

2006-01-01

313

Highly Parallel Computing Architectures by using Arrays of Quantum-dot Cellular Automata (QCA): Opportunities, Challenges, and Recent Results  

NASA Technical Reports Server (NTRS)

There has been significant improvement in the performance of VLSI devices, in terms of size, power consumption, and speed, in recent years and this trend may also continue for the near future. However, it is a well known fact that there are major obstacles, i.e., physical limitation of feature size reduction and ever increasing cost of foundry, that would prevent the long term continuation of this trend. This has motivated the exploration of some fundamentally new technologies that are not dependent on the conventional feature size approach. Such technologies are expected to enable scaling to continue to the ultimate level, i.e., molecular and atomistic size. Quantum computing, quantum dot-based computing, DNA based computing, biologically inspired computing, etc., are examples of such new technologies. In particular, quantum-dot based computing by using Quantum-dot Cellular Automata (QCA) has recently been intensely investigated as a promising new technology capable of offering significant improvement over conventional VLSI in terms of reduction of feature size (and hence increase in integration level), reduction of power consumption, and increase of switching speed. Quantum dot-based computing and memory in general, and QCA specifically, are intriguing to NASA due to their high packing density (10^11 - 10^12 per square cm), low power consumption (no transfer of current), and potentially higher radiation tolerance. Under Revolutionary Computing Technology (RTC) Program at the NASA/JPL Center for Integrated Space Microelectronics (CISM), we have been investigating the potential applications of QCA for the space program. To this end, exploiting the intrinsic features of QCA, we have designed novel QCA-based circuits for coplanar (i.e., single layer) and compact implementation of a class of data permutation matrices, a class of interconnection networks, and a bit-serial processor. Building upon these circuits, we have developed novel algorithms and QCA-based architectures for highly parallel and systolic computation of signal/image processing applications, such as FFT and Wavelet and Walsh-Hadamard Transforms.

Fijany, Amir; Toomarian, Benny N.

2000-01-01

314

Quantitative analysis of RNA-protein interactions on a massively parallel array reveals biophysical and evolutionary landscapes.  

PubMed

RNA-protein interactions drive fundamental biological processes and are targets for molecular engineering, yet quantitative and comprehensive understanding of the sequence determinants of affinity remains limited. Here we repurpose a high-throughput sequencing instrument to quantitatively measure binding and dissociation of a fluorescently labeled protein to >10^7 RNA targets generated on a flow cell surface by in situ transcription and intermolecular tethering of RNA to DNA. Studying the MS2 coat protein, we decompose the binding energy contributions from primary and secondary RNA structure, and observe that differences in affinity are often driven by sequence-specific changes in both association and dissociation rates. By analyzing the biophysical constraints and modeling mutational paths describing the molecular evolution of MS2 from low- to high-affinity hairpins, we quantify widespread molecular epistasis and a long-hypothesized, structure-dependent preference for G:U base pairs over C:A intermediates in evolutionary trajectories. Our results suggest that quantitative analysis of RNA on a massively parallel array (RNA-MaP) provides generalizable insight into the biophysical basis and evolutionary consequences of sequence-function relationships. PMID:24727714

Buenrostro, Jason D; Araya, Carlos L; Chircus, Lauren M; Layton, Curtis J; Chang, Howard Y; Snyder, Michael P; Greenleaf, William J

2014-06-01

315

Kanji Character Recognition Unit With Hand-Scanner Using SIMD Processor  

NASA Astrophysics Data System (ADS)

A prototype OCR is constructed using a very compact parallel processing unit. This unit, designed for interactive character recognition applications, is equipped with a hand-scanner for input and a personal computer for word and/or image processing. The heart of this unit is a bit-serial Single Instruction Multiple Data stream (SIMD) array processor constructed with four identical cellular array LSIs (AAP2). The processor is fully programmable and the complex process of Japanese character recognition can be carried out with a single program package. Its architecture permits flexible and high-speed SIMD operations to process bitline data such as local fields of scanned documents. The processor components were integrated into one board and confirmed to be more than ten times faster than present image processors of the same size through various image processing tests. High character recognition performance is obtained at a reading speed of 8 Japanese characters per second which is sufficient for hand-scanning data input operations. The recognition rate is higher than 98% for about 3,300 Japanese characters.

Kondo, Toshio; Tada, Shunkichi; Miyahara, Sueharu

1988-10-01

316

Architecture design of a FPGA-based wavefront processor for correlating a Shack-Hartmann sensor  

NASA Astrophysics Data System (ADS)

During solar observation, atmospheric turbulence usually blurs the solar image coming from the solar telescope. To improve the quality of the solar image, a solar Adaptive Optical (AO) system is used. In a typical solar AO system, a correlating Shack-Hartmann (SH) wavefront sensor is used to detect the aberration of the blurred image. To detect the aberration as well as possible, the frame rate of the CCD behind the SH sensor must be fast enough to keep pace with the variation of the turbulence. CCDs with a 1000 Hz frame rate are very common in solar adaptive optical systems. What is more, next-generation telescopes are so large that the resolution of the CCD becomes higher and higher. The wavefront processor therefore requires a huge amount of processing power. As FPGA (Field Programmable Gate Array) technology becomes more powerful, it can provide considerable processing ability through high-speed and parallel processing. This paper presents a design of an FPGA-based wavefront processor for a solar adaptive optical system. It is characterized by a pipelined and parallel architecture. The peak operation speed is over 86G/s and the calculation latency is 7.04 μs in a system with a 16×16 sub-aperture array, in which each sub-aperture is 16×16 pixels in size and the reference image is 8×8 pixels. Using this processor, the frame rate of the CCD can be up to 8800 fps. Built in a single FPGA, it is low-cost, compact and easy to upgrade.

Peng, Xiaofeng; Li, Mei; Rao, ChangHui

2008-12-01

317

Parallel algorithms and architecture for computation of manipulator forward dynamics  

NASA Technical Reports Server (NTRS)

Parallel computation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n^2), and the O(n^3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class of NC and that the time and processors bounds are of O(log2/2n) and O(n^4), respectively. However, the fastest stable parallel algorithms achieve the computation time of O(n) and can be derived by parallelization of the O(n^3) serial algorithms. Parallel computation of the O(n^3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders-of-magnitude in the computation of the forward dynamics.

Fijany, Amir; Bejczy, Antal K.

1989-01-01

318

Opto-electronic morphological processor  

NASA Technical Reports Server (NTRS)

The opto-electronic morphological processor of the present invention is capable of receiving optical inputs and emitting optical outputs. The use of optics allows implementation of parallel input/output, thereby overcoming a major bottleneck in prior art image processing systems. The processor consists of three components, namely, detectors, morphological operators and modulators. The detectors and operators are fabricated on a silicon VLSI chip and implement the optical input and morphological operations. A layer of ferro-electric liquid crystals is integrated with a silicon chip to provide the optical modulation. The implementation of the image processing operators in electronics leads to a wide range of applications and the use of optical connections allows cascadability of these parallel opto-electronic image processing components and high speed operation. Such an opto-electronic morphological processor may be used as the pre-processing stage in an image recognition system. In one example disclosed herein, the optical input/optical output morphological processor of the invention is interfaced with a binary phase-only correlator to produce an image recognition system.

Yu, Jeffrey W. (Inventor); Chao, Tien-Hsin (Inventor); Cheng, Li J. (Inventor); Psaltis, Demetri (Inventor)

1993-01-01
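
The morphological operators mentioned in the record above are typically built from two primitives, dilation and erosion. As a software illustration only (the patent describes an opto-electronic implementation, not code), here is a minimal sketch on binary images stored as lists of lists. The function names, the `CROSS` structuring element, and the choice to treat pixels outside the image as 0 are assumptions made for this example; the structuring element is assumed symmetric, so no reflection step is needed for dilation.

```python
# Offsets (row, column) of a symmetric plus-shaped structuring element.
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def _probe(image, r, c, dr, dc):
    """Return the pixel at (r+dr, c+dc), treating out-of-range pixels as 0."""
    rr, cc = r + dr, c + dc
    if 0 <= rr < len(image) and 0 <= cc < len(image[0]):
        return image[rr][cc]
    return 0


def dilate(image, offsets=CROSS):
    """Binary dilation: a pixel is set if ANY probed neighbour is set."""
    return [
        [int(any(_probe(image, r, c, dr, dc) for dr, dc in offsets))
         for c in range(len(image[0]))]
        for r in range(len(image))
    ]


def erode(image, offsets=CROSS):
    """Binary erosion: a pixel is set only if ALL probed neighbours are set."""
    return [
        [int(all(_probe(image, r, c, dr, dc) for dr, dc in offsets))
         for c in range(len(image[0]))]
        for r in range(len(image))
    ]
```

Every output pixel depends only on a small neighbourhood of the input, so all pixels can be computed independently, which is the property a parallel detector-operator-modulator array exploits.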

319

Parallel Optimisation  

NSDL National Science Digital Library

An introduction to optimisation techniques that may improve parallel performance and scaling on HECToR. It assumes that the reader has some experience of parallel programming including basic MPI and OpenMP. Scaling is a measurement of the ability for a parallel code to use increasing numbers of cores efficiently. A scalable application is one that, when the number of processors is increased, performs better by a factor which justifies the additional resource employed. Making a parallel application scale to many thousands of processes requires not only careful attention to the communication, data and work distribution but also to the choice of the algorithms to use. Since the choice of algorithm is too broad a subject and very particular to application domain to include in this brief guide we concentrate on general good practices towards parallel optimisation on HECToR.
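
The notion of scaling described in the record above is usually quantified with speedup and parallel efficiency. The following minimal sketch computes both from measured run times; the function name `scaling_table` and the sample timings in the usage note are hypothetical and not taken from the guide.

```python
def scaling_table(timings):
    """Compute speedup and efficiency from wall-clock timings.

    timings -- dict mapping core count to run time in seconds;
               it must contain an entry for 1 core as the baseline.
    Returns a list of (cores, speedup, efficiency) tuples sorted by cores.
    """
    baseline = timings[1]
    rows = []
    for cores in sorted(timings):
        speedup = baseline / timings[cores]  # how much faster than 1 core
        efficiency = speedup / cores         # fraction of ideal speedup
        rows.append((cores, speedup, efficiency))
    return rows
```

For example, with made-up timings `{1: 100.0, 2: 50.0, 4: 40.0}`, the 4-core run has a speedup of 2.5 and an efficiency of 0.625, a sign that the extra cores are no longer being used well.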

320

Final Report, Center for Programming Models for Scalable Parallel Computing: Co-Array Fortran, Grant Number DE-FC02-01ER25505  

SciTech Connect

The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a program to provide access to the co-array model through access to the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing.
A separate collaborative project with LANL and PNNL showed how to extend the co-array model to other languages in a small experimental version of Co-array Python. Another collaborative project defined a Fortran 95 interface to ARMCI to encourage Fortran programmers to use the one-sided communication model in anticipation of their conversion to the co-array model later. A collaborative project with the Earth Sciences community at NASA Goddard and GFDL experimented with the co-array model within computational kernels related to their climate models, first using CafLib and then extending the co-array model to use design patterns. Future work will build on the design-pattern idea with a redesign of CafLib as a true object-oriented library using Fortran 2003 and as a parallel numerical library using Fortran 2008.

Robert W. Numrich

2008-04-22

321

High performance parallel computers for science: New developments at the Fermilab advanced computer program  

SciTech Connect

Fermilab's Advanced Computer Program (ACP) has been developing highly cost-effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors, each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter- and intra-crate communication. The crates are connected in a hypercube. Site-oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop system is under construction. 10 refs., 7 figs.

Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.

1988-08-01

322

MicroPhotonic Reconfigurable RF Signal Processor  

Microsoft Academic Search

In this paper, we discuss the use of MicroPhotonic processors to control the optical power distribution in photonic signal processing structures, achieving adaptive photonic RF filtering with arbitrary transfer functions. A new MicroPhotonics-based photonic signal processing architecture is presented, in which fibre collimator arrays, Opto-VLSI processors, and a WDM combiner are integrated within an optical substrate to control the gains

Kamal E. Alameh; Selam T. Ahderom; Mehrdad Raisi; Rong Zheng; Kamran Eshraghian

2004-01-01

323

Generic implementations of parallel prefix sums and its applications  

E-print Network

synchronization as the number of processors increases. As part of the applications for parallel prefix sums, parallel radix sort and four parallel tree applications are built on top of the implementation. These applications are also fundamental parallel algorithms...
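
As an illustration of the prefix-sum primitive this record refers to, here is a minimal sketch of a work-efficient exclusive scan in the classic two-phase (up-sweep, down-sweep) style. It is not the thesis implementation: the function name is hypothetical, and the code runs sequentially, simulating steps that a parallel version would distribute across processors.

```python
def exclusive_scan(values):
    """Exclusive prefix sum using the up-sweep / down-sweep scheme.

    Within each stride, the inner-loop iterations touch disjoint
    elements, so a parallel implementation could run them concurrently
    with one synchronization per stride.
    """
    n = len(values)
    if n == 0:
        return []
    # Pad to the next power of two so the implicit tree is complete.
    size = 1
    while size < n:
        size *= 2
    tree = list(values) + [0] * (size - n)

    # Up-sweep: build partial sums bottom-up.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            tree[i] += tree[i - stride]
        stride *= 2

    # Down-sweep: push prefixes back down the tree.
    tree[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = tree[i - stride]
            tree[i - stride] = tree[i]
            tree[i] += left
        stride //= 2
    return tree[:n]
```

The number of strides grows logarithmically with the input length, which is why synchronization cost matters as the processor count increases.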

Huang, Tao

2009-05-15

324

MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY  

SciTech Connect

High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today.
The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125MHz. At each clock cycle, 128K multiply-and-add operations per second (OPS) are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at the ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.
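
The peak-performance figure quoted for the EnLight core can be checked with simple arithmetic. The sketch below assumes that one 256x256 matrix-vector product completes per clock cycle and that each multiply-and-add counts as two operations; both are interpretations of the abstract rather than statements from the vendor, and the function name is hypothetical.

```python
def peak_ops_per_second(matrix_dim, clock_hz, ops_per_mac=2):
    """Peak operation rate of a matrix-vector engine.

    Assumes (hypothetically) one full matrix_dim x matrix_dim
    matrix-vector product per clock cycle.
    """
    macs_per_cycle = matrix_dim * matrix_dim      # 256 * 256 = 65,536
    ops_per_cycle = ops_per_mac * macs_per_cycle  # 131,072 = 128K
    return ops_per_cycle * clock_hz


peak = peak_ops_per_second(256, 125_000_000)  # 16,384,000,000,000 ops/s
```

Under these assumptions the result is about 16.4 x 10^12 operations per second, consistent with the 128K operations per cycle and the 16 TeraOPS peak stated in the abstract.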

Barhen, Jacob [ORNL; Kerekes, Ryan A [ORNL; ST Charles, Jesse Lee [ORNL; Buckner, Mark A [ORNL

2008-01-01

325

A VLSI design concept for parallel iterative algorithms  

NASA Astrophysics Data System (ADS)

Modern VLSI manufacturing technology has been shrinking rapidly toward the nanoscale. Integration with advanced nanotechnology now makes it possible to realize advanced parallel iterative algorithms directly, which was almost impossible 10 years ago. In this paper, we discuss the influence of evolving VLSI technologies on iterative algorithms and present design strategies from an algorithmic and architectural point of view. When an iterative algorithm is implemented on a multiprocessor array, there is a trade-off between the performance/complexity of the processors and the load/throughput of the interconnects. This is due to the behavior of iterative algorithms. For example, we could simplify the parallel implementation of the iterative algorithm (i.e., the processor elements of the multiprocessor array) in any way as long as convergence is guaranteed. However, modifying the algorithm (processors) usually increases the number of required iterations, which also means that the switching activity of the interconnects increases. As an example, we show that a 25×25 full Jacobi EVD array can be realized in a single FPGA device with the simplified μ-rotation CORDIC architecture.

Sun, C. C.; Götze, J.

2009-05-01

326

Parallel processing ITS  

SciTech Connect

This report provides a user's guide for parallel processing ITS on a UNIX workstation network, a shared-memory multiprocessor or a massively-parallel processor. The parallelized version of ITS is based on a master/slave model with message passing. Parallel issues such as random number generation, load balancing, and communication software are briefly discussed. Timing results for example problems are presented for demonstration purposes.

Fan, W.C.; Halbleib, J.A. Sr.

1996-09-01

327

Efficient design space exploration of high performance embedded out-of-order processors  

Microsoft Academic Search

Previous work on efficient customized processor design primarily focused on in-order architectures. However, with the recent introduction of out-of-order processors for high-end high-performance embedded applications, researchers and designers need to address how to automate the design process of customized out-of-order processors. Because of the parallel execution of independent instructions in out-of-order processors, in-order processor design methodologies which

Stijn Eyerman; Lieven Eeckhout; Koen De Bosschere

2006-01-01

328

Parallel I/O Systems  

NSDL National Science Digital Library

* Redundant disk array architectures
* Fault tolerance issues in parallel I/O systems
* Caching and prefetching
* Parallel file systems
* Parallel I/O systems
* Parallel I/O programming paradigms
* Parallel I/O applications and environments
* Parallel programming with parallel I/O

Amy Apon

329

Spatio-temporal operator formalism for holographic recording and diffraction in a photorefractive-based true-time-delay phased-array processor  

NASA Astrophysics Data System (ADS)

We present a spatio-temporal operator formalism and beam propagation simulations that describe the broadband efficient adaptive method for a true-time-delay array processing (BEAMTAP) algorithm for an optical beamformer by use of a photorefractive crystal. The optical system consists of a tapped-delay line implemented with an acoustooptic Bragg cell, an accumulating scrolling time-delay detector achieved with a traveling-fringes detector, and a photorefractive crystal to store the adaptive spatio-temporal weights as volume holographic gratings. In this analysis, linear shift-invariant integral operators are used to describe the propagation, interference, grating accumulation, and volume holographic diffraction of the spatio-temporally modulated optical fields in the system to compute the adaptive array processing operation. In addition, it is shown that the random fluctuation in time and phase delays of the optically modulated and transmitted array signals produced by fiber perturbations (temperature fluctuations, vibrations, or bending) are dynamically compensated for through the process of holographic wavefront reconstruction as a byproduct of the adaptive beam-forming and jammer-excision operation. The complexity of the cascaded spatial-temporal integrals describing the holographic formation, and subsequent readout processes, is shown to collapse to a simple imaging condition through standard operator manipulation. We also present spatio-temporal beam propagation simulation results as an illustrative demonstration of our analysis and the operation of a BEAMTAP beamformer.

Kiruluta, Andrew; Pati, Gour S.; Kriehn, Gregory; Silveira, Paulo E. X.; Sarto, Anthony W.; Wagner, Kelvin

2003-09-01

330

Spatio-temporal operator formalism for holographic recording and diffraction in a photorefractive-based true-time-delay phased-array processor.  

PubMed

We present a spatio-temporal operator formalism and beam propagation simulations that describe the broadband efficient adaptive method for a true-time-delay array processing (BEAMTAP) algorithm for an optical beamformer by use of a photorefractive crystal. The optical system consists of a tapped-delay line implemented with an acoustooptic Bragg cell, an accumulating scrolling time-delay detector achieved with a traveling-fringes detector, and a photorefractive crystal to store the adaptive spatio-temporal weights as volume holographic gratings. In this analysis, linear shift-invariant integral operators are used to describe the propagation, interference, grating accumulation, and volume holographic diffraction of the spatio-temporally modulated optical fields in the system to compute the adaptive array processing operation. In addition, it is shown that the random fluctuation in time and phase delays of the optically modulated and transmitted array signals produced by fiber perturbations (temperature fluctuations, vibrations, or bending) are dynamically compensated for through the process of holographic wavefront reconstruction as a byproduct of the adaptive beam-forming and jammer-excision operation. The complexity of the cascaded spatial-temporal integrals describing the holographic formation, and subsequent readout processes, is shown to collapse to a simple imaging condition through standard operator manipulation. We also present spatio-temporal beam propagation simulation results as an illustrative demonstration of our analysis and the operation of a BEAMTAP beamformer. PMID:14503701

Kiruluta, Andrew; Pati, Gour S; Kriehn, Gregory; Silveira, Paulo E X; Sarto, Anthony W; Wagner, Kelvin

2003-09-10

331

Optimal processor assignment for pipeline computations  

NASA Technical Reports Server (NTRS)

The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual response times for different processor sizes, find an assignment of processors to tasks. Two objectives are of interest: minimal response time given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p processor system and a series parallel precedence graph with n constituent tasks, an O(np^2) algorithm is provided that finds the optimal assignment for the response time optimization problem; the assignment optimizing the constrained throughput is found in O(np^2 log p) time. Special cases of linear, independent, and tree graphs are also considered.
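
To make the response-time problem concrete, the following is a minimal dynamic-programming sketch for the simplest special case, a linear chain of tasks. It is an illustration written for this listing, not the algorithm from the report. It assumes that `times[i][q-1]` holds the measured response time of task `i` on `q` processors (each row has at least `p` entries), that the throughput requirement translates into an upper bound `max_stage_time` on every stage, and that the pipeline response time is the sum of the stage times. All names are hypothetical.

```python
import math


def assign_processors(times, p, max_stage_time):
    """Minimise total response time of a linear pipeline on at most p processors.

    Returns (total_response_time, allocation) or None if no assignment
    meets the per-stage bound. Runs in O(n * p^2) time.
    """
    n = len(times)
    # best[i][k]: minimal total time of the first i tasks using exactly k processors.
    best = [[math.inf] * (p + 1) for _ in range(n + 1)]
    choice = [[0] * (p + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for k in range(i, p + 1):
            # Leave at least one processor for each of the i - 1 earlier tasks.
            for q in range(1, k - (i - 1) + 1):
                t = times[i - 1][q - 1]
                if t > max_stage_time:
                    continue  # this stage would violate the throughput bound
                candidate = best[i - 1][k - q] + t
                if candidate < best[i][k]:
                    best[i][k] = candidate
                    choice[i][k] = q
    k_best = min(range(p + 1), key=lambda k: best[n][k])
    if best[n][k_best] == math.inf:
        return None
    # Walk the recorded choices backwards to recover the allocation.
    allocation = []
    k = k_best
    for i in range(n, 0, -1):
        q = choice[i][k]
        allocation.append(q)
        k -= q
    allocation.reverse()
    return best[n][k_best], allocation
```

The triple loop over tasks, processor budgets and per-task allocations is what gives the O(np^2) cost in this simplified setting.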

Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath

1991-01-01

332

A Pseudo Network Approach to Inter-processor Communication on a Shared-memory Multi-processor MacELIS  

Microsoft Academic Search

MacELIS is a workbench for experimental distributed parallel list processing under development at NTT ECL (Electrical Communications Laboratories). It provides a coherent processor abstraction mechanism by a Pseudo Network Model. In this model, the shared-memory in MacELIS appears as a network medium and the same network access methods can be used for inter-processor communication between any processors independently of location

Ken-ichiro Murakami

1989-01-01

333

Broadcasting collective operation contributions throughout a parallel computer  

DOEpatents

Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

Faraj, Ahmad (Rochester, MN)

2012-02-21

334

Customization of application specific heterogeneous multi-pipeline processors  

Microsoft Academic Search

In this paper we propose Application Specific Instruction Set Processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the ap-

Swarnalatha Radhakrishnan; Hui Guo; Sri Parameswaran

2006-01-01

335

Rapid geodesic mapping of brain functional connectivity: implementation of a dedicated co-processor in a field-programmable gate array (FPGA) and application to resting state functional MRI.  

PubMed

Graph theory-based analyses of brain network topology can be used to model the spatiotemporal correlations in neural activity detected through fMRI, and such approaches have wide-ranging potential, from detection of alterations in preclinical Alzheimer's disease through to command identification in brain-machine interfaces. However, due to prohibitive computational costs, graph-based analyses to date have principally focused on measuring connection density rather than mapping the topological architecture in full by exhaustive shortest-path determination. This paper outlines a solution to this problem through parallel implementation of Dijkstra's algorithm in programmable logic. The processor design is optimized for large, sparse graphs and provided in full as synthesizable VHDL code. An acceleration factor between 15 and 18 is obtained on a representative resting-state fMRI dataset, and maps of Euclidean path length reveal the anticipated heterogeneous cortical involvement in long-range integrative processing. These results enable high-resolution geodesic connectivity mapping for resting-state fMRI in patient populations and real-time geodesic mapping to support identification of imagined actions for fMRI-based brain-machine interfaces. PMID:23746911
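
The exhaustive shortest-path computation described in this abstract can be sketched in software as one run of Dijkstra's algorithm per source node. The version below is a plain heap-based implementation for sparse graphs, not the VHDL design from the paper; the adjacency-list format and the function names are assumptions made for illustration.

```python
import heapq


def dijkstra(adjacency, source):
    """Shortest path lengths from source in a graph with non-negative weights.

    adjacency maps each node to a list of (neighbour, weight) pairs.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in adjacency.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def all_pairs_shortest_paths(adjacency):
    """Run Dijkstra from every node; the runs are independent of each other."""
    return {u: dijkstra(adjacency, u) for u in adjacency}
```

Because each source is processed independently, the outer loop is the natural place to introduce parallelism, whether across threads, processes or hardware units.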

Minati, Ludovico; Cercignani, Mara; Chan, Dennis

2013-10-01

336

Transitive closure on the imagine stream processor  

SciTech Connect

The increasing gap between processor and memory speeds is a well-known problem in modern computer architecture. The Imagine system is designed to address the processor-memory gap through streaming technology. Stream processors are best-suited for computationally intensive applications characterized by high data parallelism and producer-consumer locality with minimal data dependencies. This work examines an efficient streaming implementation of the computationally intensive Transitive Closure (TC) algorithm on the Imagine platform. We develop a tiled TC algorithm specifically for the Imagine environment, which efficiently reuses streams to minimize expensive off-chip data transfers. The implementation requires complex stream programming since the memory hierarchy and cluster organization of the underlying architecture are exposed to the Imagine programmer. Results demonstrate that limited performance of TC is achieved primarily due to the complicated data-dependencies of the blocked algorithm. This work is an ongoing effort to identify classes of scientific problems well-suited for streaming processors.
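
For reference, the untiled baseline that a tiled Transitive Closure implementation builds on is Warshall's algorithm. The sketch below shows only that baseline on a boolean adjacency matrix; it does not reproduce the tiled, stream-based version described in the report, and the function name is hypothetical.

```python
def transitive_closure(adj):
    """Warshall's algorithm on a boolean adjacency matrix (list of lists).

    reach[i][j] ends up True when j is reachable from i by a path of
    one or more edges. The update for pivot k reads row k, which is the
    data dependency that complicates blocked or streamed variants.
    """
    n = len(adj)
    reach = [row[:] for row in adj]  # work on a copy of the input
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach
```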

Griem, Gorden; Oliker, Leonid

2003-11-11

337

A novel picoliter droplet array for parallel real-time polymerase chain reaction based on double-inkjet printing.  

PubMed

We developed and characterized a novel picoliter droplet-in-oil array generated by a double-inkjet printing method on a uniform hydrophobic silicon chip specifically designed for quantitative polymerase chain reaction (qPCR) analysis. Double-inkjet printing was proposed to efficiently address the evaporation issues of picoliter droplets during array generation on a planar substrate without the assistance of a humidifier or glycerol. The method utilizes piezoelectric inkjet printing equipment to precisely eject a reagent droplet into an oil droplet, which had first been dispensed on a hydrophobic and oleophobic substrate. No evaporation, random movement, or cross-contamination was observed during array fabrication and thermal cycling. We demonstrated the feasibility and effectiveness of this novel double-inkjet method for real-time PCR analysis. This method can readily produce multivolume droplet-in-oil arrays with volume variations ranging from picoliters to nanoliters. This feature would be useful for simultaneous multivolume PCR experiments aimed at wide and tunable dynamic ranges. These double-inkjet-based picoliter droplet arrays may have potential for multiplexed applications that require isolated containers for single-cell cultures, single molecular enzymatic assays, or digital PCR and provide an alternative option for generating droplet arrays on planar substrates without chemical patterning. PMID:25070461

Sun, Yingnan; Zhou, Xiaoguang; Yu, Yude

2014-09-21

338

Parallel processing of natural language  

SciTech Connect

Two types of parallel natural language processing are studied in this work: (1) the parallelism between syntactic and nonsyntactic processing and (2) the parallelism within syntactic processing. It is recognized that a syntactic category can potentially be attached to more than one node in the syntactic tree of a sentence. Even if all the attachments are syntactically well-formed, nonsyntactic factors such as semantic and pragmatic consideration may require one particular attachment. Syntactic processing must synchronize and communicate with nonsyntactic processing. Two syntactic processing algorithms are proposed for use in a parallel environment: Earley's algorithm and the LR(k) algorithm. Conditions are identified to detect the syntactic ambiguity and the algorithms are augmented accordingly. It is shown that by using nonsyntactic information during syntactic processing, backtracking can be reduced, and the performance of the syntactic processor is improved. For the second type of parallelism, it is recognized that one portion of a grammar can be isolated from the rest of the grammar and be processed by a separate processor. A partial grammar of a larger grammar is defined. Parallel syntactic processing is achieved by using two processors concurrently: the main processor (mp) and the auxiliary processor (ap).

Chang, H.O.

1986-01-01

339

A Double Precision High Speed Convolution Processor  

NASA Astrophysics Data System (ADS)

There exist several convolution processors on the market that can process images at video rate. However, none of these processors operates in floating point arithmetic. Unfortunately, many image processing algorithms presently under development are inoperable in integer arithmetic, forcing the researchers to use regular computers. To solve this problem, we designed a specialized convolution processor that operates in double precision floating point arithmetic with a throughput several thousand times faster than the one obtained on a regular computer. Its high performance is attributed to a VLSI double precision convolution systolic cell designed in our laboratories. A 9X9 systolic array carries out, in a pipelined manner, every arithmetic operation. The processor is designed to interface directly with the VME Bus. A DMA chip is responsible for bringing the original pixel intensities from the memory of the computer to the systolic array and for returning the convolved pixels back to memory. A special use of 8K RAMs allows an inexpensive and efficient way of delaying the pixel intensities in order to supply the right sequence to the systolic array. On-board circuitry converts pixel values into floating point representation when the image is originally represented with integer values. An additional systolic cell, used as a pipeline adder at the output of the systolic array, offers the possibility of combining images together, which allows a variable convolution window size and color image processing.

Larochelle, F.; Coté, J. F.; Malowany, A. S.

1989-11-01

340

How scaling will change processor architecture  

Microsoft Academic Search

For the past 30 years processors have hidden scaling from the programmer, presenting the same sequential computational interface. Power and wire scaling issues are causing this interface to change, exposing more parallelism. For efficiency, future machines must be distributed and heterogeneous and will add at least a

Mark Horowitz; William Dally

2004-01-01

341

Relational query coprocessing on graphics processors  

Microsoft Academic Search

Graphics processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power as well as memory bandwidth. Moreover, new-generation GPUs allow writes to random memory locations, provide efficient interprocessor communication through on-chip local memory, and support a general purpose parallel programming model. Nevertheless, many of the

Bingsheng He; Mian Lu; Ke Yang; Rui Fang; Naga K. Govindaraju; Qiong Luo; Pedro V. Sander

2009-01-01

342

Copy Propagation Optimizations for VLIW DSP Processors with Distributed Register Files  

E-print Network

that a naive deployment of copy propagations in embedded VLIW DSP processors with distributed register files embedded VLIW DSP processors with distributed files by taking commu- nication costs into account. Parallel Architecture Core (PAC) is a 5-way VLIW DSP processors with distributed register cluster files

Lee, Jenq-Kuen

343

An implementation of scoreboarding mechanism for ARM-based SMT processor  

Microsoft Academic Search

An SMT architecture uses TLP (Thread Level Parallelism) and increases processor throughput, such that issue slots can be filled with instructions from multiple independent threads. Having multiple ready threads reduces the probability that a functional unit is left idle, which increases processor efficiency. To utilize those advantages for the SMT processor, the issue unit must control the flow of instructions

Chang-Yong Heo; Kyu-Baik Choi; In-Pyo Hong; Yong-Surk Lee

2003-01-01

344

Scheduling on the MasPar SIMD parallel computer  

E-print Network

version was faster at labeling tasks in the task graph, but the parallel version could not place the tasks onto processors in parallel. Nevertheless, the highest level first scheduling algorithm was acceptable for use in parallel operating systems....

Perkins, Keith Douglas

1995-01-01

345

Sandia secure processor : a native Java processor.  

SciTech Connect

The Sandia Secure Processor (SSP) is a new native Java processor that has been specifically designed for embedded applications. The SSP's design is a system composed of a core Java processor that directly executes Java bytecodes, on-chip intelligent IO modules, and a suite of software tools for simulation and compiling executable binary files. The SSP is unique in that it provides a way to control real-time IO modules for embedded applications. The system software for the SSP is a 'class loader' that takes Java .class files (created with your favorite Java compiler), links them together, and compiles a binary. The complete SSP system provides very powerful functionality with very light hardware requirements with the potential to be used in a wide variety of small-system embedded applications. This paper gives a detailed description of the Sandia Secure Processor and its unique features.

Wickstrom, Gregory Lloyd; Gale, Jason Carl; Ma, Kwok Kee

2003-08-01

346

SUDS : automatic parallelization for raw processors  

E-print Network

A computer can never be too fast or too cheap. Computer systems pervade nearly every aspect of science, engineering, communications and commerce because they perform certain tasks at rates unachievable by any other kind ...

Frank, Matthew I

2003-01-01

347

Primitive operations for a hierarchical parallel processor  

SciTech Connect

Pyramid data structures make some image processing operations easier to compute. This paper discusses the programming strategies for pyramid machines with a view toward a set of primitive operations. One such set of operations is described. 8 references.

Tanimoto, S.L.

1982-01-01

348

Parallel Earley's parser and its application to syntactic image analysis  

SciTech Connect

A complete Earley parser which includes recognition and parse extraction has been implemented on a triangular array of processors. The detailed analysis of the complete parser is given. The recognition algorithm is executed in parallel by adopting a new operator, x*, and restricting the input context-free grammar to be lambda-free. The parse extraction algorithm which follows recognition uses a nonrecursive subroutine to generate the correct right-parse in parallel. A special busing arrangement within this array enables the right data to reach the right place at the right time. Simulation examples are provided. The results show that when a string of length n is under test, at the system time 2n + 1, the correct right-parse will be obtained if the string is accepted. 15 references.

Chiang, Y.P.; Fu, K.S.

1983-01-01
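
The recognition step described in the abstract above can be illustrated with a small sequential sketch. This is not the parallel triangular-array algorithm of the paper and it omits parse extraction; it is a textbook-style Earley recognizer restricted to lambda-free grammars. The function name `earley_recognize` and the dictionary grammar encoding are assumptions made for illustration.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` is derivable from `start`.

    grammar: dict mapping each nonterminal to a list of right-hand sides,
             each a non-empty tuple of symbols (lambda-free by assumption).
    An Earley item is (lhs, rhs, dot, origin).
    """
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:
                    # Predictor: expand the nonterminal after the dot.
                    for prod in grammar[sym]:
                        item = (sym, prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
                elif i < n and tokens[i] == sym:
                    # Scanner: the terminal after the dot matches the input.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Completer: advance every item that was waiting for `lhs`.
                # With a lambda-free grammar origin < i, so chart[origin]
                # is already final when it is read here.
                for plhs, prhs, pdot, porigin in list(chart[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        item = (plhs, prhs, pdot + 1, porigin)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[n]
    )


# Hypothetical toy grammar: S -> S + T | T ; T -> a
TOY_GRAMMAR = {"S": [("S", "+", "T"), ("T",)], "T": [("a",)]}
```

The lambda-free restriction matters even in this sequential form: because no item can complete without consuming at least one token, the completer never has to revisit a chart column that is still being built.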

349

Data parallel sequential circuit fault simulation  

Microsoft Academic Search

Sequential circuit fault simulation is a compute-intensive problem. Parallel simulation is one method to reduce fault simulation time. In this paper, we discuss a novel technique to partition the fault set for the fault parallel simulation of sequential circuits on multiple processors. When applied statically, the technique can scale well for up to thirty-two processors on an Ethernet. The

Minesh B. Amin; Bapiraju Vinnakota

1996-01-01

350

Switch for serial or parallel communication networks  

DOEpatents

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.

Crosette, Dario B. (DeSoto, TX)

1994-01-01

351

Switch for serial or parallel communication networks  

DOEpatents

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.

Crosette, D.B.

1994-07-19

352

Processor equivalence for daisy chain load sharing processors  

Microsoft Academic Search

A linear daisy chain of processors in which processor load is divisible and shared among the processors is examined. It is shown that two or more processors can be collapsed into a single equivalent processor. This equivalence allows a characterization of the nature of the minimal time solution, a simple method to determine when to distribute load for linear daisy

THOMAS G. ROBERTAZZI

1993-01-01

353

RISC Processors and High Performance Computing  

NASA Technical Reports Server (NTRS)

This tutorial will discuss the top five RISC microprocessors and the parallel systems in which they are used. It will provide a unique cross-machine comparison not available elsewhere. The effective performance of these processors will be compared by citing standard benchmarks in the context of real applications. The latest NAS Parallel Benchmarks, both absolute performance and performance per dollar, will be listed. The next generation of the NPB will be described. The tutorial will conclude with a discussion of future directions in the field. Technology Transfer Considerations: All of these computer systems are commercially available internationally. Information about these processors is available in the public domain, mostly from the vendors themselves. The NAS Parallel Benchmarks and their results have been previously approved numerous times for public release, beginning back in 1991.

Bailey, David H.; Saini, Subhash; Craw, James M. (Technical Monitor)

1995-01-01

354

An Efficient Power Estimation Methodology for Complex RISC Processor-based Platforms  

E-print Network

is validated through ARM9 and ARM CortexA8 processor designed respectively around the OMAP5912 and OMAP3530- vide less than 3% of error for ARM940T processor, 3.5% for ARM CortexA8 processor-based system and 1x architecture for ARM CortexA8 processor as a promising solution to deal with the potential parallelism in

Paris-Sud XI, Université de

355

Gang scheduling a parallel machine  

SciTech Connect

Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processors. User programs and their gangs of processors are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quanta are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128-processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory. 2 refs., 1 fig.

Gorda, B.C.; Brooks, E.D. III.

1991-03-01
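
The sleep/awaken cycle described in the abstract above can be sketched as a toy simulation. Everything here is an assumption made for illustration: the `Gang` record, the rule that the quantum scales linearly with priority, and the simplification that one gang owns the machine at a time. The report's fair share accounting is not modeled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Gang:
    name: str
    processors: int     # processors the program needs simultaneously
    remaining: float    # seconds of work still to run
    priority: int = 1   # hypothetical: a larger value earns a longer quantum


def gang_schedule(gangs, base_quantum=1.0, machine_size=128):
    """Round-robin over gangs; return a list of (gang name, seconds run)."""
    for gang in gangs:
        if gang.processors > machine_size:
            raise ValueError(f"{gang.name} does not fit on the machine")

    queue = deque(gangs)
    timeline = []
    while queue:
        gang = queue.popleft()                    # awaken the whole gang at once
        quantum = base_quantum * gang.priority    # assumed priority rule
        ran = min(quantum, gang.remaining)
        gang.remaining -= ran
        timeline.append((gang.name, ran))
        if gang.remaining > 0:
            queue.append(gang)                    # put the gang back to sleep
    return timeline
```

The point of the sketch is only that all processors of a program are scheduled and descheduled together, so a program never runs with part of its gang asleep.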

356

System and method for representing and manipulating three-dimensional objects on massively parallel architectures  

DOEpatents

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.

Karasick, M.S.; Strip, D.R.

1996-01-30

357

System and method for representing and manipulating three-dimensional objects on massively parallel architectures  

DOEpatents

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.

Karasick, Michael S. (Ridgefield, CT); Strip, David R. (Albuquerque, NM)

1996-01-01

358

Dynamic parallel complexity of computational circuits  

Microsoft Academic Search

The dynamic parallel complexity of general computational circuits (defined in introduction) is discussed. We exhibit some relationships between parallel circuit evaluation and some uniform closure properties of a certain class of unary functions and present a systematic method for the design of processor efficient parallel algorithms for circuit evaluation. Using this method: (1) we improve the algorithm for parallel Boolean

Gary L. Miller; Shang-Hua Teng

1987-01-01

359

Development of a prototype PET scanner with depth-of-interaction measurement using solid-state photomultiplier arrays and parallel readout electronics  

PubMed Central

In this study, we developed a prototype animal PET by applying several novel technologies to use the solid-state photomultiplier (SSPM) arrays for measuring the depth-of-interaction (DOI) and improving imaging performance. Each PET detector has an 8×8 array of about 1.9×1.9×30.0 mm³ lutetium-yttrium-oxyorthosilicate (LYSO) scintillators, with each end optically connected to a SSPM array (16-channel in a 4×4 matrix) through a light guide to enable continuous DOI measurement. Each SSPM has an active area of about 3×3 mm², and its output is read by a custom-developed application-specific-integrated-circuit (ASIC) to directly convert analog signals to digital timing pulses that encode the interaction information. These pulses are transferred to and decoded by a field-programmable-gate-array (FPGA) based time-to-digital convertor for coincident event selection and data acquisition. The independent readout of each SSPM and the parallel signal process can significantly improve the signal-to-noise ratio and enable using flexible algorithms for different data processes. The prototype PET consists of two rotating detector panels on a portable gantry with four detectors in each panel to provide 16 mm axial and variable transaxial field-of-view (FOV) sizes. List-mode ordered-subset-expectation-maximization image reconstruction was implemented. The measured mean energy, coincidence timing, and DOI resolution for a crystal were about 17.6%, 2.8 ns, and 5.6 mm, respectively. The measured transaxial resolutions at the center of the FOV were 2.0 mm and 2.3 mm for images reconstructed with and without DOI, respectively. In addition, the resolutions across the FOV with DOI were substantially better than those without DOI. The quality of PET images of both a hot-rod phantom and mouse acquired with DOI was much higher than that of images obtained without DOI.
This study demonstrates that SSPM arrays and advanced readout/processing electronics can be used to develop a practical DOI-measurable PET scanner. PMID:24556629

Shao, Yiping; Sun, Xishan; Lan, Kejian A.; Bircher, Chad; Lou, Kai; Deng, Zhi

2014-01-01

360

Development of a prototype PET scanner with depth-of-interaction measurement using solid-state photomultiplier arrays and parallel readout electronics  

NASA Astrophysics Data System (ADS)

In this study, we developed a prototype animal PET by applying several novel technologies to use solid-state photomultiplier (SSPM) arrays to measure the depth of interaction (DOI) and improve imaging performance. Each PET detector has an 8 × 8 array of about 1.9 × 1.9 × 30.0 mm³ lutetium-yttrium-oxyorthosilicate scintillators, with each end optically connected to an SSPM array (16 channels in a 4 × 4 matrix) through a light guide to enable continuous DOI measurement. Each SSPM has an active area of about 3 × 3 mm², and its output is read by a custom-developed application-specific integrated circuit to directly convert analogue signals to digital timing pulses that encode the interaction information. These pulses are transferred to and are decoded by a field-programmable gate array-based time-to-digital convertor for coincident event selection and data acquisition. The independent readout of each SSPM and the parallel signal process can significantly improve the signal-to-noise ratio and enable the use of flexible algorithms for different data processes. The prototype PET consists of two rotating detector panels on a portable gantry with four detectors in each panel to provide 16 mm axial and variable transaxial field-of-view (FOV) sizes. List-mode ordered subset expectation maximization image reconstruction was implemented. The measured mean energy, coincidence timing and DOI resolution for a crystal were about 17.6%, 2.8 ns and 5.6 mm, respectively. The measured transaxial resolutions at the center of the FOV were 2.0 mm and 2.3 mm for images reconstructed with and without DOI, respectively. In addition, the resolutions across the FOV with DOI were substantially better than those without DOI. The quality of PET images of both a hot-rod phantom and mouse acquired with DOI was much higher than that of images obtained without DOI.
This study demonstrates that SSPM arrays and advanced readout/processing electronics can be used to develop a practical DOI-measurable PET scanner.

Shao, Yiping; Sun, Xishan; Lan, Kejian A.; Bircher, Chad; Lou, Kai; Deng, Zhi

2014-03-01

361

Rapid, single-molecule assays in nano/micro-fluidic chips with arrays of closely spaced parallel channels fabricated by femtosecond laser machining.  

PubMed

Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. To this end, we are developing an ultrasensitive, fluorescence-based platform that incorporates a nano/micro-fluidic chip with an array of closely spaced channels for parallelized optical readout of single-molecule assays. Here we describe the use of direct femtosecond laser machining to fabricate several hundred closely spaced channels on the surfaces of fused silica substrates. The channels are sealed by bonding to a microscope cover slip spin-coated with a thin film of poly(dimethylsiloxane). Single-molecule detection experiments are conducted using a custom-built, wide-field microscope. The array of channels is epi-illuminated by a line-generating red diode laser, resulting in a line focus just a few microns thick across a 500 micron field of view. A dilute aqueous solution of fluorescently labeled biomolecules is loaded into the device and fluorescence is detected with an electron-multiplying CCD camera, allowing acquisition rates up to 7 kHz for each microchannel. Matched digital filtering based on experimental parameters is used to perform an initial, rapid assessment of detected fluorescence. More detailed analysis is obtained through fluorescence correlation spectroscopy. Simulated fluorescence data is shown to agree well with experimental values. PMID:25140634

Canfield, Brian K; King, Jason K; Robinson, William N; Hofmeister, William H; Davis, Lloyd M

2014-01-01

362

Radiofrequency current source (RFCS) drive and decoupling technique for parallel transmit arrays using a high-power metal oxide semiconductor field-effect transistor (MOSFET).  

PubMed

A radiofrequency current source (RFCS) design using a high-power metal oxide semiconductor field effect transistor (MOSFET) that enables independent current control for parallel transmit applications is presented. The design of an RFCS integrated with a series tuned transmitting loop and its associated control circuitry is described. The current source is operated in a gated class AB push-pull configuration for linear operation at high efficiency. The pulsed RF current amplitude driven into the low impedance transmitting loop was found to be relatively insensitive to the various loaded loop impedances ranging from 0.4 to 10.3 ohms, confirming current mode operation. The suppression of current induced by a neighboring loop was quantified as a function of center-to-center loop distance, and was measured to be 17 dB for nonoverlapping, adjacent loops. Deterministic manipulation of the B1 field pattern was demonstrated by the independent control of RF phase and amplitude in a head-sized two-channel volume transmit array. It was found that a high-voltage rated RF power MOSFET with a minimum load resistance exhibits current source behavior, which aids in transmit array design. PMID:19353658

Lee, Wonje; Boskamp, Eddy; Grist, Thomas; Kurpad, Krishna

2009-07-01

363

Rapid, Single-Molecule Assays in Nano/Micro-Fluidic Chips with Arrays of Closely Spaced Parallel Channels Fabricated by Femtosecond Laser Machining  

PubMed Central

Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. To this end, we are developing an ultrasensitive, fluorescence-based platform that incorporates a nano/micro-fluidic chip with an array of closely spaced channels for parallelized optical readout of single-molecule assays. Here we describe the use of direct femtosecond laser machining to fabricate several hundred closely spaced channels on the surfaces of fused silica substrates. The channels are sealed by bonding to a microscope cover slip spin-coated with a thin film of poly(dimethylsiloxane). Single-molecule detection experiments are conducted using a custom-built, wide-field microscope. The array of channels is epi-illuminated by a line-generating red diode laser, resulting in a line focus just a few microns thick across a 500 micron field of view. A dilute aqueous solution of fluorescently labeled biomolecules is loaded into the device and fluorescence is detected with an electron-multiplying CCD camera, allowing acquisition rates up to 7 kHz for each microchannel. Matched digital filtering based on experimental parameters is used to perform an initial, rapid assessment of detected fluorescence. More detailed analysis is obtained through fluorescence correlation spectroscopy. Simulated fluorescence data is shown to agree well with experimental values. PMID:25140634

Canfield, Brian K.; King, Jason K.; Robinson, William N.; Hofmeister, William H.; Davis, Lloyd M.

2014-01-01

364

Natural language processors  

SciTech Connect

The development of natural language processors has required a shift in the perception of language structures to bring the user interface closer to the ultimate ease of natural language dialogue. This article explains the principles of these new natural language processors which are increasingly becoming commercially available.

Rauzino, V.C.

1983-09-05

365

Simulation of an array-based neural net model  

NASA Technical Reports Server (NTRS)

Research in cognitive science suggests that much of cognition involves the rapid manipulation of complex data structures. However, it is very unclear how this could be realized in neural networks or connectionist systems. A core question is: how could the interconnectivity of items in an abstract-level data structure be neurally encoded? The answer appeals mainly to positional relationships between activity patterns within neural arrays, rather than directly to neural connections in the traditional way. The new method was initially devised to account for abstract symbolic data structures, but it also supports cognitively useful spatial analogue, image-like representations. As the neural model is based on massive, uniform, parallel computations over 2D arrays, the massively parallel processor is a convenient tool for simulation work, although there are complications in using the machine to the fullest advantage. An MPP Pascal simulation program for a small pilot version of the model is running.

Barnden, John A.

1987-01-01

366

Dedicated hardware processor and corresponding system-on-chip design for real-time laser speckle imaging  

NASA Astrophysics Data System (ADS)

Laser speckle imaging (LSI) is a noninvasive and full-field optical imaging technique which produces two-dimensional blood flow maps of tissues from the raw laser speckle images captured by a CCD camera without scanning. We present a hardware-friendly algorithm for the real-time processing of laser speckle imaging. The algorithm is developed and optimized specifically for LSI processing in the field programmable gate array (FPGA). Based on this algorithm, we designed a dedicated hardware processor for real-time LSI in FPGA. The pipeline processing scheme and parallel computing architecture are introduced into the design of this LSI hardware processor. When the LSI hardware processor is implemented in the FPGA running at the maximum frequency of 130 MHz, up to 85 raw images with the resolution of 640×480 pixels can be processed per second. Meanwhile, we also present a system on chip (SOC) solution for LSI processing by integrating the CCD controller, memory controller, LSI hardware processor, and LCD display controller into a single FPGA chip. This SOC solution also can be used to produce an application specific integrated circuit for LSI processing.

Jiang, Chao; Zhang, Hongyan; Wang, Jia; Wang, Yaru; He, Heng; Liu, Rui; Zhou, Fangyuan; Deng, Jialiang; Li, Pengcheng; Luo, Qingming

2011-11-01
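
The abstract above does not spell out its hardware-friendly algorithm. As a hedged illustration, the sketch below computes the spatial speckle contrast K = sigma / mean over a sliding window, a quantity commonly used in laser speckle imaging; whether the paper's processor computes exactly this is an assumption. The second function only restates the abstract's figures (130 MHz clock, 85 frames per second at 640×480) as a clock-cycles-per-pixel budget. Both function names are hypothetical.

```python
from statistics import mean, pstdev


def speckle_contrast(image, window=3):
    """Return the map of K = std / mean over each window x window patch.

    image: list of rows of pixel intensities; border pixels that lack a
    full window are skipped, so the output is smaller than the input.
    """
    height, width = len(image), len(image[0])
    r = window // 2
    out = []
    for y in range(r, height - r):
        row = []
        for x in range(r, width - r):
            patch = [
                image[j][i]
                for j in range(y - r, y + r + 1)
                for i in range(x - r, x + r + 1)
            ]
            m = mean(patch)
            row.append(pstdev(patch) / m if m else 0.0)
        out.append(row)
    return out


def cycles_per_pixel(clock_hz, frames_per_second, width, height):
    """Clock cycles available per pixel at a given frame rate."""
    return clock_hz / (frames_per_second * width * height)
```

With the abstract's numbers, `cycles_per_pixel(130e6, 85, 640, 480)` comes out just under 5, which suggests why a pipelined, parallel datapath is needed: a design that spent many cycles on each pixel could not sustain that frame rate.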

367

DiFX: A software correlator for very long baseline interferometry using multi-processor computing environments  

E-print Network

We describe the development of an FX style correlator for Very Long Baseline Interferometry (VLBI), implemented in software and intended to run in multi-processor computing environments, such as large clusters of commodity machines (Beowulf clusters) or computers specifically designed for high performance computing, such as multi-processor shared-memory machines. We outline the scientific and practical benefits for VLBI correlation, these chiefly being due to the inherent flexibility of software and the fact that the highly parallel and scalable nature of the correlation task is well suited to a multi-processor computing environment. We suggest scientific applications where such an approach to VLBI correlation is most suited and will give the best returns. We report detailed results from the Distributed FX (DiFX) software correlator, running on the Swinburne supercomputer (a Beowulf cluster of approximately 300 commodity processors), including measures of the performance of the system. For example, to correlate all Stokes products for a 10 antenna array, with an aggregate bandwidth of 64 MHz per station and using typical time and frequency resolution presently requires of order 100 desktop-class compute nodes. Due to the effect of Moore's Law on commodity computing performance, the total number and cost of compute nodes required to meet a given correlation task continues to decrease rapidly with time. We show detailed comparisons between DiFX and two existing hardware-based correlators: the Australian Long Baseline Array (LBA) S2 correlator, and the NRAO Very Long Baseline Array (VLBA) correlator. In both cases, excellent agreement was found between the correlators. Finally, we describe plans for the future operation of DiFX on the Swinburne supercomputer, for both astrophysical and geodetic science.

A. T. Deller; S. J. Tingay; M. Bailes; C. West

2007-02-06
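
The "FX" ordering named in the abstract above (Fourier transform each station's data first, then cross-multiply station pairs) can be sketched in a few lines. This is a toy model, not the DiFX implementation: it omits delay compensation, fringe rotation, polarization products and everything else a real correlator needs, and the function names are assumptions. A pure-Python DFT is used so the block has no third-party dependencies.

```python
import cmath
from itertools import combinations


def dft(samples):
    """Naive discrete Fourier transform of a short sample block."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fx_correlate(streams, nchan):
    """Return {(a, b): accumulated cross-spectrum} for every station pair.

    streams: one list of samples per station.
    nchan:   samples per DFT segment (also the number of spectral channels).
    """
    nseg = min(len(s) for s in streams) // nchan
    pairs = list(combinations(range(len(streams)), 2))
    vis = {pair: [0j] * nchan for pair in pairs}
    for seg in range(nseg):
        # F step: one spectrum per station for this time segment.
        spectra = [dft(s[seg * nchan:(seg + 1) * nchan]) for s in streams]
        # X step: multiply station a by the conjugate of station b.
        for (a, b), acc in vis.items():
            for k in range(nchan):
                acc[k] += spectra[a][k] * spectra[b][k].conjugate()
    return vis
```

For the 10-antenna array mentioned in the abstract, `fx_correlate` produces 10 × 9 / 2 = 45 baselines. Each baseline and each time segment is independent of the others, which is the property that makes the task map well onto a cluster.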

368

Buffered coscheduling for parallel programming and enhanced fault tolerance  

DOEpatents

A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval are accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.

Petrini, Fabrizio (Los Alamos, NM); Feng, Wu-chun (Los Alamos, NM)

2006-01-31

369

A structural approach to the photonic processor  

NASA Astrophysics Data System (ADS)

In the early 1990s, photonics, the confluence of electronics and optics technologies to improve net processing efficiency, was advanced to the highest priority ranking on the DoD critical technologies list. Currently, photonics is considered a high-leverage technology because it is believed that photonic processors could potentially circumvent the serial processor limitation, or von Neumann bottleneck, which limits the throughput capacity of most electronic processors. Indeed, the real-time solutions to current military problems, such as high-accuracy missile guidance, sensor fusion, automatic target recognition, automated guidance of remotely piloted vehicles, etc., are consistently crippled by information processing bottlenecks. Such bottlenecks are particularly endemic to image-formatted data bases. An image-formatted data base is defined as a data base where, besides the information contained in each pixel, there is also information imparted by the spatial relationship among the data in the pixels. Thus, in image data, variations in grey scale are used to define edges and corners. To extract the spatially imparted information, it is often necessary to compare N x N pixels in the input image with the N x N pixels in a model image; this process takes N^4 comparison calculations. As the demand for higher resolution imagery increases and N gets larger, it becomes increasingly more difficult to make the image comparisons in real time. Currently, digital electronic processor designs are optimized for numerical processing, which is an intrinsically serial operation. It is this serial nature that causes the limitation; the photonic processor, which can be designed with a more parallel architecture, has potential for circumventing this bottleneck. It is, therefore, anticipated that the intrinsic parallelism of optics will enable the photonic processor to solve problems in real time that were previously considered unsolvable or only marginally solvable.

Jackson, Deborah
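
The fourth-power comparison count quoted in the abstract above can be checked by brute force. The sketch assumes one plausible reading that the abstract does not state explicitly: the N x N model is tried at N x N candidate offsets, and every offset compares all N x N model pixels. The function name is hypothetical.

```python
def comparison_count(n: int) -> int:
    """Count pixel comparisons for an n x n model tried at n x n offsets."""
    count = 0
    for _dy in range(n):              # candidate vertical offsets
        for _dx in range(n):          # candidate horizontal offsets
            for _y in range(n):       # model rows
                for _x in range(n):   # model columns
                    count += 1        # one pixel-to-pixel comparison
    return count


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, comparison_count(n), n ** 4)
```

Doubling N multiplies the serial work by 16, which is the scaling the abstract points to when it argues that a serial processor cannot keep up with rising image resolution.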

370

Fabrication and Evaluation of a Micro(Bio)Sensor Array Chip for Multiple Parallel Measurements of Important Cell Biomarkers  

PubMed Central

This report describes the design and development of an integrated electrochemical cell culture monitoring system, based on enzyme-biosensors and chemical sensors, for monitoring indicators of mammalian cell metabolic status. MEMS technology was used to fabricate a microwell-format silicon platform including a thermometer, onto which chemical sensors (pH, O2) and screen-printed biosensors (glucose, lactate), were grafted/deposited. Microwells were formed over the fabricated sensors to give 5-well sensor strips which were interfaced with a multipotentiostat via a bespoke connector box interface. The operation of each sensor/biosensor type was examined individually, and examples of operating devices in five microwells in parallel, in either potentiometric (pH sensing) or amperometric (glucose biosensing) mode are shown. The performance characteristics of the sensors/biosensors indicate that the system could readily be applied to cell culture/toxicity studies. PMID:25360580

Pemberton, Roy M.; Cox, Timothy; Tuffin, Rachel; Drago, Guido A.; Griffiths, John; Pittson, Robin; Johnson, Graham; Xu, Jinsheng; Sage, Ian C.; Davies, Rhodri; Jackson, Simon K.; Kenna, Gerry; Luxton, Richard; Hart, John P.

2014-01-01

371

Fabrication and evaluation of a micro(bio)sensor array chip for multiple parallel measurements of important cell biomarkers.  

PubMed

This report describes the design and development of an integrated electrochemical cell culture monitoring system, based on enzyme-biosensors and chemical sensors, for monitoring indicators of mammalian cell metabolic status. MEMS technology was used to fabricate a microwell-format silicon platform including a thermometer, onto which chemical sensors (pH, O2) and screen-printed biosensors (glucose, lactate), were grafted/deposited. Microwells were formed over the fabricated sensors to give 5-well sensor strips which were interfaced with a multipotentiostat via a bespoke connector box interface. The operation of each sensor/biosensor type was examined individually, and examples of operating devices in five microwells in parallel, in either potentiometric (pH sensing) or amperometric (glucose biosensing) mode are shown. The performance characteristics of the sensors/biosensors indicate that the system could readily be applied to cell culture/toxicity studies. PMID:25360580

Pemberton, Roy M; Cox, Timothy; Tuffin, Rachel; Drago, Guido A; Griffiths, John; Pittson, Robin; Johnson, Graham; Xu, Jinsheng; Sage, Ian C; Davies, Rhodri; Jackson, Simon K; Kenna, Gerry; Luxton, Richard; Hart, John P

2014-01-01

372

Artificial intelligence in parallel  

SciTech Connect

The current rage in the Artificial Intelligence (AI) community is parallelism: the idea is to build machines with many independent processors doing many things at once. The upshot is that about a dozen parallel machines are now under development for AI alone. As might be expected, the approaches are diverse, yet there are a number of fundamental issues in common: granularity, topology, control, and algorithms.

Waldrop, M.M.

1984-08-10

373

Hypercluster - Parallel processing for computational mechanics  

NASA Technical Reports Server (NTRS)

An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

Blech, Richard A.

1988-01-01

374

Parallelism for imaging applications  

Microsoft Academic Search

Numerous image processing functions involve repetitive operations and therefore can benefit from parallel processing, where performance may be significantly improved as a function of the number of processors applied to the task. One such application that requires processing to be as near to real-time as possible is vision processing and, in particular, low level vision processing. A system developed by

M. P. Battaglia

1993-01-01

375

Micromechanical resonator array for an implantable bionic ear.  

PubMed

In this paper we report on a multi-resonant transducer that may be used to replace a traditional speech processor in cochlear implant applications. The transducer, made from an array of micro-machined polymer resonators, is capable of passively splitting sound into its frequency sub-bands without the need for analog-to-digital conversion and subsequent digital processing. Since all bands are mechanically filtered in parallel, there is low latency in the output signals. The simplicity of the device, high channel capability, low power requirements, and small form factor (less than 1 cm) make it a good candidate for a completely implantable bionic ear device. PMID:16439832

Bachman, Mark; Zeng, Fan-Gang; Xu, Tao; Li, G-P

2006-01-01

376

Efficiency of parallel direct optimization  

NASA Technical Reports Server (NTRS)

Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. © 2001 The Willi Hennig Society.

Janies, D. A.; Wheeler, W. C.

2001-01-01

377

Multiple Embedded Processors for Fault-Tolerant Computing  

NASA Technical Reports Server (NTRS)

A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.

Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

2005-01-01
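
The duplicate-and-compare idea in the record above (two cores plus a comparator for error detection) can be illustrated in software. The following Python sketch is not the authors' FPGA design; `run_duplicated`, `fault_mask_b`, and the integer-valued `square` task are hypothetical names introduced here, and the XOR mask merely stands in for a single-event upset.

```python
def run_duplicated(task, value, fault_mask_b=0):
    """Run `task` on two simulated cores and compare the results.

    `fault_mask_b` (hypothetical) is XORed into core B's integer result
    to imitate a spurious bit flip; 0 means no fault is injected.
    """
    result_a = task(value)                  # simulated core A
    result_b = task(value) ^ fault_mask_b   # simulated core B, possibly upset
    if result_a != result_b:                # the comparator
        raise RuntimeError("comparator mismatch: possible single-event upset")
    return result_a


def square(x):
    """Example integer task used only for illustration."""
    return x * x


if __name__ == "__main__":
    print(run_duplicated(square, 7))
    try:
        run_duplicated(square, 7, fault_mask_b=0b100)
    except RuntimeError as err:
        print("detected:", err)
```

With only two copies, the comparator can report a disagreement but cannot tell which copy is wrong, which is consistent with the abstract describing the two-core prototype as an error-detection scheme.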

378

Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging  

PubMed Central

A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10^5 and 10^6 for PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10^3 through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans. PMID:23979460

El-Ghussein, Fadi; Mastanduno, Michael A.; Jiang, Shudong; Pogue, Brian W.; Paulsen, Keith D.

2013-01-01

379

Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging  

NASA Astrophysics Data System (ADS)

A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10^5 and 10^6 for PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10^3 through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans.

El-Ghussein, Fadi; Mastanduno, Michael A.; Jiang, Shudong; Pogue, Brian W.; Paulsen, Keith D.

2014-01-01

380

Special-Purpose Ising Model Random Lattice Processor  

NASA Astrophysics Data System (ADS)

We have designed and built a special-purpose processor with a very good performance-to-price ratio, which suggests a new approach to parallel computing. A simple single-spin-flip Monte Carlo algorithm is realized in hardware, so the processor is suitable for studies of dynamic as well as thermodynamic properties of the two-dimensional Ising model with different types of inhomogeneities. The speed of the processor is determined entirely by the speed of the memories used in it: to perform an elementary Monte Carlo step, the processor needs a time only several percent longer than one memory cycle time. It thus realizes the fastest possible single-spin-flip Monte Carlo processor architecture.

Andreichenko, V. B.; Dotsenko, Vl. S.; Shchur, L. N.; Talapov, A. L.
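
For readers unfamiliar with single-spin-flip Monte Carlo, the following Python sketch shows one common software form of the elementary step on a two-dimensional Ising lattice. It is an illustration only: the abstract does not say which acceptance rule the hardware uses, so the Metropolis rule, the uniform `coupling`, the periodic boundaries, and the function names are all assumptions made here.

```python
import math
import random


def metropolis_sweep(spins, beta, rng, coupling=1.0):
    """Attempt n*n single-spin flips on an n x n lattice of +1/-1 spins.

    Assumptions (not from the abstract): Metropolis acceptance, uniform
    nearest-neighbour coupling, periodic boundary conditions.
    """
    n = len(spins)
    for _ in range(n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        s = spins[i][j]
        neighbours = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                      + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
        delta_e = 2.0 * coupling * s * neighbours  # energy change if s flips
        if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
            spins[i][j] = -s
    return spins


def magnetization(spins):
    """Mean spin value of the lattice, between -1 and 1."""
    n = len(spins)
    return sum(sum(row) for row in spins) / (n * n)


if __name__ == "__main__":
    rng = random.Random(0)
    lattice = [[rng.choice((-1, 1)) for _ in range(16)] for _ in range(16)]
    for _ in range(100):
        metropolis_sweep(lattice, beta=0.6, rng=rng)
    print(magnetization(lattice))
```

Each attempted flip needs only the four neighbouring spins and one random number, which gives some intuition for why the abstract ties the achievable step rate so closely to memory cycle time.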

381

Design of flexible GF(2^m) elliptic curve cryptography processors  

Microsoft Academic Search

The design of flexible elliptic curve cryptography processors (ECP) is considered in this paper. Novel word-level algorithms and implementations for the underlying GF(2^m) multiplication and squaring arithmetic which enable improved flexibility versus performance tradeoffs, are presented and employed in the design of an efficient flexible ECP architecture; corresponding field-programmable gate-array (FPGA) prototyping results for two different processor word lengths are

Mohammed Benaissa; Wei Ming Lim

2006-01-01
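
As background for the GF(2^m) arithmetic mentioned above, the following Python sketch shows the textbook bit-serial shift-and-add multiplication with polynomial reduction. It is a baseline illustration, not the word-level algorithm the paper proposes; the function name `gf2m_mul` and the small example field GF(2^8) with reduction polynomial 0x11B are choices made here for readability, whereas elliptic curve fields use much larger m.

```python
def gf2m_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m), with elements encoded as integers.

    `poly` is the reduction polynomial including its x^m term
    (for example 0x11B for x^8 + x^4 + x^3 + x + 1).
    """
    result = 0
    while b:
        if b & 1:
            result ^= a        # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1                # multiply a by x
        if (a >> m) & 1:       # degree reached m: reduce modulo poly
            a ^= poly
    return result


def gf2m_square(a, m, poly):
    """Squaring expressed through the general multiplier."""
    return gf2m_mul(a, a, m, poly)


if __name__ == "__main__":
    # Known GF(2^8) product under 0x11B: 0x57 * 0x83 = 0xC1.
    print(hex(gf2m_mul(0x57, 0x83, 8, 0x11B)))
```

A word-level design would process several bits of `b` per iteration instead of one; this sketch only fixes the arithmetic such a design has to reproduce.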

382

Parallel VLSI architecture emulation and the organization of APSA/MPP  

NASA Technical Reports Server (NTRS)

The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.

Odonnell, John T.

1987-01-01

383

3081/E processor  

SciTech Connect

The 3081/E project was formed to prepare a much improved IBM mainframe emulator for the future. Its design is based on a large amount of experience in using the 168/E processor to increase available CPU power in both online and offline environments. The processor will be at least equal to the execution speed of a 370/168 and up to 1.5 times faster for heavy floating point code. A single processor will thus be at least four times more powerful than the VAX 11/780, and five processors on a system would equal at least the performance of the IBM 3081K. With its large memory space and simple but flexible high speed interface, the 3081/E is well suited for the online and offline needs of high energy physics in the future.

Kunz, P.F.; Gravina, M.; Oxoby, G.; Rankin, P.; Trang, Q.; Ferran, P.M.; Fucci, A.; Hinton, R.; Jacobs, D.; Martin, B.

1984-04-01

384

A Simplified Analysis of Processor \\  

Microsoft Academic Search

This paper focuses attention upon the design of a processor and memory system which is structured to achieve a satisfactory balance of processor speed and memory speed when both the processor and input–output controller are simultaneously competing for memory service. A mathematical model is developed to investigate the degree to which the processor is capable of overlapping memory references with

JACK E. SHEMER; SOMESHWAR C. GUPTA

1969-01-01

385

Improving Latency Tolerance of Network Processors Through Simultaneous Multithreading  

Microsoft Academic Search

Existing multithreaded network processor architectures with multiple processing engines (PEs) aim to take advantage of the blocked multithreading technique, which executes instructions of different user-defined threads in the same PE pipeline in an explicit, interleaved way. Multiple PEs, each of which is a multithreaded processor core, process several packets in parallel to hide long memory access latency. Most of them are

Bo Liang; An Hong; Fang Lu; Rui Guo

2005-01-01

386

QSpike tools: a generic framework for parallel batch preprocessing of extracellular neuronal signals recorded by substrate microelectrode arrays  

PubMed Central

Micro-Electrode Arrays (MEAs) have emerged as a mature technique to investigate brain (dys)functions in vivo and in in vitro animal models. Often referred to as “smart” Petri dishes, MEAs have demonstrated a great potential particularly for medium-throughput studies in vitro, both in academic and pharmaceutical industrial contexts. Enabling rapid comparison of ionic/pharmacological/genetic manipulations with control conditions, MEAs are employed to screen compounds by monitoring non-invasively the spontaneous and evoked neuronal electrical activity in longitudinal studies, with relatively inexpensive equipment. However, in order to acquire sufficient statistical significance, recordings last up to tens of minutes and generate large amounts of raw data (e.g., 60 channels/MEA, 16 bits A/D conversion, 20 kHz sampling rate: approximately 8 GB per MEA per hour, uncompressed). Thus, when the experimental conditions to be tested are numerous, the availability of fast, standardized, and automated signal preprocessing becomes pivotal for any subsequent analysis and data archiving. To this aim, we developed an in-house cloud-computing system, named QSpike Tools, where CPU-intensive operations, required for preprocessing of each recorded channel (e.g., filtering, multi-unit activity detection, spike-sorting, etc.), are decomposed and batch-queued to a multi-core architecture or to a computer cluster. With the commercial availability of new and inexpensive high-density MEAs, we believe that disseminating QSpike Tools might facilitate its wide adoption and customization, and inspire the creation of community-supported cloud-computing facilities for MEA users. PMID:24678297

Mahmud, Mufti; Pulizzi, Rocco; Vasilaki, Eleni; Giugliano, Michele

2014-01-01
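
The raw data volume quoted in the record above (60 channels, 16-bit conversion, 20 kHz sampling) can be checked with a few lines of arithmetic. The Python sketch below only restates the figures from the abstract; the constant and function names are introduced here for illustration.

```python
# Figures quoted in the abstract.
CHANNELS = 60             # channels per MEA
BYTES_PER_SAMPLE = 2      # 16-bit A/D conversion
SAMPLE_RATE_HZ = 20_000   # 20 kHz sampling rate
SECONDS_PER_HOUR = 3600


def raw_bytes_per_hour(channels=CHANNELS,
                       bytes_per_sample=BYTES_PER_SAMPLE,
                       sample_rate_hz=SAMPLE_RATE_HZ):
    """Uncompressed bytes produced by one MEA in one hour of recording."""
    return channels * bytes_per_sample * sample_rate_hz * SECONDS_PER_HOUR


if __name__ == "__main__":
    total = raw_bytes_per_hour()
    print(f"{total / 1e9:.2f} GB per MEA per hour")
    print(f"{total / 2**30:.2f} GiB per MEA per hour")
```

The product is 8.64 × 10^9 bytes per hour, about 8.05 GiB, in line with the "approximately 8 GB" stated in the abstract.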

387

Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore  

SciTech Connect

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.

Liao, C; Quinlan, D J; Willcock, J J; Panas, T

2008-12-12

388

Onboard processor technology review  

NASA Technical Reports Server (NTRS)

The general need and requirements for the onboard embedded processors necessary to control and manipulate data in spacecraft systems are discussed. The current known requirements are reviewed from a user perspective, based on current practices in the spacecraft development process. The current capabilities of available processor technologies are then discussed, and these are projected to the generation of spacecraft computers currently under identified, funded development. An appraisal is provided for the current national developmental effort.

Benz, Harry F.

1990-01-01

389

Programmable Stream Processors  

Microsoft Academic Search

including 3D graphics, image compression, and signal processing, requires tens to hundreds of billions of computations per second. To achieve these computation rates, current media processors use special-purpose architectures tailored to one specific application. Such processors require significant design effort and are thus difficult to change as media-processing applications and algorithms evolve. The demand for flexibility in media processing motivates

Ujval J. Kapasi; Scott Rixner; William J. Dally; Brucek Khailany; Jung Ho Ahn; Peter R. Mattson; John D. Owens

2003-01-01

390

Message-Driven Processor architecture, Version 11. Artificial intelligence memo  

SciTech Connect

The Message-Driven Processor is a node of a large-scale multiprocessor being developed by the Concurrent VLSI Architecture Group. It is intended to support fine-grained, message-passing, parallel computation. It contains several novel architectural features, such as a low-latency network interface, extensive type-checking hardware, and on-chip memory that can be used as an associative lookup table. This document is a programmer's guide to the MDP. It describes the processor's register architecture, instruction set, and the data types supported by the processor. It also details the MDP's message sending and exception handling facilities.

Dally, W.; Chien, A.; Fiske, S.; Horwat, W.; Keen, J.

1988-08-18

391

The parallel quicksort algorithm part i–run time analysis  

Microsoft Academic Search

In this paper, a general-purpose sorting algorithm is produced which is suitable for execution on a parallel computer. The algorithm, which is based on Quicksort, does not require a fixed number of processors but may theoretically use as many processors as are available. The analysis of the algorithm reveals that there is a maximum number of processors that can

D. J. Evans; R. C. Dunbar

1982-01-01
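
The idea of a Quicksort that can use a variable number of processors can be sketched as follows. This Python example is not the algorithm analysed in the paper: it partitions the input a fixed number of times (`depth`, a parameter invented here) and then sorts the resulting buckets concurrently. It uses threads for simplicity, so under CPython's global interpreter lock it shows the structure rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items):
    """Three-way partition around the middle element."""
    pivot = items[len(items) // 2]
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return less, equal, greater


def quicksort(items):
    """Plain sequential Quicksort used for each bucket."""
    if len(items) <= 1:
        return list(items)
    less, equal, greater = partition(items)
    return quicksort(less) + equal + quicksort(greater)


def split_into_buckets(items, depth):
    """Partition recursively `depth` times; buckets come out in sorted order."""
    if depth == 0 or len(items) <= 1:
        return [items]
    less, equal, greater = partition(items)
    return (split_into_buckets(less, depth - 1) + [equal]
            + split_into_buckets(greater, depth - 1))


def parallel_quicksort(items, workers=4, depth=2):
    """Sort independent buckets concurrently, then concatenate them."""
    buckets = split_into_buckets(list(items), depth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_buckets = list(pool.map(quicksort, buckets))
    return [x for bucket in sorted_buckets for x in bucket]


if __name__ == "__main__":
    print(parallel_quicksort([5, 3, 8, 1, 9, 2, 7, 3, 0]))
```

Because the number of independent buckets is bounded by the partitioning depth and by how evenly the pivots split the data, adding workers beyond that number cannot help in this sketch.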

392

Data Parallel Switch-Level Simulation  

E-print Network

Mellon University Abstract Data parallel simulation involves simulating the behavior of a circuit over runs on a massively-parallel SIMD machine, with each processor simulating the circuit behavior parallelism in simulation utilize circuit parallelism. In this mode, the simulator extracts parallelism from

Bryant, Randal E.

393

Configurable Multi-Purpose Processor  

NASA Technical Reports Server (NTRS)

Advancements in technology have allowed the miniaturization of systems used in aerospace vehicles. This technology is driven by the need for next-generation systems that provide reliable, responsive, and cost-effective range operations while providing increased capabilities such as simultaneous mission support, increased launch trajectories, and improved launch and landing opportunities. Leveraging the newest technologies, the command and telemetry processor (CTP) concept provides a compact, flexible, and integrated solution for flight command and telemetry systems and range systems. The CTP is a relatively small circuit board that serves as a processing platform for high dynamic, high vibration environments. The CTP can be reconfigured and reprogrammed, allowing it to be adapted for many different applications. The design is centered around a configurable field-programmable gate array (FPGA) device that contains numerous logic cells that can be used to implement traditional integrated circuits. The FPGA contains two PowerPC processors that run the VxWorks real-time operating system and execute software programs specific to each application. The CTP was designed and developed specifically to provide telemetry functions; namely, the command processing, telemetry processing, and GPS metric tracking of a flight vehicle. However, it can be used as a general-purpose processor board to perform numerous functions implemented in either hardware or software using the FPGA's processors and/or logic cells. Functionally, the CTP was designed for range safety applications where it would ultimately become part of a vehicle's flight termination system. Consequently, the major functions of the CTP are forward link command processing, GPS metric tracking, return link telemetry data processing, error detection and correction, data encryption/decryption, and initiation of flight termination action commands.
Also, the CTP had to be designed to survive and operate in a launch environment. Additionally, the CTP was designed to interface with the WFF (Wallops Flight Facility) custom-designed transceiver board, which is used in the Low Cost TDRSS Transceiver (LCT2), also developed by WFF. The LCT2's transceiver board demodulates commands received from the ground via the forward link and sends them to the CTP, where they are processed. The CTP inputs and processes data from the inertial measurement unit (IMU) and the GPS receiver board, generates status data, and then sends the data to the transceiver board, where it is modulated and sent to the ground via the return link. Overall, the CTP has combined processing with the ability to interface to a GPS receiver, an IMU, and a pulse code modulation (PCM) communication link, while providing the capability to support common interfaces including Ethernet and serial interfaces, all on board a relatively small, lightweight package.

Valencia, J. Emilio; Forney, Chirstopher; Morrison, Robert; Birr, Richard

2010-01-01

394

An efficient massively parallel Euler solver for unstructured grids  

NASA Technical Reports Server (NTRS)

A data parallel mesh-vertex upwind finite-volume scheme for solving the Euler equations on triangular unstructured meshes is described. A novel vertex-based partitioning of the problem is introduced which minimizes the computation and communication costs associated with distributing the computation to the processors of a massively parallel computer. Finally, the performance of this unstructured computation on 8K processors of the Connection Machine CM-2 is compared with one processor of a Cray-YMP. The experiments show that 8K processors of the CM-2 achieve approximately 70 percent of the performance of one processor of the Cray-YMP on the unstructured mesh computations described here.

Hammond, Steven W.; Barth, Timothy J.

1991-01-01

395

Soft-core processor study for node-based architectures.  

SciTech Connect

Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA based processors for use in future NBA systems: two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.

Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James; Gallegos, Daniel E.; Learn, Mark Walter

2008-09-01

396

NWChem: scalable parallel computational chemistry  

SciTech Connect

NWChem is a general purpose computational chemistry code specifically designed to run on distributed memory parallel computers. The core functionality of the code focuses on molecular dynamics, Hartree-Fock and density functional theory methods for both plane-wave basis sets as well as Gaussian basis sets, tensor contraction engine based coupled cluster capabilities and combined quantum mechanics/molecular mechanics descriptions. It was realized from the beginning that scalable implementations of these methods required a programming paradigm inherently different from what message passing approaches could offer. In response a global address space library, the Global Array Toolkit, was developed. The programming model it offers is based on using predominantly one-sided communication. This model underpins most of the functionality in NWChem and the power of it is exemplified by the fact that the code scales to tens of thousands of processors. In this paper the core capabilities of NWChem are described as well as their implementation to achieve an efficient computational chemistry code with high parallel scalability. NWChem is a modern, open source, computational chemistry code specifically designed for large scale parallel applications. To meet the challenges of developing efficient, scalable and portable programs of this nature a particular code design was adopted. This code design involved two main features. First of all, the code is built up in a modular fashion so that a large variety of functionality can be integrated easily. Secondly, to facilitate writing complex parallel algorithms the Global Array toolkit was developed. This toolkit allows one to write parallel applications in a shared memory like approach, but offers additional mechanisms to exploit data locality to lower communication overheads. This framework has proven to be very successful in computational chemistry but is applicable to any engineering domain. 
Within the context created by the features above NWChem has grown into a general purpose computational chemistry code that supports a wide variety of energy expressions and capabilities to calculate properties based thereupon. The main energy expressions are classical mechanics force fields, Hartree-Fock and DFT both for finite systems and condensed phase systems, coupled cluster, as well as QM/MM. For most energy expressions single point calculations, geometry optimizations, excited states, and other properties are available. Below we briefly discuss each of the main energy expressions and the critical points involved in scalable implementations thereof.

van Dam, Hubertus JJ; De Jong, Wibe A.; Bylaska, Eric J.; Govind, Niranjan; Kowalski, Karol; Straatsma, TP; Valiev, Marat

2011-11-01

397

CFD on parallel computers  

NASA Astrophysics Data System (ADS)

CFD, or Computational Fluid Dynamics, is one of the scientific disciplines that has always posed new challenges to the capabilities of the modern, ultra-fast supercomputers, and now to the even faster parallel computers. For applications where number crunching is of primary importance, there is perhaps no escaping parallel computers, since sequential computers can only be (as projected) as fast as a few gigaflops and no more, unless, of course, some altogether new technology appears in the future. For parallel computers, on the other hand, there is no such limit, since any number of processors can be made to work in parallel. Computationally demanding CFD codes and parallel computers are therefore soul-mates, and will remain so for the foreseeable future. So much so that there is a separate and fast-emerging discipline that tackles problems specific to CFD as applied to parallel computers. For some years now, there has been an international conference on parallel CFD. So, one can indeed say that parallel CFD has arrived. To understand how CFD codes are parallelized, one must understand a little about how parallel computers function. Therefore, in what follows we will first deal with parallel computers, then with what a typical CFD code (if there is one such) looks like, and then with the strategies of parallelization.

Basu, A. J.

1994-10-01

398

Parallel algorithms for interactive manipulation of digital terrain models  

NASA Technical Reports Server (NTRS)

Interactive three-dimensional graphics applications, such as terrain data representation and manipulation, require extensive arithmetic processing. Massively parallel machines are attractive for this application since they offer high computational rates, and grid connected architectures provide a natural mapping for grid based terrain models. Presented here are algorithms for data movement on the Massively Parallel Processor (MPP) in support of pan and zoom functions over large data grids. It is an extension of earlier work that demonstrated real-time performance of graphics functions on grids that were equal in size to the physical dimensions of the MPP. When the dimensions of a data grid exceed the processing array size, data is packed in the array memory. Windows of the total data grid are interactively selected for processing. Movement of packed data is needed to distribute items across the array for efficient parallel processing. Execution time for data movement was found to exceed that for arithmetic aspects of graphics functions. Performance figures are given for routines written in MPP Pascal.

Davis, E. W.; Mcallister, D. F.; Nagaraj, V.

1988-01-01

399

Performance Issues in Parallelized Network Protocols  

Microsoft Academic Search

Parallel processing has been proposed as a means of improving network protocol throughput. Several different strategies have been taken towards parallelizing protocols. A relatively popular approach is packet-level parallelism, where packets are distributed across processors. This paper provides an experimental performance study of packet-level parallelism on a contemporary shared-memory multiprocessor. We examine several unexplored areas in packet-level parallelism

Erich M. Nahum; David J. Yates; James F. Kurose; Donald F. Towsley

1994-01-01

400

The RISAM Storage Manager for Parallel Architectures  

Microsoft Academic Search

The use of massively parallel processors for high performance data management is moving rapidly into the commercial mainstream. This paper describes key features of the RISAM (replicated ISAM) parallel data manager, which is currently available on the C-DAC PARAM 8000 series of parallel supercomputers and on the Unisys U6000 symmetric multiprocessors, and which can be readily ported to other platforms.

Atul Tulshibagwale; Sujal Parikh; Sameer Mahajan; Tushar Tambay; K. C. Pravin; R. Talashikar; R. Pande

1994-01-01

401

PARALLEL ALGORITHM DESIGN FOR BRANCH AND BOUND  

E-print Network

/or computationally expensive optimization problems sometimes require parallel or high-performance computing systems to achieve reasonable running times. This chapter gives an introduction to parallel computing for those of multi-processor workstations and Beowulf-style clusters has made parallel computing resources avail

Bader, David A.

402

MAPS: multi-algorithm parallel circuit simulation  

Microsoft Academic Search

The emergence of multi-core and many-core processors has introduced new opportunities and challenges to EDA research and development. While the availability of increasing parallel computing power holds new promise to address many computing challenges in CAD, the leverage of hardware parallelism can only be possible with a new generation of parallel CAD applications. In this paper, we propose a novel

Xiaoji Ye; Wei Dong; Peng Li; Sani R. Nassif

2008-01-01

403

Parallel computation and computers for artificial intelligence  

SciTech Connect

This book discusses Parallel Processing in Artificial Intelligence; Parallel Computing using Multilisp; Execution of Common Lisp in a Parallel Environment; Qlisp; Restricted AND-Parallel Execution of Logic Programs; PARLOG: Parallel Programming in Logic; and Data-driven Processing of Semantic Nets. Attention is also given to: Application of the Butterfly Parallel Processor in Artificial Intelligence; On the Range of Applicability of an Artificial Intelligence Machine; Low-level Vision on Warp and the Apply Programming Model; AHR: A Parallel Computer for Pure Lisp; FAIM-1: An Architecture for Symbolic Multi-processing; and Overview of AI Application Oriented Parallel Processing Research in Japan.

Kowalik, J.S. (Boeing Computer Services, Bellevue, WA (US))

1988-01-01

404

Optical SAR processor for space applications  

NASA Astrophysics Data System (ADS)

Synthetic Aperture Radar (SAR) systems typically generate copious amounts of data in the form of complex values difficult to compress. Processing this data provides real-valued images that are easier to compress, however comprehensive processing capabilities are required. Optical processor architectures provide inherent parallel computing capabilities that could be used advantageously for SAR data processing. Onboard SAR image generation would provide local access to processed information paving the way for real-time decisions. This could also provide benefits to navigation strategy or automatic instruments orientation. Moreover, for interplanetary missions or unmanned aerial vehicles (UAVs), onboard analysis of images could provide important feature identification clues and could help select the appropriate images to be transmitted to the ground (Earth). This would reduce the data throughput requirements and the related transmission bandwidth. This paper reviews the preliminary work performed for the analysis of SAR image generation using an optical processor and describes the set-up of an optical SAR processor prototype. Results of optical reconstruction of SAR signals acquired with a state-of-the-art SAR satellite are presented. Real-time processing capabilities and dynamic range calculations for a tracking optical processor architecture are also discussed.

Bourqui, Pascal; Harnisch, Bernd; Marchese, Linda; Bergeron, Alain

2008-04-01

405

RISC Processors and High Performance Computing  

NASA Technical Reports Server (NTRS)

In this tutorial, we will discuss the top five current RISC microprocessors: the IBM Power2, which is used in the IBM RS6000/590 workstation and in the IBM SP2 parallel supercomputer; the DEC Alpha, which is in the DEC Alpha workstation and in the Cray T3D; the MIPS R8000, which is used in the SGI Power Challenge; the HP PA-RISC 7100, which is used in the HP 700 series workstations and in the Convex Exemplar; and the Cray proprietary processor, which is used in the new Cray J916. The architecture of these microprocessors will first be presented. The effective performance of these processors will then be compared, both by citing standard benchmarks and in the context of implementing real applications. In the process, different programming models such as data parallel (CM Fortran and HPF) and message passing (PVM and MPI) will be introduced and compared. The latest NAS Parallel Benchmark (NPB) absolute performance and performance per dollar figures will be presented. The next generation of the NPB will also be described. The tutorial will conclude with a discussion of general trends in the field of high performance computing, including likely future developments in hardware and software technology, and the relative roles of vector supercomputers, tightly coupled parallel computers, and clusters of workstations. This tutorial will provide a unique cross-machine comparison not available elsewhere.

Saini, Subhash; Bailey, David H.; Lasinski, T. A. (Technical Monitor)

1995-01-01

406

3081//sub E/ processor  

SciTech Connect

Since the introduction of the 168//sub E/, emulating processors have been successful over an amazingly wide range of applications. This paper will describe a second generation processor, the 3081//sub E/. This new processor, which is being developed as a collaboration between SLAC and CERN, goes beyond just fixing the obvious faults of the 168//sub E/. Not only will the 3081//sub E/ have much more memory space, incorporate many more IBM instructions, and have full double precision floating point arithmetic, but it will also have faster execution times and be much simpler to build, debug, and maintain. The simple interface and reasonable cost of the 168//sub E/ will be maintained for the 3081//sub E/.

Kunz, P.F.; Gravina, M.; Oxoby, G.; Trang, Q.; Fucci, A.; Jacobs, D.; Martin, B.; Storr, K.

1983-03-01

407

Fat-Btree: An Update-Conscious Parallel Directory Structure  

Microsoft Academic Search

We propose a parallel directory structure, Fat-Btree, to improve high speed access for parallel database systems in shared nothing environments. The Fat-Btree has a threefold aim: to provide an indexing mechanism for fast retrieval in each processor; to balance the amount of data among distributed disks, and to reduce synchronization costs between processors during update operations. We use a probability

Haruo Yokota; Yasuhiko Kanemasa; Jun Miyazaki

1999-01-01

408

Finite Element Modeling on Scalable Parallel Computers  

NASA Technical Reports Server (NTRS)

A coupled finite element-integral equation was developed to model fields scattered from inhomogeneous, three-dimensional objects of arbitrary shape. This paper outlines how to implement the software on a scalable parallel processor.

Cwik, T.; Zuffada, C.; Jamnejad, V.; Katz, D.

1995-01-01

409

A Highly Efficient Cipher Processor for Dual-Field Elliptic Curve Cryptography  

Microsoft Academic Search

This brief presents a high-throughput dual-field elliptic-curve-cryptography (ECC) processor that features all ECC functions with the programmable field and curve parameters over both the prime and binary fields. The proposed architecture is parallel and scalable. Using 0.13-μm CMOS technology, the core size of the processor is 1.44 mm². The measured results show that our ECC processor can perform one

Jyu-Yuan Lai; Chih-Tsun Huang

2009-01-01

410

NAS Parallel Benchmarks Results  

NASA Technical Reports Server (NTRS)

The NAS Parallel Benchmarks (NPB) were developed in 1991 at NASA Ames Research Center to study the performance of parallel supercomputers. The eight benchmark problems are specified in a pencil-and-paper fashion, i.e., the complete details of the problem to be solved are given in a technical document, and except for a few restrictions, benchmarkers are free to select the language constructs and implementation techniques best suited for a particular system. In this paper, we present new NPB performance results for the following systems: (a) Parallel-Vector Processors: Cray C90, Cray T90 and Fujitsu VPP500; (b) Highly Parallel Processors: Cray T3D, IBM SP2 and IBM SP-TN2 (Thin Nodes 2); (c) Symmetric Multiprocessing Processors: Convex Exemplar SPP1000, Cray J90, DEC Alpha Server 8400 5/300, and SGI Power Challenge XL. We also present sustained performance per dollar for Class B LU, SP and BT benchmarks. We also mention NAS's future plans for NPB.

Saini, Subhash; Bailey, David H.; Lasinski, T. A. (Technical Monitor)

1995-01-01

411

Fast Parallel Computation Of Multibody Dynamics  

NASA Technical Reports Server (NTRS)

Constraint-force algorithm fast, efficient, parallel-computation algorithm for solving forward dynamics problem of multibody system like robot arm or vehicle. Solves problem in minimum time proportional to log(N) by use of optimal number of processors proportional to N, where N is number of dynamical degrees of freedom: in this sense, constraint-force algorithm both time-optimal and processor-optimal parallel-processing algorithm.

Fijany, Amir; Kwan, Gregory L.; Bagherzadeh, Nader

1996-01-01

412

Parallel logic simulation on general purpose machines  

Microsoft Academic Search

Three parallel algorithms for logic simulation have been developed and implemented on a general purpose shared-memory parallel machine. The first algorithm is a synchronous version of a traditional event-driven algorithm which achieves speed-ups of 6 to 9 with 15 processors. The second algorithm is a synchronous unit-delay compiled mode algorithm which achieves speed-ups of 10 to 13 with 15 processors.

Larry Soulé; Tom Blank

1988-01-01
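
The speed-up figures quoted in the logic simulation abstract above can be converted into parallel efficiency. The sketch below assumes the common definition of efficiency as speed-up divided by processor count, which the abstract itself does not state; the function name is illustrative.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Efficiency = speed-up / number of processors (assumed definition)."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Speed-up ranges reported in the abstract, both measured with 15 processors.
reported = {
    "synchronous event-driven": (6, 9),
    "synchronous unit-delay compiled mode": (10, 13),
}

for name, (low, high) in reported.items():
    low_eff = parallel_efficiency(low, 15)
    high_eff = parallel_efficiency(high, 15)
    print(f"{name}: {low_eff:.0%} to {high_eff:.0%} efficiency")
```

Under that definition the event-driven algorithm runs at roughly 40 to 60 percent efficiency and the compiled-mode algorithm at roughly 67 to 87 percent.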

413

Computational Characteristics of Production Seismic Migration and its Performance on Novel Processor Architectures  

Microsoft Academic Search

We describe the computational characteristics of the Kirchhoff prestack seismic migration currently used in daily production runs at Petrobras and its port to novel architectures. Fully developed in house, this portable and fault tolerant application has high sequential and parallel efficiency, with parallel scalability tested up to 8192 processors on the IBM Blue Gene without exhausting parallelism. Production load comprises

Jairo Panetta; P. R. P. de Souza Filho; C. A. da Cunha Filho; F. M. R. da Motta; S. S. Pinheiro; I. Pedrosa; A. L. R. Rosa; L. R. Monnerat; L. T. Carneiro; C. H. B. de Albrecht

2007-01-01

414

Comparison of Generated Parallel Capillary Arrays to Three-Dimensional Reconstructed Capillary Networks in Modeling Oxygen Transport in Discrete Microvascular Volumes  

PubMed Central

Objective: We compare Reconstructed Microvascular Networks (RMN) to Parallel Capillary Arrays (PCA) under several simulated physiological conditions to determine how the use of different vascular geometry affects oxygen transport solutions. Methods: Three discrete networks were reconstructed from intravital video microscopy of rat skeletal muscle (84×168×342 μm, 70×157×268 μm and 65×240×571 μm) and hemodynamic measurements were made in individual capillaries. PCAs were created based on statistical measurements from RMNs. Blood flow and O2 transport models were applied and the resulting solutions for RMN and PCA models were compared under 4 conditions (rest, exercise, ischemia and hypoxia). Results: Predicted tissue PO2 was consistently lower in all RMN simulations compared to the paired PCA. PO2 for 3D reconstructions at rest were 28.2±4.8, 28.1±3.5, and 33.0±4.5 mmHg for networks I, II, and III compared to the PCA mean values of 31.2±4.5, 30.6±3.4, and 33.8±4.6 mmHg. Simulated exercise yielded mean tissue PO2 in the RMN of 10.1±5.4, 12.6±5.7, and 19.7±5.7 mmHg compared to 15.3±7.3, 18.8±5.3, and 21.7±6.0 in PCA. Conclusions: These findings suggest that volume-matched PCA yield different results compared to reconstructed microvascular geometries when applied to O2 transport modeling, the predominant characteristic of this difference being an overestimate of mean tissue PO2. Despite this limitation, PCA models remain important for theoretical studies as they produce PO2 distributions with similar shape and parameter dependence as RMN. PMID:23841679

Fraser, Graham M.; Goldman, Daniel; Ellis, Christopher G.

2013-01-01

415

Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming  

NASA Technical Reports Server (NTRS)

Problems which can arise with vector and parallel computers are discussed in a user oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.

Gentzsch, W.

1982-01-01

416

Ultra Dependable Processor  

NASA Astrophysics Data System (ADS)

This paper presents a processor architecture that provides a much higher level of dependability than current ones. Its features are: (1) fault tolerance and secure processing are integrated into a modern superscalar VLSI processor; (2) light-weight effective soft-error tolerant mechanisms are proposed and evaluated; (3) timing errors on random logic and registers are prevented by low-overhead mechanisms; (4) program behavior is hidden from the outer world by proposed address translation methods; (5) information leakage can be avoided by attaching policy tags to all data and monitoring them for each instruction execution; (6) injection attacks are avoided with much higher accuracy than in current systems, by providing tag tracking; (7) the overall structure of the dependable processor is proposed with a dependability manager which controls the detection of illegal conditions and recovers to the normal mode; and (8) an FPGA-based testbed system is developed where the system clock and the voltage are intentionally varied for experiment. The paper presents the fundamental scheme for dependability, the elemental technologies for dependability, and the whole architecture of the ultra dependable processor, and concludes with future work.

Sakai, Shuichi; Goshima, Masahiro; Irie, Hidetsugu

417

The Prospero Resource Manager: A scalable framework for processor allocation in distributed systems  

Microsoft Academic Search

Existing techniques for allocating processors in parallel and distributed systems are not suitable for use in large distributed systems. In such systems, dedicated multiprocessors should exist as an integral component of the distributed system, and idle processors should be available to applications that need them. The Prospero Resource Manager (PRM) is a scalable resource allocation system that supports the allocation of processing resources in

B. Clifford Neuman; Santosh Rao

1994-01-01

418

An optical/digital processor - Hardware and applications  

NASA Technical Reports Server (NTRS)

A real-time two-dimensional hybrid processor consisting of a coherent optical system, an optical/digital interface, and a PDP-11/15 control minicomputer is described. The input electrical-to-optical transducer is an electron-beam addressed potassium dideuterium phosphate (KD2PO4) light valve. The requirements and hardware for the output optical-to-digital interface, which is constructed from modular computer building blocks, are presented. Initial experimental results demonstrating the operation of this hybrid processor in phased-array radar data processing, synthetic-aperture image correlation, and text correlation are included. The applications chosen emphasize the role of the interface in the analysis of data from an optical processor and possible extensions to the digital feedback control of an optical processor.

Casasent, D.; Sterling, W. M.

1975-01-01

419

Fault detection and bypass in a sequence information signal processor  

NASA Technical Reports Server (NTRS)

The invention comprises a plurality of scan registers, each such register respectively associated with a processor element; an on-chip comparator, encoder and fault bypass register. Each scan register generates a unitary signal the logic state of which depends on the correctness of the input from the previous processor in the systolic array. These unitary signals are input to a common comparator which generates an output indicating whether or not an error has occurred. These unitary signals are also input to an encoder which identifies the location of any fault detected so that an appropriate multiplexer can be switched to bypass the faulty processor element. Input scan data can be readily programmed to fully exercise all of the processor elements so that no fault can remain undetected.

Peterson, John C. (Inventor); Chow, Edward T. (Inventor)

1992-01-01

420

Parallel Genetic Algorithm for Alpha Spectra Fitting  

NASA Astrophysics Data System (ADS)

We present a performance study of alpha-particle spectra fitting using a parallel Genetic Algorithm (GA). The method uses a two-step approach. In the first step we run the parallel GA to find an initial solution for the second step, in which we use the Levenberg-Marquardt (LM) method for a precise final fit. GA is a resource-intensive method, so we use a Beowulf cluster for parallel simulation. The relationship between simulation time (and parallel efficiency) and the number of processors is studied using several alpha spectra, with the aim of obtaining a method to estimate the optimal number of processors for a simulation.

García-Orellana, Carlos J.; Rubio-Montero, Pilar; González-Velasco, Horacio

2005-01-01

421

Parallelization of a treecode  

E-print Network

I describe here the performance of a parallel treecode with individual particle timesteps. The code is based on the Barnes-Hut algorithm and runs cosmological N-body simulations on parallel machines with a distributed memory architecture using the MPI message-passing library. For a configuration with a constant number of particles per processor the scalability of the code was tested up to $P=128$ processors on an IBM SP4 machine. In the large $P$ limit the average CPU time per processor necessary for solving the gravitational interactions is $\sim 10\%$ higher than that expected from the ideal scaling relation. The processor domains are determined every large timestep according to a recursive orthogonal bisection, using a weighting scheme which takes into account the total particle computational load within the timestep. The results of the numerical tests show that the load balancing efficiency $L$ of the code is high ($\geq 90\%$) up to $P=32$, and decreases to $L \sim 80\%$ when $P=128$. In the latter case it is found that some aspects of the code performance are affected by machine hardware, while the proposed weighting scheme can achieve a load balance as high as $L \sim 90\%$ even in the large $P$ limit.

R. Valdarnini

2003-03-18
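
The weighted recursive orthogonal bisection described in the treecode abstract above can be sketched as follows. This is a minimal illustration and not the paper's code: it assumes each particle carries a single scalar work weight, that the number of domains is a power of two, and that load-balance efficiency is the mean domain load divided by the maximum domain load. None of these details is given in the abstract, and all names are hypothetical.

```python
from typing import List, Tuple

Particle = Tuple[Tuple[float, ...], float]  # (position, work weight); assumed layout


def bisect(particles: List[Particle], parts: int, axis: int = 0) -> List[List[Particle]]:
    """Split particles into `parts` domains (a power of two) of similar total weight."""
    if parts == 1:
        return [particles]
    ordered = sorted(particles, key=lambda p: p[0][axis])
    half = sum(w for _, w in ordered) / 2.0
    running, cut = 0.0, len(ordered)
    for i, (_, w) in enumerate(ordered):
        running += w
        if running >= half:          # first index where half the work is reached
            cut = i + 1
            break
    dims = len(ordered[0][0]) if ordered else 1
    nxt = (axis + 1) % dims          # cycle through the coordinate axes
    return bisect(ordered[:cut], parts // 2, nxt) + bisect(ordered[cut:], parts // 2, nxt)


def load_balance(domains: List[List[Particle]]) -> float:
    """Mean domain load divided by maximum domain load (assumed definition of L)."""
    loads = [sum(w for _, w in d) for d in domains]
    peak = max(loads)
    return 1.0 if peak == 0 else (sum(loads) / len(loads)) / peak


# A 4 x 4 grid of particles whose work weight grows with x.
particles = [((float(x), float(y)), 1.0 + x) for x in range(4) for y in range(4)]
domains = bisect(particles, 4)
print([sum(w for _, w in d) for d in domains], load_balance(domains))
```

Cutting on accumulated weight rather than particle count is what lets the domains follow the computational load; with so few particles the granularity of the cut limits how close to 1 the balance can get.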

422

A scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC (Superconducting Super Collider) detectors  

SciTech Connect

A new era of high-energy physics research is beginning requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequence, both orders of magnitude higher data rates from the detector and online processing power, well beyond the capabilities of current high energy physics data acquisition systems, are required. This paper describes a new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of GigaBytes per second from the detector and into an array of online processors (i.e., processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the system architecture are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build the self-routing parallel event builder will also be given in the paper. 3 figs., 1 tab.

Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C. (Fermi National Accelerator Lab., Batavia, IL (USA)); Lockyer, N.; VanBerg, R. (Pennsylvania Univ., Philadelphia, PA (USA))

1989-12-01

423

Accelerating the performance of a novel meshless method based on collocation with radial basis functions by employing a graphical processing unit as a parallel coprocessor  

NASA Astrophysics Data System (ADS)

In recent times, a variety of industries, applications and numerical methods including the meshless method have enjoyed a great deal of success by utilizing the graphical processing unit (GPU) as a parallel coprocessor. These benefits often include performance improvement over the previous implementations. Furthermore, applications running on graphics processors enjoy better performance per dollar and performance per watt than implementations built exclusively on traditional central processing technologies. The GPU was originally designed for graphics acceleration but the modern GPU, known as the General Purpose Graphical Processing Unit (GPGPU), can be used for scientific and engineering calculations. The GPGPU consists of a massively parallel array of integer and floating point processors. There are typically hundreds of processors per graphics card with dedicated high-speed memory. This work describes an application written by the author, titled GaussianRBF, to show the implementation and results of a novel meshless method that incorporates the collocation of the Gaussian radial basis function by utilizing the GPU as a parallel co-processor. Key phases of the proposed meshless method have been executed on the GPU using the NVIDIA CUDA software development kit. In particular, the matrix fill and solution phases have been carried out on the GPU, along with some post processing. This approach resulted in a decreased processing time compared to a similar algorithm implemented on the CPU while maintaining the same accuracy.

Owusu-Banson, Derek
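
The matrix fill and solution phases named in the abstract above can be illustrated with a small CPU-only sketch. It is not the author's GaussianRBF application: it performs one-dimensional interpolation rather than PDE collocation, runs in plain Python rather than CUDA, and the node set, shape parameter `eps`, and function names are all assumptions made for illustration.

```python
import math


def gaussian_rbf(r: float, eps: float) -> float:
    """Gaussian radial basis function exp(-(eps * r)^2)."""
    return math.exp(-(eps * r) ** 2)


def fill_matrix(nodes, eps):
    """Matrix fill phase: A[i][j] = phi(|x_i - x_j|). Each entry is independent,
    which is why this phase maps naturally onto many parallel threads."""
    return [[gaussian_rbf(abs(xi - xj), eps) for xj in nodes] for xi in nodes]


def solve(a, b):
    """Solution phase: Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def interpolate(nodes, weights, eps, x):
    """Evaluate the RBF expansion at point x."""
    return sum(w * gaussian_rbf(abs(x - xj), eps) for w, xj in zip(weights, nodes))


nodes = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [math.sin(math.pi * x) for x in nodes]
eps = 3.0
weights = solve(fill_matrix(nodes, eps), values)
print(interpolate(nodes, weights, eps, 0.5))
```

By construction the expansion reproduces the sampled values at the nodes; accuracy between nodes depends on the choice of `eps` and the node spacing.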

424

3D-Flow processor for a programmable Level-1 trigger (feasibility study)  

SciTech Connect

A feasibility study has been made to use the 3D-Flow processor in a pipelined programmable parallel processing architecture to identify particles such as electrons, jets, muons, etc., in high-energy physics experiments.

Crosetto, D.

1992-10-01

425

Efficacy of Code Optimization on Cache-based Processors  

NASA Technical Reports Server (NTRS)

The current common wisdom in the U.S. is that the powerful, cost-effective supercomputers of tomorrow will be based on commodity (RISC) micro-processors with cache memories. Already, most distributed systems in the world use such hardware as building blocks. This shift away from vector supercomputers and towards cache-based systems has brought about a change in programming paradigm, even when ignoring issues of parallelism. Vector machines require inner-loop independence and regular, non-pathological memory strides (usually this means: non-power-of-two strides) to allow efficient vectorization of array operations. Cache-based systems require spatial and temporal locality of data, so that data once read from main memory and stored in high-speed cache memory is used optimally before being written back to main memory. This means that the most cache-friendly array operations are those that feature zero or unit stride, so that each unit of data read from main memory (a cache line) contains information for the next iteration in the loop. Moreover, loops ought to be 'fat', meaning that as many operations as possible are performed on cache data, provided instruction caches do not overflow and enough registers are available. If unit stride is not possible, for example because of some data dependency, then care must be taken to avoid pathological strides, just as on vector computers. For cache-based systems the issues are more complex, due to the effects of associativity and of non-unit block (cache line) size. But there is more to the story. Most modern micro-processors are superscalar, which means that they can issue several (arithmetic) instructions per clock cycle, provided that there are enough independent instructions in the loop body. This is another argument for providing fat loop bodies. With these restrictions, it appears fairly straightforward to produce code that will run efficiently on any cache-based system. 
It can be argued that although some of the important computational algorithms employed at NASA Ames require different programming styles on vector machines and cache-based machines, respectively, neither architecture class appeared to be favored by particular algorithms in principle. Practice tells us that the situation is more complicated. This report presents observations and some analysis of performance tuning for cache-based systems. We point out several counterintuitive results that serve as a cautionary reminder that memory accesses are not the only factors that determine performance, and that within the class of cache-based systems, significant differences exist.

VanderWijngaart, Rob F.; Chancellor, Marisa K. (Technical Monitor)

1997-01-01
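
The remarks on unit stride and pathological strides in the abstract above can be made concrete with a toy cache model. The sketch below simulates a direct-mapped cache and counts misses for a strided sweep that is repeated twice. The cache geometry (64 lines of 8 elements) and the strides are illustrative assumptions and do not describe any machine from the report.

```python
def count_misses(stride: int, accesses: int, passes: int = 2,
                 line_elems: int = 8, num_lines: int = 64) -> int:
    """Count misses in a toy direct-mapped cache for a strided sweep
    over elements 0, stride, 2*stride, ... repeated `passes` times."""
    tags = [None] * num_lines                    # block currently held in each slot
    misses = 0
    for _ in range(passes):
        for i in range(accesses):
            block = (i * stride) // line_elems   # memory block (cache line) number
            slot = block % num_lines             # direct-mapped placement
            if tags[slot] != block:              # miss: fetch block, evict old one
                tags[slot] = block
                misses += 1
    return misses


for stride in (1, 64, 72):
    print(f"stride {stride:3d}: {count_misses(stride, accesses=32)} misses")
```

In this model the unit-stride sweep misses 4 times, the power-of-two stride of 64 misses on all 64 accesses because its 32 blocks compete for only 8 slots, and padding the stride to 72 spreads the blocks over 32 distinct slots so that only the 32 first-touch misses remain.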

426

Fast Parallel Computation Of Manipulator Inverse Dynamics  

NASA Technical Reports Server (NTRS)

Method for fast parallel computation of inverse dynamics problem, essential for real-time dynamic control and simulation of robot manipulators, undergoing development. Enables exploitation of high degree of parallelism and achievement of significant computational efficiency, while minimizing various communication and synchronization overheads as well as complexity of required computer architecture. Universal real-time robotic controller and simulator (URRCS) consists of internal host processor and several SIMD processors with ring topology. Architecture modular and expandable: more SIMD processors added to match size of problem. Operate asynchronously and in MIMD fashion.

Fijany, Amir; Bejczy, Antal K.

1991-01-01

427

Parallel machine architecture for production rule systems  

DOEpatents

A parallel processing system for production rule programs utilizes a host processor for storing production rule right hand sides (RHS) and a plurality of rule processors for storing left hand sides (LHS). The rule processors operate in parallel in the Recognize phase of the system Recognize-Act cycle to match their respective LHSs against a stored list of working memory elements (WMEs) in order to find a self-consistent set of WMEs. The list of WMEs is dynamically varied during the Act phase of the system, in which the host executes or fires rule RHSs for those rules for which a self-consistent set has been found by the rule processors. The host transmits instructions for creating or deleting working memory elements as dictated by the rule firings until the rule processors are unable to find any further self-consistent working memory element sets, at which time the production rule system is halted.

Allen, Jr., John D. (Knoxville, TN); Butler, Philip L. (Knoxville, TN)

1989-01-01
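
The Recognize-Act cycle described in the production rule abstract above can be sketched in a few lines. This is a sequential illustration under stated assumptions, not the patented design: the rule processors are replaced by a plain loop, only the first match is fired per cycle (the abstract does not say how conflicts are resolved), and the grandparent rule and all names are hypothetical.

```python
def lhs_grandparent(wm):
    """LHS: find (parent a b) and (parent b c) with no (grandparent a c) yet."""
    parents = [w for w in wm if w[0] == "parent"]
    return [(a, c)
            for (_, a, b) in parents
            for (_, b2, c) in parents
            if b == b2 and ("grandparent", a, c) not in wm]


def rhs_grandparent(match):
    """RHS: instructions for creating or deleting working memory elements."""
    a, c = match
    return [("add", ("grandparent", a, c))]


def run(rules, working_memory):
    """Repeat the Recognize-Act cycle until no rule matches."""
    wm = set(working_memory)
    while True:
        # Recognize phase: every LHS is matched against working memory.
        # The patent does this on parallel rule processors; here it is a loop.
        conflict_set = [(rhs, m) for lhs, rhs in rules for m in lhs(wm)]
        if not conflict_set:
            return wm                      # nothing matches: the system halts
        # Act phase: the host fires one RHS and updates working memory.
        rhs, match = conflict_set[0]
        for op, wme in rhs(match):
            if op == "add":
                wm.add(wme)
            else:
                wm.discard(wme)


facts = {("parent", "ann", "bob"), ("parent", "bob", "cat"), ("parent", "cat", "dan")}
result = run([(lhs_grandparent, rhs_grandparent)], facts)
print(sorted(result))
```

The rule's own negative condition (no existing grandparent element) is what makes the loop terminate in this example; each firing removes one match from the next Recognize phase.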

428

Parallel processing data network of master and slave transputers controlled by a serial control network  

DOEpatents

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.

Crosetto, D.B.

1996-12-31

429

FY 2006 Accomplishment Colony - "Services and Interfaces to Support Large Numbers of Processors"  

SciTech Connect

The Colony Project is developing operating system and runtime system technology to enable efficient general purpose environments on tens of thousands of processors. To accomplish this, we are investigating memory management techniques, fault management strategies, and parallel resource management schemes. Recent results show promising findings for scalable strategies based on processor virtualization, in-memory checkpointing, and parallel aware modifications to full featured operating systems.

Jones, T; Kale, L; Moreira, J; Mendes, C; Chakravorty, S; Tauferner, A; Inglett, T

2006-06-30

430

Initial Experiences Porting a Bioinformatics Application to a Graphics Processor  

Microsoft Academic Search

Bioinformatics applications are among the most relevant and compute-demanding applications today. While normally these applications are executed on clusters or dedicated parallel systems, in this work we explore the use of an alternative architecture. We focus on exploiting the compute-intensive characteristics offered by the graphics processors (GPU) in order to accelerate a bioinformatics application. The GPU is a

Maria Charalambous; Pedro Trancoso; Alexandros Stamatakis

2005-01-01

431

Optimization of Particle-in-Cell Codes on RISC Processors  

NASA Technical Reports Server (NTRS)

General strategies are developed to optimize particle-in-cell codes written in Fortran for RISC processors which are commonly used on massively parallel computers. These strategies include data reorganization to improve cache utilization and code reorganization to improve efficiency of arithmetic pipelines.

Decyk, Viktor K.; Karmesin, Steve Roy; Boer, Aeint de; Liewer, Paulette C.

1996-01-01

432

Implementation and performance evaluation of reconstruction algorithms on graphics processors  

Microsoft Academic Search

The high-throughput needs in electron tomography and in single particle analysis have driven the parallel implementation of several reconstruction algorithms and software packages on computing clusters. Here, we report on the implementation of popular reconstruction algorithms such as weighted backprojection, simultaneous iterative reconstruction technique (SIRT) and simultaneous algebraic reconstruction technique (SART) on common graphics processors (GPUs). The speed gain achieved on

Daniel Castaño Díez; Hannes Mueller; Achilleas S. Frangakis

2007-01-01

433

A high-speed Radon transform and backprojection processor  

Microsoft Academic Search

An expandable multiprocessor hardware system for the computation of the Radon transform and backprojection equations is described. The system is constructed with commercially available digital signal processing (DSP) chips. This Radon transform processor is based upon a parallel-pipelined multiprocessor architecture. Performance characteristics of the hardware system are presented

Eric Shieh; Paul Hurst; Iskender Agi

1990-01-01

434

A mask programmable DSP array  

Microsoft Academic Search

A 125000-gate 1.4-μm CMOS DSP (digital signal processor) array, which offers single-chip solutions for functions such as convolution and fast Fourier transforms, is described. High performance is obtained using full-custom RAM and multipliers, and the required system function is achieved by configuring a conventional gate array. Examples of possible applications include two standard products already designed using the DSP array.

R. D. Albon; G. E. Floyd; J. E. Coles

1989-01-01

435

Optoelectronic signal processing for phased-array antennas  

SciTech Connect

These proceedings contain 39 papers grouped under the headings of: High frequency laser sources; High speed/high frequency optical components; Radiating and control elements; Architectures and algorithms; Signal processors I; Signal processors II; Optically controlled phased-array antennas I; Optically controlled phased-array antennas II.

Bhasin, K.B.; Hendrickson, B.M.

1988-01-01

436

Scalable parallel communications  

NASA Technical Reports Server (NTRS)

Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulation studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. 
In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.

Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

1992-01-01

437

Scalable parallel communications  

NASA Astrophysics Data System (ADS)

Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulation studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low-cost solution to providing multiple 100 Mbps on current machines.
In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, while other TCPs running in parallel provide high bandwidth service to a single application); and (3) coarse-grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.

Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

1992-06-01

438

Electrostatically focused addressable field emission array chips (AFEA's) for high-speed massively parallel maskless digital E-beam direct write lithography and scanning electron microscopy  

DOEpatents

Systems and methods are described for addressable field emission array (AFEA) chips. A method of operating an addressable field-emission array includes: generating a plurality of electron beams from a plurality of emitters that compose the addressable field-emission array; and focusing at least one of the plurality of electron beams with an on-chip electrostatic focusing stack. The systems and methods provide advantages including the avoidance of space-charge blow-up.

Thomas, Clarence E. (Knoxville, TN); Baylor, Larry R. (Farragut, TN); Voelkl, Edgar (Oak Ridge, TN); Simpson, Michael L. (Knoxville, TN); Paulus, Michael J. (Knoxville, TN); Lowndes, Douglas H. (Knoxville, TN); Whealton, John H. (Oak Ridge, TN); Whitson, John C. (Clinton, TN); Wilgen, John B. (Oak Ridge, TN)

2002-12-24

439

Optically smart active antenna arrays  

Microsoft Academic Search

A prototype X-band active antenna array with adaptive optical processing is presented. The optical processor, referred to as an auto-tuning filter, is able to extract the strongest principal component in a two-signal space with up to 30 dB enhancement with respect to the other signals. The processor is compact (8 cm by 4 cm) and scalable to a large number

Dana Z. Anderson; V. Damiao; Edeline Fotheringham; Darko Popovic; Stefania Romisch; Zoya Popovic

2000-01-01

440

CORDIC processor architectures  

NASA Astrophysics Data System (ADS)

As CORDIC algorithms receive more and more attention in elementary function evaluation and signal processing applications, the problem of their VLSI realization has attracted considerable interest. In this work we review the CORDIC fundamentals covering algorithm, architecture, and implementation issues. Various aspects of the CORDIC algorithm are investigated such as efficient scale factor compensation, redundant and non-redundant addition schemes, and convergence domain. Several CORDIC processor architectures and implementation examples are discussed.

Boehme, Johann F.; Timmermann, D.; Hahn, H.; Hosticka, Bedrich J.

1991-12-01
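
The rotation-mode iteration and scale-factor compensation that the abstract above reviews can be sketched in a few lines. This is a minimal floating-point illustration, not one of the architectures discussed in the paper: the function name `cordic_sin_cos` and the default iteration count are assumptions, and a hardware realization would use fixed-point shifts and adds instead of multiplications by powers of two.

```python
import math

def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC sketch (hypothetical helper, not from the paper).

    Valid inside the convergence domain of the basic algorithm,
    roughly |theta| <= 1.74 rad (the sum of all atan(2^-i)).
    """
    # Elementary rotation angles atan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i);
    # the product of these factors is the CORDIC gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    # Scale-factor compensation: start from 1/gain instead of 1.
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0      # rotate toward z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x                          # (sin(theta), cos(theta))
```

Pre-scaling the start vector is only one way to compensate the gain; the result could equally be multiplied by 1/gain after the last iteration.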

441

Distributed processor allocation for launching applications in a massively connected processors complex  

DOEpatents

A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.

Pedretti, Kevin (Goleta, CA)

2008-11-18

442

Parallel algorithms for message decomposition  

SciTech Connect

The authors consider the deterministic and random parallel complexity (time and processor) of message decoding: an essential problem in communications systems and translation systems. They present an optimal parallel algorithm to decompose prefix-coded messages and uniquely decipherable-coded messages in O(n/P) time, using O(P) processors (for all P: 1 ≤ P ≤ n/log n) deterministically as well as randomly on the weakest version of parallel random access machines in which concurrent read and concurrent write to a cell in the common memory are not allowed. This is done by reducing decoding to parallel finite-state automata simulation and the prefix sums.

Teng, S.H.; Wang, B.

1987-06-01
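
To give a feel for why prefix-code decoding can be parallelized at all, the sketch below splits it into two data-parallel phases. It is a simplified illustration and not the authors' algorithm: it uses pointer jumping rather than the finite-state automata simulation and prefix sums named in the abstract, it is not work-optimal, and the loops that would run in parallel are written sequentially. The function name `decode_prefix_code` and the dictionary representation of the code are assumptions.

```python
def decode_prefix_code(bits, code):
    """Decode a prefix-coded bit string (hypothetical sketch).

    bits: string of '0'/'1'; code: dict mapping codeword -> symbol.
    """
    n = len(bits)
    max_len = max(len(word) for word in code)

    # Phase 1: for every position, find the codeword that would start
    # there. Each position is independent, so this loop is data-parallel.
    nxt = [n] * (n + 1)          # nxt[n] = n acts as a sentinel
    sym = [None] * n
    for i in range(n):
        for length in range(1, max_len + 1):
            word = bits[i:i + length]
            if len(word) == length and word in code:
                nxt[i] = i + length
                sym[i] = code[word]
                break

    # Phase 2: mark the positions reachable from 0 by following nxt.
    # Pointer jumping doubles the hop distance every round, so about
    # log2(n) rounds suffice; each round is again data-parallel.
    is_start = [False] * (n + 1)
    is_start[0] = True
    jump = nxt[:]
    hops = 1
    while hops <= n:
        marked = is_start[:]
        for i in range(n + 1):
            if is_start[i]:
                marked[jump[i]] = True
        is_start = marked
        jump = [jump[jump[i]] for i in range(n + 1)]
        hops *= 2

    return [sym[i] for i in range(n) if is_start[i] and sym[i] is not None]
```

For example, with the hypothetical code `{"0": "a", "10": "b", "11": "c"}`, positions 2 and 4 of `"010110"` also begin valid codewords, but only the positions reachable from 0 are kept as codeword boundaries.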

443

Parallel processing architecture for H.264 deblocking filter on multi-core platforms  

NASA Astrophysics Data System (ADS)

Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low latency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions.
This work describes a scalable parallel architecture for an H.264 compliant deblocking filter for multi-core platforms such as HyperX technology. Parallel techniques such as parallel processing of independent macroblocks, sub blocks, and pixel row level are examined in this work. The deblocking architecture consists of a basic cell called the deblocking filter unit (DFU) and a dependent data buffer manager (DFM). The DFU can be used in several instances, catering to different performance needs; the DFM serves the data required for the different number of DFUs, and also manages all the neighboring data required for future data processing of DFUs. This approach achieves the scalability, flexibility, and performance excellence required in deblocking filters.

Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

2012-03-01

444

A taxonomy of parallel sorting  

Microsoft Academic Search

We propose a taxonomy of parallel sorting that encompasses a broad range of array- and file-sorting algorithms. We analyze how research on parallel sorting has evolved, from the earliest sorting networks to shared memory algorithms and VLSI sorters. In the context of sorting networks, we describe two fundamental parallel merging schemes: the odd-even and the bitonic merge. We discuss sorting

Dina Bitton; David J. DeWitt; David K. Hsaio; Jaishankar Menon

1984-01-01
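
Of the two merging schemes the abstract above names, the bitonic merge is easy to show compactly. The following is a minimal recursive sketch under stated assumptions: the function names are illustrative, the input length must be a power of two, and the compare-exchange loop that a sorting network would execute as one parallel stage is written here as an ordinary sequential loop.

```python
def bitonic_merge(seq, ascending=True):
    """Merge a bitonic sequence into sorted order (illustrative sketch)."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    seq = list(seq)
    half = n // 2
    # One network stage: the n/2 compare-exchanges touch disjoint
    # pairs, so they could all run at the same time.
    for i in range(half):
        if (seq[i] > seq[i + half]) == ascending:
            seq[i], seq[i + half] = seq[i + half], seq[i]
    # Both halves are now bitonic and can be merged independently.
    return (bitonic_merge(seq[:half], ascending)
            + bitonic_merge(seq[half:], ascending))


def bitonic_sort(values, ascending=True):
    """Sort a list whose length is a power of two via bitonic merging."""
    n = len(values)
    if n <= 1:
        return list(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    half = n // 2
    # Sort the halves in opposite directions to form a bitonic sequence.
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return bitonic_merge(first + second, ascending)
```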

445

Software-Reconfigurable Processors for Spacecraft  

NASA Technical Reports Server (NTRS)

A report presents an overview of an architecture for a software-reconfigurable network data processor for a spacecraft engaged in scientific exploration. When executed on suitable electronic hardware, the software performs the functions of a physical layer (in effect, acts as a software radio in that it performs modulation, demodulation, pulse-shaping, error correction, coding, and decoding), a data-link layer, a network layer, a transport layer, and application-layer processing of scientific data. The software-reconfigurable network processor is undergoing development to enable rapid prototyping and rapid implementation of communication, navigation, and scientific signal-processing functions; to provide a long-lived communication infrastructure; and to provide greatly improved scientific-instrumentation and scientific-data-processing functions by enabling science-driven in-flight reconfiguration of computing resources devoted to these functions. This development is an extension of terrestrial radio and network developments (e.g., in the cellular-telephone industry) implemented in software running on such hardware as field-programmable gate arrays, digital signal processors, traditional digital circuits, and mixed-signal application-specific integrated circuits (ASICs).

Farrington, Allen; Gray, Andrew; Bell, Bryan; Stanton, Valerie; Chong, Yong; Peters, Kenneth; Lee, Clement; Srinivasan, Jeffrey

2005-01-01

446

Efficient Breadth-First Search on the Cell/BE Processor  

SciTech Connect

Multi-core processors are a shift of paradigm in computer architecture that promises a dramatic increase in performance. But multi-core processors also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges involved in designing a breadth-first search (BFS) for advanced multi-core processors. The proposed methodology combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with a low-level implementation that embeds processor-specific optimizations. Using a fine-grained global coordination strategy derived from the Bulk-Synchronous Parallel (BSP) model we have determined an accurate performance model that has guided the implementation and the optimization of our algorithms. To validate our approach, we use a state-of-the-art multicore processor, the Cell Broadband Engine (Cell BE). Our experiments, obtained on a pre-production Cell BE board running at 3.2 GHz, show almost linear speedups when using multiple synergistic processing elements, and an impressive level of performance when compared to other processors. The Cell BE is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, an order of magnitude faster than the MTA-2 multi-threaded processor, and two orders of magnitude faster than a BlueGene/L processor. Index Terms: Multi-core processors, Parallel Computing, Cell Broadband Engine, Parallelization Techniques, Graph Exploration Algorithms, Breadth-First Search, BFS.

Scarpazza, Daniele P.; Villa, Oreste; Petrini, Fabrizio

2008-10-01
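
The machine-independent side of a BSP-style breadth-first search, as described in the abstract above, can be sketched as a level-synchronous loop in which each iteration plays the role of one superstep. This is only an assumed high-level outline: the function name `bfs_levels` and the dictionary-based adjacency format are illustrative, and none of the Cell BE specific optimizations mentioned in the paper are represented.

```python
def bfs_levels(adjacency, source):
    """Level-synchronous BFS sketch (hypothetical helper).

    adjacency: dict mapping vertex -> iterable of neighbours.
    Returns a dict mapping each reachable vertex to its BFS level.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        # Superstep, compute phase: every frontier vertex can be
        # expanded by a different worker, since the expansions only
        # read the graph.
        candidates = [nbr for v in frontier for nbr in adjacency.get(v, ())]

        # Superstep, synchronization phase: after the barrier, merge
        # the candidates and keep only vertices not seen before.
        next_frontier = []
        for nbr in candidates:
            if nbr not in level:
                level[nbr] = depth
                next_frontier.append(nbr)
        frontier = next_frontier
    return level
```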

447

Japanese document recognition and retrieval system using programmable SIMD processor  

NASA Astrophysics Data System (ADS)

This paper describes a new efficient information-filing system for a large number of documents. The system is designed to recognize Japanese characters and make full-text searches across a document database. Key components of the system are a small fully-programmable parallel processor for both recognition and retrieval, an image scanner for document input, and a personal computer as the operator console. The processor is constructed by a bit-serial single instruction multiple data stream architecture (SIMD) and all components including the 256 processor elements and 11 MB of RAM are integrated on one board. The recognition process divides a document into text lines, isolates each character, extracts character pattern features, and then identifies character categories. The entire process is performed by a single micro-program package down-loaded from the console. The recognition accuracy is more than 99.0% for about 3 printed Japanese characters at a performance speed of more than 14 characters per second. The processor can also be made available for high speed information retrieval by changing the down-loaded microprogram package. The retrieval process can obtain sentences that include the same information as an inquiry text from the database previously created through character recognition. Retrieval performance is very fast with 20 million individual Japanese characters being examined each second when the database is stored in the processor's IC memory. It was confirmed that a high performance but flexible and cost-effective document-information-processing system

Miyahara, Sueharu; Suzuki, Akira; Tada, Shunkichi; Kawatani, Takahiko

1991-02-01

448

Partitioning in parallel processing of production systems  

SciTech Connect

This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpreter with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.

Oflazer, K.

1987-01-01

449

CoNNeCT Baseband Processor Module  

NASA Technical Reports Server (NTRS)

A document describes the CoNNeCT Baseband Processor Module (BPM) based on an updated processor, memory technology, and field-programmable gate arrays (FPGAs). The BPM was developed from a requirement to provide sufficient computing power and memory storage to conduct experiments for a Software Defined Radio (SDR) to be implemented. The flight SDR uses the AT697 SPARC processor with on-chip data and instruction cache. The non-volatile memory has been increased from a 20-Mbit EEPROM (electrically erasable programmable read only memory) to a 4-Gbit Flash, managed by the RTAX2000 Housekeeper, allowing more programs and FPGA bit-files to be stored. The volatile memory has been increased from a 20-Mbit SRAM (static random access memory) to a 1.25-Gbit SDRAM (synchronous dynamic random access memory), providing additional memory space for more complex operating systems and programs to be executed on the SPARC. All memory is EDAC (error detection and correction) protected, while the SPARC processor implements fault protection via TMR (triple modular redundancy) architecture. Further capability over prior BPM designs includes the addition of a second FPGA to implement features beyond the resources of a single FPGA. Both FPGAs are implemented with Xilinx Virtex-II and are interconnected by a 96-bit bus to facilitate data exchange. Dedicated 1.25-Gbit SDRAMs are wired to each Xilinx FPGA to accommodate high rate data buffering for SDR applications as well as independent SpaceWire interfaces. The RTAX2000 manages scrub and configuration of each Xilinx.

Yamamoto, Clifford K.; Jedrey, Thomas C.; Gutrich, Daniel G.; Goodpasture, Richard L.

2011-01-01

450

Highly scalable linear solvers on thousands of processors.  

SciTech Connect

In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements in the AMG setup phase to a matrix-matrix multiplication kernel. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.

Domino, Stefan Paul (Sandia National Laboratories, Albuquerque, NM); Karlin, Ian (University of Colorado at Boulder, Boulder, CO); Siefert, Christopher (Sandia National Laboratories, Albuquerque, NM); Hu, Jonathan Joseph; Robinson, Allen Conrad (Sandia National Laboratories, Albuquerque, NM); Tuminaro, Raymond Stephen

2009-09-01

451

Programable Pipelined-Image Processor  

NASA Technical Reports Server (NTRS)

Computer serves as pipelined processor for imagery or other two-dimensional digital data. Processor does feature extraction, smoothing, edge detection, texture measurement, and stereoscopic area correlation. Also plans routes for obstacle avoidance by robots and solves two-dimensional partial differential equations. Image processor consists of modular units: each includes set of computing elements of types particularly useful in pipelined-image processing. Flexible interconnection scheme used to route data to subsequent stages of pipeline.

Gennery, D. B.; Wilcox, B.

1986-01-01

452

Parallel algorithms for finding trigonometric sums  

SciTech Connect

Parallel versions of Goertzel and Reinsch algorithms for finding trigonometric sums are introduced as a special case of efficient parallel algorithms for solving linear recurrence systems. The results of the experiments performed on a 20-processor Sequent Symmetry are presented and discussed.

Stpiczynski, P. [Marie Curie-Sklodowska Univ., Lublin (Poland)]; Paprzycki, M. [Univ. of Texas of the Permian Basin, Odessa, TX (United States)]

1995-12-01
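
For reference, the sequential Goertzel recurrence behind the abstract above evaluates both trigonometric sums, C = Σ a_k cos(kx) and S = Σ a_k sin(kx) for k = 0..n, with one second-order linear recurrence. The sketch below shows only this classical sequential form; the parallel formulation studied in the paper, which treats the recurrence as a linear recurrence system, is not reproduced, and the function name `goertzel_sums` is an assumption.

```python
import math

def goertzel_sums(coeffs, x):
    """Sequential Goertzel evaluation of trigonometric sums (sketch).

    coeffs: non-empty sequence a_0 .. a_n.
    Returns (sum a_k cos(kx), sum a_k sin(kx)).
    """
    two_cos = 2.0 * math.cos(x)
    # Backward recurrence u_k = a_k + 2 cos(x) u_{k+1} - u_{k+2},
    # with u_{n+1} = u_{n+2} = 0, run from k = n down to k = 1.
    u1 = u2 = 0.0
    for a in reversed(coeffs[1:]):
        u1, u2 = a + two_cos * u1 - u2, u1
    # One more step gives u_0; u1 and u2 now hold u_1 and u_2.
    u0 = coeffs[0] + two_cos * u1 - u2
    return u0 - u1 * math.cos(x), u1 * math.sin(x)
```

The recurrence is known to lose accuracy when x is close to a multiple of π, which is the situation the Reinsch modification is meant to address.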

453

Silicon Auditory Processors as Computer Peripherals  

E-print Network

Silicon Auditory Processors as Computer Peripherals John Lazzaro, John Wawrzynek CS Division UC describe an alternative output method for silicon auditory models, suitable for direct interface to digital

Lazzaro, John

454

CRBLASTER: A Fast Parallel-Processing Program for Cosmic Ray Rejection in Space-Based Observations  

NASA Astrophysics Data System (ADS)

Many astronomical image analysis tasks are based on algorithms that can be described as being embarrassingly parallel - where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousands of dollars. One reason for the shortage of state-of-the-art parallel processing astrophysical image-analysis codes is that the writing of parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called CRBLASTER which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. CRBLASTER is written in C using the industry standard Message Passing Interface library. Processing a single 800 x 800 Hubble Space Telescope Wide-Field Planetary Camera 2 (WFPC2) image takes 1.9 seconds using 4 processors on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with the 4 cores is 82%. The code has been designed to be used as a software framework for the easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms; all that needs to be done is to replace the core image processing task (in this case the C function that performs the L.A.Cosmic algorithm) with an alternative image analysis task based on a single processor algorithm. I describe the design and implementation of the program and then discuss how it could possibly be used to quickly do time-critical analysis applications such as those involved with space surveillance or do complex calibration tasks as part of the pipeline processing of images from large focal plane arrays.

Mighell, K.
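
The framework idea in the abstract above (split the image into subimages, run an unchanged single-processor task on each, reassemble) can be outlined as follows. This is a hedged Python sketch, not CRBLASTER itself, which is written in C with MPI: the names `split_rows`, `process_image`, and `clip_bright_pixels` are hypothetical, the per-pixel clipping task is a stand-in for L.A.Cosmic, and a neighbourhood-based algorithm would additionally need overlapping rows between bands. The thread pool only illustrates the structure; in CPython it does not provide CPU parallelism for pure-Python work.

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(n_rows, n_workers):
    """Return (start, stop) row ranges, one nearly equal band per worker."""
    base, extra = divmod(n_rows, n_workers)
    bounds, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def process_image(image, task, n_workers=4):
    """Apply a single-processor task to each band and stitch the results."""
    bands = [image[a:b] for a, b in split_rows(len(image), n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(task, bands))   # map preserves band order
    return [row for band in results for row in band]

def clip_bright_pixels(band, limit=100):
    """Stand-in task (assumption): clip pixel values above a limit."""
    return [[min(pixel, limit) for pixel in row] for row in band]
```

Swapping in a different analysis only requires passing another `task` function. As a check on the quoted figures, parallel efficiency is speedup divided by processor count, so 82% on 4 cores corresponds to a speedup of about 3.3.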

455

Adaptive optical processor  

NASA Astrophysics Data System (ADS)

The Phase 1 in-house effort to develop an optical processor as an electronic counter-counter-measure for radar multipath is discussed. The closed loop system demonstrates the ability to achieve a 15.2 +/- 2.4 dB adaptive cancellation of a single tone, single delay jamming signal over a 1-5 MHz bandwidth. The open loop optical system proves capable of providing a 30.2 +/- 3.9 dB cancellation of the same signal when using operator provided information. Key elements of this project include the characterization and selection of a spatial light modulator (SLM) system and the development of a software package which assists the minimization process. This has resulted in the novel use of a two dimensional binary SLM to perform as an enhanced grey scale one dimensional SLM. Additionally, an acousto-optic (AO) deflector has been demonstrated to provide the grey level dynamic range and spatial resolution required by the system. The results of this single channel testbed will be used in the future development of a multichannel optical processor.

Ward, Michael J.; Keefer, Christopher W.; Welstead, Stephen T.

1991-08-01

456

Reconfigurable data path processor  

NASA Technical Reports Server (NTRS)

A reconfigurable data path processor comprises a plurality of independent processing elements. Each of the processing elements advantageously comprises an identical architecture. Each processing element comprises a plurality of data processing means for generating a potential output. Each processor is also capable of through-putting an input as a potential output with little or no processing. Each processing element comprises a conditional multiplexer having a first conditional multiplexer input, a second conditional multiplexer input and a conditional multiplexer output. A first potential output value is transmitted to the first conditional multiplexer input, and a second potential output value is transmitted to the second conditional multiplexer input. The conditional multiplexer couples either the first conditional multiplexer input or the second conditional multiplexer input to the conditional multiplexer output, according to an output control command. The output control command is generated by processing a set of arithmetic status-bits through a logical mask. The conditional multiplexer output is coupled to a first processing element output. A first set of arithmetic bits are generated according to the processing of the first processable value. A second set of arithmetic bits may be generated from a second processing operation. The selection of the arithmetic status-bits is performed by an arithmetic-status bit multiplexer, which selects the desired set of arithmetic status bits from among the first and second set of arithmetic status bits. The conditional multiplexer evaluates the selected arithmetic status bits according to a logical mask defining an algorithm for evaluating the arithmetic status bits.

Donohoe, Gregory (Inventor)

2005-01-01

457