Multitasking domain decomposition fast Poisson solvers on the Cray Y-MP
NASA Technical Reports Server (NTRS)
Chan, Tony F.; Fatoohi, Rod A.
1990-01-01
The results of multitasking implementation of a domain decomposition fast Poisson solver on eight processors of the Cray Y-MP are presented. The object of this research is to study the performance of domain decomposition methods on a Cray supercomputer and to analyze the performance of different multitasking techniques using highly parallel algorithms. Two implementations of multitasking are considered: macrotasking (parallelism at the subroutine level) and microtasking (parallelism at the do-loop level). A conventional FFT-based fast Poisson solver is also multitasked. The results of different implementations are compared and analyzed. A speedup of over 7.4 on the Cray Y-MP running in a dedicated environment is achieved for all cases.
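For orientation, this is the structure of a conventional FFT-based fast Poisson solver of the kind multitasked above (a textbook sketch for Dirichlet boundary conditions; the paper's domain-decomposition variant additionally splits the grid among processors). The five-point discretization on a uniform n x n grid with spacing h reads
\[
u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{i,j} = h^2 f_{i,j},
\]
and a fast sine transform (an FFT variant) in the i-direction decouples it into n independent tridiagonal systems, one per wavenumber k:
\[
\hat{u}_{k,j-1} + \hat{u}_{k,j+1} + \left(2\cos\frac{k\pi}{n+1} - 4\right)\hat{u}_{k,j} = h^2 \hat{f}_{k,j},
\]
each solvable in O(n), for O(n^2 log n) overall. The per-wavenumber independence of the tridiagonal solves is exactly the do-loop-level parallelism that microtasking targets.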
Massively Parallel Solution of Poisson Equation on Coarse Grain MIMD Architectures
NASA Technical Reports Server (NTRS)
Fijany, A.; Weinberger, D.; Roosta, R.; Gulati, S.
1998-01-01
In this paper a new algorithm, designated as the Fast Invariant Imbedding algorithm, for the solution of the Poisson equation on vector and massively parallel MIMD architectures is presented. This algorithm achieves the same optimal computational efficiency as other Fast Poisson solvers while offering a much better structure for vector and parallel implementation. Our implementation on the Intel Delta and Paragon shows that a speedup of over two orders of magnitude can be achieved even for moderate-size problems.
Massively parallel algorithms for real-time wavefront control of a dense adaptive optics system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fijany, A.; Milman, M.; Redding, D.
1994-12-31
In this paper massively parallel algorithms and architectures for real-time wavefront control of a dense adaptive optics system (SELENE) are presented. The authors have already shown that the computation of a near-optimal control algorithm for SELENE can be reduced to the solution of a discrete Poisson equation on a regular domain. Although this represents an optimal computation, due to the large size of the system and the high sampling rate requirement, the implementation of this control algorithm poses a computationally challenging problem, since it demands a sustained computational throughput of the order of 10 GFlops. They develop a novel algorithm, designated as the Fast Invariant Imbedding algorithm, which offers a massive degree of parallelism with simple communication and synchronization requirements. Due to these features, this algorithm is significantly more efficient than other Fast Poisson solvers for implementation on massively parallel architectures. The authors also discuss two massively parallel, algorithmically specialized architectures for low-cost and optimal implementation of the Fast Invariant Imbedding algorithm.
NASA Technical Reports Server (NTRS)
Barnard, Stephen T.; Simon, Horst; Lasinski, T. A. (Technical Monitor)
1994-01-01
The design of a parallel implementation of multilevel recursive spectral bisection is described. The goal is to implement a code that is fast enough to enable dynamic repartitioning of adaptive meshes.
A general purpose subroutine for fast fourier transform on a distributed memory parallel machine
NASA Technical Reports Server (NTRS)
Dubey, A.; Zubair, M.; Grosch, C. E.
1992-01-01
A central issue in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. Different users may want to use the FFT routine with different data distributions; thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation is evaluated on the Intel iPSC/860 distributed memory parallel machine.
Fast parallel approach for 2-D DHT-based real-valued discrete Gabor transform.
Tao, Liang; Kwan, Hon Keung
2009-12-01
Two-dimensional fast Gabor transform algorithms are useful for real-time applications due to the high computational complexity of the traditional 2-D complex-valued discrete Gabor transform (CDGT). This paper presents two block time-recursive algorithms for the 2-D DHT-based real-valued discrete Gabor transform (RDGT) and its inverse transform, and develops a fast parallel approach for the implementation of the two algorithms. The computational complexity of the proposed parallel approach is analyzed and compared with that of the existing 2-D CDGT algorithms. The results indicate that the proposed parallel approach is attractive for real-time image processing.
NASA Technical Reports Server (NTRS)
Dagum, Leonardo
1989-01-01
The data parallel implementation of a particle simulation for hypersonic rarefied flow described by Dagum associates a single parallel data element with each particle in the simulation. The simulated space is divided into discrete regions called cells containing a variable and constantly changing number of particles. The implementation requires a global sort of the parallel data elements so as to arrange them in an order that allows immediate access to the information associated with cells in the simulation. Described here is a very fast algorithm for performing the necessary ranking of the parallel data elements. The performance of the new algorithm is compared with that of the microcoded instruction for ranking on the Connection Machine.
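A minimal serial sketch of the ranking step this abstract describes, with hypothetical names (the Connection Machine version computes the same ranks with data-parallel scan primitives rather than a Python loop): each particle carries the index of its cell, and the rank groups particles of the same cell contiguously via a counting sort.

```python
import numpy as np

def rank_by_cell(cell_of_particle, n_cells):
    # Counting sort: O(n_particles + n_cells), the serial analogue of
    # the parallel ranking primitive discussed in the abstract.
    counts = np.bincount(cell_of_particle, minlength=n_cells)
    start = np.zeros(n_cells, dtype=np.int64)
    start[1:] = np.cumsum(counts)[:-1]        # first slot of each cell
    ranks = np.empty_like(cell_of_particle)
    next_slot = start.copy()
    for i, c in enumerate(cell_of_particle):  # stable within a cell
        ranks[i] = next_slot[c]
        next_slot[c] += 1
    return ranks

cells = np.array([2, 0, 2, 1, 0])
print(rank_by_cell(cells, 3))                 # -> [3 0 4 2 1]
```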
S-HARP: A parallel dynamic spectral partitioner
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sohn, A.; Simon, H.
1998-01-01
Computational science problems with adaptive meshes involve dynamic load balancing when implemented on parallel machines. This dynamic load balancing requires fast partitioning of computational meshes at run time. The authors present in this report a fast parallel dynamic partitioner, called S-HARP. The underlying principles of S-HARP are the fast feature of inertial partitioning and the quality feature of spectral partitioning. S-HARP partitions a graph from scratch, requiring no partition information from previous iterations. Two types of parallelism have been exploited in S-HARP: fine-grain loop-level parallelism and coarse-grain recursive parallelism. The parallel partitioner has been implemented in Message Passing Interface on the Cray T3E and IBM SP2 for portability. Experimental results indicate that S-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.2 seconds on a 64-processor Cray T3E. S-HARP is much more scalable than other dynamic partitioners, giving over 15-fold speedup on 64 processors while ParaMeTiS1.0 gives only a few-fold speedup. Experimental results demonstrate that S-HARP is three to 10 times faster than the dynamic partitioners ParaMeTiS and Jostle on six computational meshes of size over 100,000 vertices.
Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael
2012-06-01
We present l₁-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l₁-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l₁-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l₁-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions.
FleCSPH - a parallel and distributed SPH implementation based on the FleCSI framework
DOE Office of Scientific and Technical Information (OSTI.GOV)
Junghans, Christoph; Loiseau, Julien
2017-06-20
FleCSPH is a multi-physics compact application that exercises FleCSI parallel data structures for tree-based particle methods. In particular, FleCSPH implements a smoothed-particle hydrodynamics (SPH) solver for the solution of Lagrangian problems in astrophysics and cosmology. FleCSPH includes support for gravitational forces using the fast multipole method (FMM).
Parallel heuristics for scalable community detection
Lu, Hao; Halappanavar, Mahantesh; Kalyanaraman, Ananth
2015-08-14
Community detection has become a fundamental operation in numerous graph-theoretic applications. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed in 2008, the method has become increasingly popular owing to its ability to detect high-modularity community partitions in a fast and memory-efficient manner. However, the method is also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real-world graphs derived from multiple application domains. Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in a comparable number of iterations or fewer, while providing real speedups of up to 16x using 32 threads.
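For reference, the modularity objective that the Louvain method optimizes (the standard definition from the serial method; the parallel heuristics change the order in which vertex moves are applied, not the objective):
\[
Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j),
\]
where A is the adjacency matrix, k_i the (weighted) degree of vertex i, m the total edge weight, and delta(c_i, c_j) = 1 when i and j are in the same community. Each Louvain sweep greedily moves a vertex to the neighboring community with the largest positive gain in Q; the sequential dependence between successive moves is precisely the barrier the parallel heuristics are designed to break.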
Multirate-based fast parallel algorithms for 2-D DHT-based real-valued discrete Gabor transform.
Tao, Liang; Kwan, Hon Keung
2012-07-01
Novel algorithms for the multirate and fast parallel implementation of the 2-D discrete Hartley transform (DHT)-based real-valued discrete Gabor transform (RDGT) and its inverse transform are presented in this paper. A 2-D multirate-based analysis convolver bank is designed for the 2-D RDGT, and a 2-D multirate-based synthesis convolver bank is designed for the 2-D inverse RDGT. The parallel channels in each of the two convolver banks have a unified structure and can apply the 2-D fast DHT algorithm to speed up their computations. The computational complexity of each parallel channel is low and is independent of the Gabor oversampling rate. All the 2-D RDGT coefficients of an image are computed in parallel during the analysis process and can be reconstructed in parallel during the synthesis process. The computational complexity and time of the proposed parallel algorithms are analyzed and compared with those of the existing fastest algorithms for 2-D discrete Gabor transforms. The results indicate that the proposed algorithms are the fastest, which makes them attractive for real-time image processing.
Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS.
Warris, Sven; Yalcin, Feyruz; Jackson, Katherine J L; Nap, Jan Peter
2015-01-01
To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation.
Performance of parallel computation using CUDA for solving the one-dimensional elasticity equations
NASA Astrophysics Data System (ADS)
Darmawan, J. B. B.; Mungkasi, S.
2017-01-01
In this paper, we investigate the performance of parallel computation in solving the one-dimensional elasticity equations. Elasticity equations are widely used in engineering science, and solving them quickly and efficiently is desirable. Therefore, we propose the use of parallel computation. Our parallel computation uses NVIDIA's CUDA. Our research results show that parallel computation using CUDA has a great advantage and is powerful when the computation is of large scale.
Potential Application of a Graphical Processing Unit to Parallel Computations in the NUBEAM Code
NASA Astrophysics Data System (ADS)
Payne, J.; McCune, D.; Prater, R.
2010-11-01
NUBEAM is a comprehensive computational Monte Carlo based model for neutral beam injection (NBI) in tokamaks. NUBEAM computes NBI-relevant profiles in tokamak plasmas by tracking the deposition and the slowing of fast ions. At the core of NUBEAM are vector calculations used to track fast ions. These calculations have recently been parallelized to run on MPI clusters. However, cost and interlink bandwidth limit the ability to fully parallelize NUBEAM on an MPI cluster. Recent implementation of double precision capabilities for Graphical Processing Units (GPUs) presents a cost effective and high performance alternative or complement to MPI computation. Commercially available graphics cards can achieve up to 672 GFLOPS double precision and can handle hundreds of thousands of threads. The ability to execute at least one thread per particle simultaneously could significantly reduce the execution time and the statistical noise of NUBEAM. Progress on implementation on a GPU will be presented.
Suplatov, Dmitry; Popova, Nina; Zhumatiy, Sergey; Voevodin, Vladimir; Švedas, Vytas
2016-04-01
Rapid expansion of online resources providing access to genomic, structural, and functional information associated with biological macromolecules opens an opportunity to gain a deeper understanding of the mechanisms of biological processes through systematic analysis of large datasets. This, however, requires novel strategies to optimally utilize computer processing power. Some methods in bioinformatics and molecular modeling require extensive computational resources. Other algorithms have fast implementations which take at most several hours to analyze a common input on a modern desktop station; however, due to multiple invocations over a large number of subtasks, the full task requires significant computing power. Therefore, an efficient computational solution to large-scale biological problems requires both a wise parallel implementation of resource-hungry methods and a smart workflow to manage multiple invocations of relatively fast algorithms. In this work, a new computer software, mpiWrapper, has been developed to accommodate non-parallel implementations of scientific algorithms within the parallel supercomputing environment. The Message Passing Interface has been implemented to exchange information between nodes. Two specialized threads, one for task management and communication and another for subtask execution, are invoked on each processing unit to avoid deadlock while using blocking calls to MPI. The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure. We show that this approach can be used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer. The C++ source code and documentation are available from http://biokinet.belozersky.msu.ru/mpiWrapper .
Hi-Corrector: a fast, scalable and memory-efficient package for normalizing large-scale Hi-C data.
Li, Wenyuan; Gong, Ke; Li, Qingjiao; Alber, Frank; Zhou, Xianghong Jasmine
2015-03-15
Genome-wide proximity ligation assays, e.g. Hi-C and its variant TCC, have recently become important tools to study spatial genome organization. Removing biases from chromatin contact matrices generated by such techniques is a critical preprocessing step of subsequent analyses. The continuing decline of sequencing costs has led to an ever-improving resolution of the Hi-C data, resulting in very large matrices of chromatin contacts. Such large matrices, however, pose a great challenge to the memory usage and speed of normalization. Therefore, there is an urgent need for fast and memory-efficient methods for normalization of Hi-C data. We developed Hi-Corrector, an easy-to-use, open source implementation of the Hi-C data normalization algorithm. Its salient features are (i) scalability: the software is capable of normalizing Hi-C data of any size in reasonable time; (ii) memory efficiency: the sequential version can run on any single computer with very limited memory, no matter how little; (iii) speed: the parallel version can run very fast on multiple computing nodes with limited local memory. The sequential version is implemented in ANSI C and can be easily compiled on any system; the parallel version is implemented in ANSI C with the MPI library (a standardized and portable parallel environment designed for solving large-scale scientific problems). The package is freely available at http://zhoulab.usc.edu/Hi-Corrector/.
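A minimal sequential sketch of the iterative-correction idea behind this kind of normalization (illustrative only: the function name is hypothetical, and Hi-Corrector's actual algorithm and its MPI partitioning of the matrix differ in detail). Each pass divides rows and columns by their current sums so that, at convergence, all row sums are equal:

```python
import numpy as np

def iterative_correction(M, n_iter=50):
    W = M.astype(float)
    for _ in range(n_iter):
        s = W.sum(axis=1)
        s /= s[s > 0].mean()        # keep the overall scale stable
        s[s == 0] = 1.0             # leave empty rows untouched
        W /= np.outer(s, s)         # symmetric row/column rescaling
    return W

M = np.array([[0., 4., 2.],
              [4., 0., 1.],
              [2., 1., 0.]])
print(iterative_correction(M).sum(axis=1))   # near-equal row sums
```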
Ergül, Özgür
2011-11-01
Fast and accurate solutions of large-scale electromagnetics problems involving homogeneous dielectric objects are considered. Problems are formulated with the electric and magnetic current combined-field integral equation and discretized with the Rao-Wilton-Glisson functions. Solutions are performed iteratively by using the multilevel fast multipole algorithm (MLFMA). For the solution of large-scale problems discretized with millions of unknowns, MLFMA is parallelized on distributed-memory architectures using a rigorous technique, namely, the hierarchical partitioning strategy. Efficiency and accuracy of the developed implementation are demonstrated on very large problems involving as many as 100 million unknowns.
Efficient implementation of parallel three-dimensional FFT on clusters of PCs
NASA Astrophysics Data System (ADS)
Takahashi, Daisuke
2003-05-01
In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of PCs. The three-dimensional FFT algorithm can be altered into a block three-dimensional FFT algorithm to reduce the number of cache misses. We show that the block three-dimensional FFT algorithm improves performance by utilizing the cache memory effectively. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT algorithm. We succeeded in obtaining performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz PC SMP cluster.
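The idea can be illustrated with a serial sketch (numpy stands in for the paper's hand-tuned kernels, and the block size is illustrative): a 3-D FFT factors into 1-D FFTs along each axis, and performing the poorly-strided passes on small blocks keeps the working set in cache.

```python
import numpy as np

def fft3d_blocked(a, block=8):
    nx, ny, nz = a.shape
    a = np.fft.fft(a, axis=2)                 # unit stride: cache friendly
    for z0 in range(0, nz, block):            # block the strided passes
        a[:, :, z0:z0 + block] = np.fft.fft(a[:, :, z0:z0 + block], axis=1)
    for z0 in range(0, nz, block):
        a[:, :, z0:z0 + block] = np.fft.fft(a[:, :, z0:z0 + block], axis=0)
    return a

x = np.random.rand(16, 16, 16)
assert np.allclose(fft3d_blocked(x), np.fft.fftn(x))
```

Blocking the axis-1 and axis-0 transforms over z-slices is valid because those transforms do not mix data across z.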
Procacci, Piero
2016-06-27
We present a new release (6.0β) of the ORAC program [Marsili et al. J. Comput. Chem. 2010, 31, 1106-1116] with a hybrid OpenMP/MPI (open multiprocessing/message passing interface) multilevel parallelism tailored for generalized ensemble (GE) and fast switching double annihilation (FS-DAM) nonequilibrium technology aimed at evaluating the binding free energy in drug-receptor systems on high performance computing platforms. The production of the GE or FS-DAM trajectories is handled using a weak scaling parallel approach on the MPI level only, while a strong scaling force decomposition scheme is implemented for intranode computations with shared memory access at the OpenMP level. The efficiency, simplicity, and inherent parallel nature of the ORAC implementation of the FS-DAM algorithm project the code as a possible effective tool for a second generation high throughput virtual screening in drug discovery and design. The code, along with documentation, testing, and ancillary tools, is distributed under the provisions of the General Public License and can be freely downloaded at www.chim.unifi.it/orac .
On a model of three-dimensional bursting and its parallel implementation
NASA Astrophysics Data System (ADS)
Tabik, S.; Romero, L. F.; Garzón, E. M.; Ramos, J. I.
2008-04-01
A mathematical model for the simulation of three-dimensional bursting phenomena and its parallel implementation are presented. The model consists of four nonlinearly coupled partial differential equations that include fast and slow variables, and exhibits bursting in the absence of diffusion. The differential equations have been discretized by means of a linearly-implicit finite difference method, second-order accurate in both space and time, on equally-spaced grids. The resulting system of linear algebraic equations at each time level has been solved by means of the Preconditioned Conjugate Gradient (PCG) method. Three different parallel implementations of the proposed mathematical model have been developed; two of these implementations, i.e., the MPI and the PETSc codes, are based on a message passing paradigm, while the third one, i.e., the OpenMP code, is based on a shared address space paradigm. These three implementations are evaluated on two current high performance parallel architectures, i.e., a dual-processor cluster and a Shared Distributed Memory (SDM) system. A novel representation of the results that emphasizes the most relevant factors affecting the performance of the parallel implementations is proposed. The comparative analysis of the computational results shows that the MPI and the OpenMP implementations are about twice as efficient as the PETSc code on the SDM system. It is also shown that, for the conditions reported here, the nonlinear dynamics of the three-dimensional bursting phenomena exhibits three stages characterized by asynchronous, synchronous and then asynchronous oscillations, before a quiescent state is reached. It is also shown that the fast system reaches steady state in much less time than the slow variables.
NASA Astrophysics Data System (ADS)
Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Hong, Yang; Zuo, Depeng; Ren, Minglei; Lei, Tianjie; Liang, Ke
2018-01-01
Hydrological model calibration has been a hot issue for decades. The shuffled complex evolution method developed at the University of Arizona (SCE-UA) has been proved to be an effective and robust optimization approach. However, its computational efficiency deteriorates significantly when the amount of hydrometeorological data increases. In recent years, the rise of heterogeneous parallel computing has brought hope for the acceleration of hydrological model calibration. This study proposed a parallel SCE-UA method and applied it to the calibration of a watershed rainfall-runoff model, the Xinanjiang model. The parallel method was implemented on heterogeneous computing systems using OpenMP and CUDA. Performance testing and sensitivity analysis were carried out to verify its correctness and efficiency. Comparison results indicated that heterogeneous parallel computing-accelerated SCE-UA converged much more quickly than the original serial version and possessed satisfactory accuracy and stability for the task of fast hydrological model calibration.
Kelly, Benjamin J; Fitch, James R; Hu, Yangqiu; Corsmeier, Donald J; Zhong, Huachun; Wetzel, Amy N; Nordquist, Russell D; Newsom, David L; White, Peter
2015-01-20
While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, complex implementation and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.
Ordered fast fourier transforms on a massively parallel hypercube multiprocessor
NASA Technical Reports Server (NTRS)
Tong, Charles; Swarztrauber, Paul N.
1989-01-01
Design alternatives for ordered Fast Fourier Transform (FFT) algorithms were examined on massively parallel hypercube multiprocessors such as the Connection Machine. Particular emphasis is placed on reducing communication, which is known to dominate the overall computing time. To this end, the order and computational phases of the FFT were combined, and sequence-to-processor maps that reduce communication were used. The class of ordered transforms is expanded to include any FFT in which the order of the transform is the same as that of the input sequence. Two such orderings are examined, namely, standard-order and A-order, which can be implemented with equal ease on the Connection Machine, where orderings are determined by geometries and priorities. If the sequence has N = 2^r elements and the hypercube has P = 2^d processors, then a standard-order FFT can be implemented with d + r/2 + 1 parallel transmissions. An A-order sequence can be transformed with 2d - r/2 parallel transmissions, which is r - d + 1 fewer than the standard order. A parallel method for computing the trigonometric coefficients is presented that does not use trigonometric functions or interprocessor communication. A performance of 0.9 GFLOPS was obtained for an A-order transform on the Connection Machine.
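Plugging illustrative numbers into the abstract's own counts:
\[
r = 20,\ d = 10\ \ (N = 2^{20}\ \text{points},\ P = 2^{10}\ \text{processors}):\qquad
d + \tfrac{r}{2} + 1 = 21, \qquad 2d - \tfrac{r}{2} = 10,
\]
so the A-order transform saves r - d + 1 = 11 parallel transmissions, consistent with the formulas above.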
[CMACPAR: a modified parallel neurocontroller for control processes].
Ramos, E; Surós, R
1999-01-01
CMACPAR is a parallel neurocontroller oriented to real-time systems such as process control. Its main characteristics are a fast learning algorithm, a reduced number of calculations, great generalization capacity, local learning, and intrinsic parallelism. This type of neurocontroller is used in real-time applications required by refineries, hydroelectric plants, factories, etc. In this work we present the analysis and the parallel implementation of a modified scheme of the cerebellar model CMAC for the n-dimensional space projection using a medium-granularity parallel neurocontroller. The proposed memory management allows for a significant reduction in training time and required memory size.
Zhu, Xiang; Zhang, Dianwen
2013-01-01
We present a fast, accurate and robust parallel Levenberg-Marquardt minimization optimizer, GPU-LMFit, which is implemented on a graphics processing unit for high-performance, scalable parallel model fitting. GPU-LMFit can provide a dramatic speed-up in massive model fitting analyses to enable real-time automated pixel-wise parametric imaging microscopy. We demonstrate the performance of GPU-LMFit for applications in superresolution localization microscopy and fluorescence lifetime imaging microscopy.
A distributed parallel storage architecture and its potential application within EOSDIS
NASA Technical Reports Server (NTRS)
Johnston, William E.; Tierney, Brian; Feuquay, Jay; Butzer, Tony
1994-01-01
We describe the architecture, implementation, and use of a scalable, high performance, distributed-parallel data storage system developed in the ARPA-funded MAGIC gigabit testbed. A collection of wide area distributed disk servers operate in parallel to provide logical block level access to large data sets. Operated primarily as a network-based cache, the architecture supports cooperation among independently owned resources to provide fast, large-scale, on-demand storage to support data handling, simulation, and computation.
Fast parallel tandem mass spectral library searching using GPU hardware acceleration.
Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K; Martin, Daniel B
2011-06-03
Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate-limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper, we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching), is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA, which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment.
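A serial sketch of that scoring kernel (hypothetical names and binning; in FastPaSS the library matrix is resident on the GPU and each candidate's dot product is evaluated by parallel threads). The acquired spectrum and every library candidate are binned onto a common m/z grid, and the match score is one normalized dot product per candidate:

```python
import numpy as np

def score_library(query, library):
    # One normalized dot product per library entry; a single
    # matrix-vector product stands in for the per-candidate GPU threads.
    q = query / np.linalg.norm(query)
    L = library / np.linalg.norm(library, axis=1, keepdims=True)
    return L @ q

rng = np.random.default_rng(0)
library = rng.random((1000, 2048))   # 1000 candidate spectra, 2048 bins
query = library[42] + 0.01 * rng.random(2048)
print(int(np.argmax(score_library(query, library))))   # -> 42
```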
NASA Technical Reports Server (NTRS)
Chew, W. C.; Song, J. M.; Lu, C. C.; Weedon, W. H.
1995-01-01
In the first phase of our work, we have concentrated on laying the foundation to develop fast algorithms, including the use of recursive structure like the recursive aggregate interaction matrix algorithm (RAIMA), the nested equivalence principle algorithm (NEPAL), the ray-propagation fast multipole algorithm (RPFMA), and the multi-level fast multipole algorithm (MLFMA). We have also investigated the use of curvilinear patches to build a basic method of moments code where these acceleration techniques can be used later. In the second phase, which is mainly reported on here, we have concentrated on implementing three-dimensional NEPAL on a massively parallel machine, the Connection Machine CM-5, and have been able to obtain some 3D scattering results. In order to understand the parallelization of codes on the Connection Machine, we have also studied the parallelization of 3D finite-difference time-domain (FDTD) code with PML material absorbing boundary condition (ABC). We found that simple algorithms like the FDTD with material ABC can be parallelized very well allowing us to solve within a minute a problem of over a million nodes. In addition, we have studied the use of the fast multipole method and the ray-propagation fast multipole algorithm to expedite matrix-vector multiplication in a conjugate-gradient solution to integral equations of scattering. We find that these methods are faster than LU decomposition for one incident angle, but are slower than LU decomposition when many incident angles are needed as in the monostatic RCS calculations.
Digital tomosynthesis mammography using a parallel maximum-likelihood reconstruction method
NASA Astrophysics Data System (ADS)
Wu, Tao; Zhang, Juemin; Moore, Richard; Rafferty, Elizabeth; Kopans, Daniel; Meleis, Waleed; Kaeli, David
2004-05-01
A parallel reconstruction method, based on an iterative maximum likelihood (ML) algorithm, is developed to provide fast reconstruction for digital tomosynthesis mammography. Tomosynthesis mammography acquires 11 low-dose projections of a breast by moving an x-ray tube over a 50° angular range. In parallel reconstruction, each projection is divided into multiple segments along the chest-to-nipple direction. Using the 11 projections, segments located at the same distance from the chest wall are combined to compute a partial reconstruction of the total breast volume. The shape of the partial reconstruction forms a thin slab, angled toward the x-ray source at a projection angle of 0°. The reconstruction of the total breast volume is obtained by merging the partial reconstructions. The overlap region between neighboring partial reconstructions and neighboring projection segments is utilized to compensate for the incomplete data at the boundary locations present in the partial reconstructions. A serial execution of the reconstruction is compared to a parallel implementation, using clinical data. The serial code was run on a PC with a single Pentium IV 2.2 GHz CPU. The parallel implementation was developed using MPI and run on a 64-node Linux cluster using 800 MHz Itanium CPUs. The serial reconstruction for a medium-sized breast (5 cm thickness, 11 cm chest-to-nipple distance) takes 115 minutes, while the parallel implementation takes only 3.5 minutes. The reconstruction time for a larger breast using a serial implementation takes 187 minutes, while a parallel implementation takes 6.5 minutes. No significant differences were observed between the reconstructions produced by the serial and parallel implementations.
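The abstract does not reproduce the update rule; for orientation, a representative ML-EM iteration for this class of reconstructions is shown below, with x the volume, y the projection data, and a_ij the system matrix (the paper's transmission-specific variant differs in detail):
\[
x_j^{(k+1)} \;=\; \frac{x_j^{(k)}}{\sum_i a_{ij}} \sum_i a_{ij}\,\frac{y_i}{\sum_l a_{il}\, x_l^{(k)}}.
\]
Each iteration is dominated by a forward projection and a backprojection, which is the work that the segment-wise partitioning distributes across cluster nodes.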
Massively Parallel Processing for Fast and Accurate Stamping Simulations
NASA Astrophysics Data System (ADS)
Gress, Jeffrey J.; Xu, Siguang; Joshi, Ramesh; Wang, Chuan-tao; Paul, Sabu
2005-08-01
The competitive automotive market drives automotive manufacturers to speed up vehicle development cycles and reduce lead time. Fast tooling development is one of the key areas supporting fast and short vehicle development programs (VDP). In the past ten years, stamping simulation has become the most effective validation tool in predicting and resolving all potential formability and quality problems before the dies are physically made. Stamping simulation and formability analysis has become a critical business segment in GM's math-based die engineering process. As simulation becomes one of the major production tools in the engineering factory, simulation speed and accuracy are two of the most important measures of stamping simulation technology. The speed and time-in-system of forming analysis become even more critical to support fast VDPs and tooling readiness. Since 1997, the General Motors Die Center has been working jointly with our software vendor to develop and implement a parallel version of simulation software for mass-production analysis applications. By 2001, this technology had matured in the form of distributed memory processing (DMP) of draw die simulations in a networked distributed-memory computing environment. In 2004, this technology was refined to massively parallel processing (MPP) and extended to line die forming analysis (draw, trim, flange, and associated spring-back) running on a dedicated computing environment. The evolution of this technology and the insight gained through the implementation of DMP/MPP technology, as well as performance benchmarks, are discussed in this publication.
GPU Lossless Hyperspectral Data Compression System for Space Applications
NASA Technical Reports Server (NTRS)
Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled
2012-01-01
On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA. The GPU implementation on an NVIDIA GeForce GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times over a software implementation running on a 3.47 GHz single-core Intel Xeon processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
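A toy illustration of the predictive principle (not the FL algorithm itself, whose adaptive filter and entropy coder are more elaborate): predict each band from its predecessor and store only the residuals, which are small, peaked around zero, and cheap to entropy-code; the inverse is exact, so the scheme is lossless.

```python
import numpy as np

def band_residuals(cube):
    """cube: (bands, rows, cols) integer samples."""
    res = cube.astype(np.int64)
    res[1:] -= cube[:-1]           # predict band b from band b-1
    return res                     # band 0 is stored verbatim

def reconstruct(res):
    return np.cumsum(res, axis=0)  # exact inverse: lossless

cube = np.random.randint(0, 4096, size=(5, 4, 4))
assert (reconstruct(band_residuals(cube)) == cube).all()
```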
Efficient multitasking of Choleski matrix factorization on CRAY supercomputers
NASA Technical Reports Server (NTRS)
Overman, Andrea L.; Poole, Eugene L.
1991-01-01
A Choleski method is described and used to solve linear systems of equations that arise in large scale structural analysis. The method uses a novel variable-band storage scheme and is structured to exploit fast local memory caches while minimizing data access delays between main memory and vector registers. Several parallel implementations of this method are described for the CRAY-2 and CRAY Y-MP computers demonstrating the use of microtasking and autotasking directives. A portable parallel language, FORCE, is used for comparison with the microtasked and autotasked implementations. Results are presented comparing the matrix factorization times for three representative structural analysis problems from runs made in both dedicated and multi-user modes on both computers. CPU and wall clock timings are given for the parallel implementations and are compared to single processor timings of the same algorithm.
Further optimization of SeDDaRA blind image deconvolution algorithm and its DSP implementation
NASA Astrophysics Data System (ADS)
Wen, Bo; Zhang, Qiheng; Zhang, Jianlin
2011-11-01
Efficient algorithms for blind image deconvolution and their high-speed implementation are of great value in practice. A further optimization of SeDDaRA is developed, from algorithm structure to numerical calculation methods. The main optimizations cover modularization of the structure for implementation feasibility, reduction of the computation and data dependency of the 2-D FFT/IFFT, and acceleration of the power operation by a segmented look-up table. The Fast SeDDaRA is then proposed and specialized for low complexity. As the final implementation, a hardware image restoration system is built using multi-DSP parallel processing. Experimental results show that the processing time and memory demand of Fast SeDDaRA decrease by at least 50%, and the data throughput of the image restoration system is over 7.8 Msps. The optimization is shown to be efficient and feasible, and Fast SeDDaRA is able to support real-time applications.
Large-scale Linear Optimization through Machine Learning: From Theory to Practical System Design and Implementation (report AFRL-AFOSR-JP-TR-2016-0073)
2016-08-10
[Only a fragment of the abstract survives extraction of the report documentation page:] "... performances on various machine learning tasks and it naturally lends itself to fast parallel implementations. Despite this, very little work has been ..."
A parallel approach of COFFEE objective function to multiple sequence alignment
NASA Astrophysics Data System (ADS)
Zafalon, G. F. D.; Visotaky, J. M. V.; Amorim, A. R.; Valêncio, C. R.; Neves, L. A.; de Souza, R. C. G.; Machado, J. M.
2015-09-01
Computational tools to assist genomic analyses are increasingly necessary due to the rapidly growing amount of available data. Given the high computational cost of deterministic algorithms for sequence alignment, many works concentrate their efforts on the development of heuristic approaches to multiple sequence alignment. However, the selection of an approach that offers solutions with good biological significance and feasible execution time is a great challenge. Thus, this work presents the parallelization of the processing steps of the MSA-GA tool, using the multithread paradigm in the execution of the COFFEE objective function. The standard objective function implemented in the tool is the Weighted Sum of Pairs (WSP), which produces some distortions in the final alignments when sets of sequences with low similarity are aligned. In previous studies we implemented the COFFEE objective function in the tool to smooth these distortions. Although the nature of the COFFEE objective function increases execution time, the approach contains steps that can be executed in parallel. With the improvements implemented in this work, the new approach is 24% faster than the sequential approach with COFFEE. Moreover, the COFFEE multithreaded approach is more efficient than WSP because, besides being slightly faster, its biological results are better.
Exploiting Symmetry on Parallel Architectures.
NASA Astrophysics Data System (ADS)
Stiller, Lewis Benjamin
1995-01-01
This thesis describes techniques for the design of parallel programs that solve well-structured problems with inherent symmetry. Part I demonstrates the reduction of such problems to generalized matrix multiplication by a group-equivariant matrix. Fast techniques for this multiplication are described, including factorization, orbit decomposition, and Fourier transforms over finite groups. Our algorithms entail interaction between two symmetry groups: one arising at the software level from the problem's symmetry and the other arising at the hardware level from the processors' communication network. Part II illustrates the applicability of our symmetry-exploitation techniques by presenting a series of case studies of the design and implementation of parallel programs. First, a parallel program that solves chess endgames by factorization of an associated dihedral group-equivariant matrix is described. This code runs faster than previous serial programs, and discovered a number of new results. Second, parallel algorithms for Fourier transforms for finite groups are developed, and preliminary parallel implementations for group transforms of dihedral and of symmetric groups are described. Applications in learning, vision, pattern recognition, and statistics are proposed. Third, parallel implementations solving several computational science problems are described, including the direct n-body problem, convolutions arising from molecular biology, and some communication primitives such as broadcast and reduce. Some of our implementations ran orders of magnitude faster than previous techniques, and were used in the investigation of various physical phenomena.
Overview of implementation of DARPA GPU program in SAIC
NASA Astrophysics Data System (ADS)
Braunreiter, Dennis; Furtek, Jeremy; Chen, Hai-Wen; Healy, Dennis
2008-04-01
This paper reviews the implementation of the DARPA MTO STAP-BOY program for both Phase I and II conducted at Science Applications International Corporation (SAIC). The STAP-BOY program develops fast covariance factorization and tuning techniques for space-time adaptive processing (STAP) algorithm implementation on graphics processing unit (GPU) architectures for embedded systems. The first part of our presentation on the DARPA STAP-BOY program will focus on GPU implementation and algorithm innovations for a prototype radar STAP algorithm. The STAP algorithm will be implemented on the GPU, using stream programming (from companies such as PeakStream, ATI Technologies' CTM, and NVIDIA) and traditional graphics APIs. This algorithm will include fast range-adaptive STAP weight updates and beamforming applications, each of which has been modified to exploit the parallel nature of graphics architectures.
Bit error rate tester using fast parallel generation of linear recurring sequences
Pierson, Lyndon G.; Witzke, Edward L.; Maestas, Joseph H.
2003-05-06
A fast method for generating linear recurring sequences by parallel linear recurring sequence generators (LRSGs) with a feedback circuit optimized to balance minimum propagation delay against maximal sequence period. Parallel generation of linear recurring sequences requires decimating the sequence (creating small contiguous sections of the sequence in each LRSG). A companion matrix form is selected depending on whether the LFSR is right-shifting or left-shifting. The companion matrix is completed by selecting a primitive irreducible polynomial with 1's most closely grouped in a corner of the companion matrix. A decimation matrix is created by raising the companion matrix to the (n*k)th power, where k is the number of parallel LRSGs and n is the number of bits to be generated at a time by each LRSG. Companion matrices with 1's closely grouped in a corner will yield sparse decimation matrices. A feedback circuit comprised of XOR logic gates implements the decimation matrix in hardware. Sparse decimation matrices can be implemented with a minimum number of XOR gates, and therefore a minimum propagation delay through the feedback circuit. The LRSG of the invention is particularly well suited to use as a bit error rate tester on high speed communication lines because it permits the receiver to synchronize to the transmitted pattern within 2n bits.
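A small sketch of the decimation algebra over GF(2), with an illustrative polynomial (x^4 + x + 1) and shift convention (the patent's hardware realizes the decimation matrix directly as XOR trees): one LFSR step is multiplication of the state by the companion matrix C, so stepping k positions at once is multiplication by D = C^k mod 2.

```python
import numpy as np

def companion_gf2(n, taps):
    C = np.zeros((n, n), dtype=np.uint8)
    C[:-1, 1:] = np.eye(n - 1, dtype=np.uint8)  # shift: new[i] = s[i+1]
    for t in taps:
        C[n - 1, t] = 1                         # feedback: XOR of tapped bits
    return C

def pow_gf2(M, k):
    # Square-and-multiply matrix power, reduced mod 2 at each step.
    R = np.eye(M.shape[0], dtype=np.uint8)
    while k:
        if k & 1:
            R = (R @ M) & 1
        M = (M @ M) & 1
        k >>= 1
    return R

C = companion_gf2(4, (0, 1))        # x^4 + x + 1 (primitive)
D = pow_gf2(C, 3)                   # decimation matrix: 3 steps at once
s = np.array([1, 0, 0, 1], dtype=np.uint8)
three_steps = (C @ (C @ (C @ s))) & 1
assert ((D @ s) & 1 == three_steps).all()
```

Seeding k generators with consecutive states and advancing each by D yields k interleaved subsequences of the original stream, which is the decimation the abstract describes.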
fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data.
Hung, Ling-Hong; Samudrala, Ram
2014-06-15
fast_protein_cluster is a fast, parallel and memory efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project. fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco. fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom streaming Single Instruction Multiple Data (SIMD) extensions and advanced vector extension intrinsics code accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the M.I.T. license (http://software.compbio.washington.edu/fast_protein_cluster).
Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY
2012-01-10
The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network, thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is also efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
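A single-node analogue of the claimed scheme, with a transpose standing in for the network "all-to-all" (a sketch; on the actual machine each node holds a slab of rows, the transpose is the communication step, and its randomized message order is the patented part):

```python
import numpy as np

def fft2_by_transpose(a):
    a = np.fft.fft(a, axis=1)     # first dimension: local 1-D FFTs
    a = a.T.copy()                # redistribution ("all-to-all" stand-in)
    a = np.fft.fft(a, axis=1)     # second dimension, now local
    return a.T                    # back to the original layout

x = np.random.rand(64, 64)
assert np.allclose(fft2_by_transpose(x), np.fft.fft2(x))
```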
Turbomachinery CFD on parallel computers
NASA Technical Reports Server (NTRS)
Blech, Richard A.; Milner, Edward J.; Quealy, Angela; Townsend, Scott E.
1992-01-01
The role of multistage turbomachinery simulation in the development of propulsion system models is discussed. Particularly, the need for simulations with higher fidelity and faster turnaround time is highlighted. It is shown how such fast simulations can be used in engineering-oriented environments. The use of parallel processing to achieve the required turnaround times is discussed. Current work by several researchers in this area is summarized. Parallel turbomachinery CFD research at the NASA Lewis Research Center is then highlighted. These efforts are focused on implementing the average-passage turbomachinery model on MIMD, distributed memory parallel computers. Performance results are given for inviscid, single blade row and viscous, multistage applications on several parallel computers, including networked workstations.
Parallel processing in a host plus multiple array processor system for radar
NASA Technical Reports Server (NTRS)
Barkan, B. Z.
1983-01-01
Host plus multiple array processor architecture is demonstrated to yield a modular, fast, and cost-effective system for radar processing. Software methodology for programming such a system is developed. Parallel processing with pipelined data flow among the host, array processors, and discs is implemented. Theoretical analysis of performance is made and experimentally verified. The broad class of problems to which the architecture and methodology can be applied is indicated.
An Efficient Pipeline Wavefront Phase Recovery for the CAFADIS Camera for Extremely Large Telescopes
Magdaleno, Eduardo; Rodríguez, Manuel; Rodríguez-Ramos, José Manuel
2010-01-01
In this paper we show a fast, specialized hardware implementation of the wavefront phase recovery algorithm using the CAFADIS camera. The CAFADIS camera is a new plenoptic sensor patented by the Universidad de La Laguna (Canary Islands, Spain): international patent PCT/ES2007/000046 (WIPO publication number WO/2007/082975). It can simultaneously measure the wavefront phase and the distance to the light source in a real-time process. The pipeline algorithm is implemented using Field Programmable Gate Arrays (FPGAs). These devices present an architecture capable of handling the sensor output stream using a massively parallel approach, and they are efficient enough to resolve several Adaptive Optics (AO) problems in Extremely Large Telescopes (ELTs) in terms of processing time requirements. The FPGA implementation of the wavefront phase recovery algorithm using the CAFADIS camera is based on the very fast computation of two-dimensional fast Fourier transforms (FFTs). Thus we have carried out a comparison between our novel FPGA 2D-FFT and other implementations.
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kargaran, Hamed, E-mail: h-kargaran@sbu.ac.ir; Minuchehr, Abdolhamid; Zolfaghari, Ahmad
The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes: GLOBAL-MODE and SHARED-MODE. To generate parallel random numbers based on the independent sequence method, the combination of the middle-square method and a chaotic map along with the Xorshift PRNG have been employed. Implementation of the developed GPPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of the PRNG on a single CPU core) for GLOBAL-MODE and SHARED-MODE, respectively. To evaluate the accuracy of the developed GPPRNG, its performance was compared to that of some other commercially available PRNGs such as those of MATLAB and FORTRAN and the Miller-Park algorithm through employing the specific standard tests. The results of this comparison show that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
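For concreteness, here is a minimal sketch of the kind of ingredients named above: a middle-square step decorrelating per-thread seeds that feed a standard xorshift32 core. The exact combination (and the chaotic map) used in the paper is not given in the abstract, so this composition is an assumption:

```python
def middle_square(seed, digits=8):
    """One middle-square step: square the seed and keep the middle digits."""
    s = str(seed * seed).zfill(2 * digits)
    mid = len(s) // 2
    return int(s[mid - digits // 2: mid + digits // 2]) or 1  # never zero

def xorshift32(state):
    """Standard 32-bit xorshift; returns (uniform draw, new state)."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state / 2**32, state

# independent sequences: each "thread" gets its own decorrelated state
states = [middle_square(1234 + t) for t in range(4)]
for step in range(3):
    draws = []
    for t in range(4):
        u, states[t] = xorshift32(states[t])
        draws.append(round(u, 6))
    print(draws)
```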
FastMag: Fast micromagnetic simulator for complex magnetic structures (invited)
NASA Astrophysics Data System (ADS)
Chang, R.; Li, S.; Lubarda, M. V.; Livshitz, B.; Lomakin, V.
2011-04-01
A fast micromagnetic simulator (FastMag) for general problems is presented. FastMag solves the Landau-Lifshitz-Gilbert equation and can handle multiscale problems with a high computational efficiency. The simulator derives its high performance from efficient methods for evaluating the effective field and from implementations on massively parallel graphics processing unit (GPU) architectures. FastMag discretizes the computational domain into tetrahedral elements and therefore is highly flexible for general problems. The magnetostatic field is computed via the superposition principle for both volume and surface parts of the computational domain. This is accomplished by implementing efficient quadrature rules and analytical integration for overlapping elements in which the integral kernel is singular. Thus, discretized superposition integrals are computed using a nonuniform grid interpolation method, which evaluates the field from N sources at N collocated observers in O(N) operations. This approach allows handling objects of arbitrary shape, allows easy calculation of the field outside the magnetized domains, does not require solving a linear system of equations, and requires little memory. FastMag is implemented on GPUs with GPU-central processing unit speed-ups of 2 orders of magnitude. Simulations are shown of a large array of magnetic dots and a recording head fully discretized down to the exchange length, with over a hundred million tetrahedral elements on an inexpensive desktop computer.
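At the core of any such simulator is a time step of the Landau-Lifshitz-Gilbert equation; the reduced-units sketch below shows that step for a batch of moments. FastMag's actual field evaluation and GPU kernels are far more elaborate, and `alpha`, `dt`, and the field here are illustrative placeholders:

```python
import numpy as np

def llg_step(m, h, dt, alpha=0.1):
    """One explicit Landau-Lifshitz-Gilbert step, reduced units (gamma=1)."""
    mxh = np.cross(m, h)                                 # precession torque
    dmdt = -(mxh + alpha * np.cross(m, mxh)) / (1 + alpha**2)
    m = m + dt * dmdt
    return m / np.linalg.norm(m, axis=1, keepdims=True)  # keep |m| = 1

m = np.tile([1.0, 0.0, 0.01], (5, 1))
m /= np.linalg.norm(m, axis=1, keepdims=True)
h = np.tile([0.0, 0.0, 1.0], (5, 1))                     # effective field along z
for _ in range(1000):
    m = llg_step(m, h, dt=0.01)
print(m[0])  # precesses about z while damping toward it
```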
NASA Astrophysics Data System (ADS)
Kim, Stephan D.; Luo, Jiajun; Buchholz, D. Bruce; Chang, R. P. H.; Grayson, M.
2016-09-01
A modular time division multiplexer (MTDM) device is introduced to enable parallel measurement of multiple samples with both fast and slow decay transients spanning from millisecond to month-long time scales. This is achieved by dedicating a single high-speed measurement instrument for rapid data collection at the start of a transient, and by multiplexing a second low-speed measurement instrument for slow data collection of several samples in parallel for the later transients. The MTDM is a high-level design concept that can in principle measure an arbitrary number of samples, and the low cost implementation here allows up to 16 samples to be measured in parallel over several months, reducing the total ensemble measurement duration and equipment usage by as much as an order of magnitude without sacrificing fidelity. The MTDM was successfully demonstrated by simultaneously measuring the photoconductivity of three amorphous indium-gallium-zinc-oxide thin films with 20 ms data resolution for fast transients and an uninterrupted parallel run time of over 20 days. The MTDM has potential applications in many areas of research that manifest response times spanning many orders of magnitude, such as photovoltaics, rechargeable batteries, amorphous semiconductors such as silicon and amorphous indium-gallium-zinc-oxide.
Fast Acceleration of 2D Wave Propagation Simulations Using Modern Computational Accelerators
Wang, Wei; Xu, Lifan; Cavazos, John; Huang, Howie H.; Kay, Matthew
2014-01-01
Recent developments in modern computational accelerators like Graphics Processing Units (GPUs) and coprocessors provide great opportunities for making scientific applications run faster than ever before. However, efficient parallelization of scientific code using new programming tools like CUDA requires a high level of expertise that is not available to many scientists. This, plus the fact that parallelized code is usually not portable to different architectures, creates major challenges for exploiting the full capabilities of modern computational accelerators. In this work, we sought to overcome these challenges by studying how to achieve both automated parallelization using OpenACC and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as the Intel Many Integrated Core (MIC) coprocessor to reduce the run time of wave propagation simulations. We used a well-established 2D cardiac action potential model as a specific case study. To the best of our knowledge, we are the first to study auto-parallelization of 2D cardiac wave propagation simulations using OpenACC. Our results identify several approaches that provide substantial speedups. The OpenACC-generated GPU code ran substantially faster than the sequential implementation and required the addition of only a few OpenACC pragmas to the code. An OpenCL implementation provided further speedups on GPUs over both the sequential implementation and a parallelized OpenMP implementation. An implementation of OpenMP on the Intel MIC coprocessor also provided speedups with only a few code changes to the sequential implementation. We highlight that OpenACC provides an automatic, efficient, and portable approach to achieve parallelization of 2D cardiac wave simulations on GPUs. Our approach of using OpenACC, OpenCL, and OpenMP to parallelize this particular model on modern computational accelerators should be applicable to other computational models of wave propagation in multi-dimensional media. PMID:24497950
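To make the workload concrete, the sketch below shows the kind of 2D stencil update such simulations parallelize; a real cardiac model couples this diffusion step to nonlinear ionic currents, and OpenACC would annotate the equivalent C/Fortran loops with pragmas. Parameter values are illustrative.

```python
import numpy as np

def step(v, d=0.1, dt=0.1):
    """One explicit diffusion step (periodic boundaries via np.roll)."""
    lap = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
           np.roll(v, 1, 1) + np.roll(v, -1, 1) - 4 * v)
    return v + dt * d * lap

v = np.zeros((128, 128))
v[60:68, 60:68] = 1.0            # initial stimulus patch
for _ in range(100):
    v = step(v)
print(float(v.max()))            # the pulse spreads and decays
```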
NASA Technical Reports Server (NTRS)
Wigton, Larry
1996-01-01
Improvements to the numerical linear algebra routines used in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR, are reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models was written. The primary focus of this work was devoted to improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.
Liwo, Adam; Ołdziej, Stanisław; Czaplewski, Cezary; Kleinerman, Dana S.; Blood, Philip; Scheraga, Harold A.
2010-01-01
We report the implementation of our united-residue UNRES force field for simulations of protein structure and dynamics with massively parallel architectures. In addition to coarse-grained parallelism already implemented in our previous work, in which each conformation was treated by a different task, we introduce a fine-grained level in which energy and gradient evaluation are split between several tasks. The Message Passing Interface (MPI) libraries have been utilized to construct the parallel code. The parallel performance of the code has been tested on a professional Beowulf cluster (Xeon Quad Core), a Cray XT3 supercomputer, and two IBM BlueGene/P supercomputers with canonical and replica-exchange molecular dynamics. With IBM BlueGene/P, about 50 % efficiency and 120-fold speed-up of the fine-grained part was achieved for a single trajectory of a 767-residue protein with use of 256 processors/trajectory. Because of averaging over the fast degrees of freedom, UNRES provides an effective 1000-fold speed-up compared to the experimental time scale and, therefore, enables us to effectively carry out millisecond-scale simulations of proteins with 500 and more amino-acid residues in days of wall-clock time. PMID:20305729
Fast, Parallel and Secure Cryptography Algorithm Using Lorenz's Attractor
NASA Astrophysics Data System (ADS)
Marco, Anderson Gonçalves; Martinez, Alexandre Souto; Bruno, Odemir Martinez
A novel cryptography method based on the Lorenz's attractor chaotic system is presented. The proposed algorithm is secure and fast, making it practical for general use. We introduce the chaotic operation mode, which provides an interaction among the password, message and a chaotic system. It ensures that the algorithm yields a secure codification, even if the nature of the chaotic system is known. The algorithm has been implemented in two versions: one sequential and slow and the other, parallel and fast. Our algorithm assures the integrity of the ciphertext (we know if it has been altered, which is not assured by traditional algorithms) and consequently its authenticity. Numerical experiments are presented, discussed and show the behavior of the method in terms of security and performance. The fast version of the algorithm has a performance comparable to AES, a popular cryptography program used commercially nowadays, but it is more secure, which makes it immediately suitable for general purpose cryptography applications. An internet page has been set up, which enables the readers to test the algorithm and also to try to break into the cipher.
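A toy sketch of the general scheme follows (the password keys a chaotic trajectory whose samples form a keystream). The paper's chaotic operation mode and integrity mechanism go well beyond this illustration, and the key-derivation and whitening choices below are assumptions:

```python
import hashlib

def lorenz_keystream(password, nbytes, sigma=10.0, rho=28.0, beta=8/3):
    """Derive (x, y, z) from the password, integrate Lorenz, emit bytes."""
    h = hashlib.sha256(password.encode()).digest()
    x, y, z = 1 + h[0] / 256, 1 + h[1] / 256, 20 + h[2] / 256
    dt, out = 0.005, bytearray()
    for _ in range(200):                      # discard the initial transient
        x, y, z = (x + dt * sigma * (y - x),
                   y + dt * (x * (rho - z) - y),
                   z + dt * (x * y - beta * z))
    while len(out) < nbytes:
        x, y, z = (x + dt * sigma * (y - x),
                   y + dt * (x * (rho - z) - y),
                   z + dt * (x * y - beta * z))
        out.append(int(abs(x) * 1e6) % 256)   # whiten a trajectory sample
    return bytes(out)

msg = b"parallel fast secure"
ks = lorenz_keystream("password", len(msg))
ct = bytes(m ^ k for m, k in zip(msg, ks))          # encrypt by XOR
assert bytes(c ^ k for c, k in zip(ct, ks)) == msg  # decrypt with same key
```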
Implementation of a high-speed face recognition system that uses an optical parallel correlator.
Watanabe, Eriko; Kodate, Kashiko
2005-02-10
We implement a fully automatic fast face recognition system by using a 1000 frame/s optical parallel correlator designed and assembled by us. The operational speed for the 1:N (i.e., matching one image against N, where N refers to the number of images in the database) identification experiment (4000 face images) amounts to less than 1.5 s, including the preprocessing and postprocessing times. The binary real-only matched filter is devised for the sake of face recognition, and the system is optimized by the false-rejection rate (FRR) and the false-acceptance rate (FAR), according to 300 samples selected by the biometrics guideline. From trial 1:N identification experiments with the optical parallel correlator, we acquired low error rates of 2.6% FRR and 1.3% FAR. Facial images of people wearing thin glasses or heavy makeup that rendered identification difficult were identified with this system.
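Digitally, the correlation the optical system performs in light corresponds to an FFT-based matched filter; the sketch below shows that digital analogue. The paper's binary real-only matched filter is a specialization for the optical setup and is not reproduced here:

```python
import numpy as np

def correlation_peak(image, reference):
    """Peak of the circular cross-correlation, computed via the FFT."""
    spec = np.fft.fft2(image) * np.conj(np.fft.fft2(reference))
    return np.fft.ifft2(spec).real.max()

rng = np.random.default_rng(0)
ref = rng.random((64, 64))
same = np.roll(ref, (3, 5), axis=(0, 1))     # shifted copy of the "face"
other = rng.random((64, 64))                 # an impostor
print(correlation_peak(same, ref) > correlation_peak(other, ref))  # True
```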
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
A Massively Parallel Code for Polarization Calculations
NASA Astrophysics Data System (ADS)
Akiyama, Shizuka; Höflich, Peter
2001-03-01
We present an implementation of our Monte-Carlo radiation transport method for rapidly expanding, NLTE atmospheres for massively parallel computers which utilizes both the distributed and shared memory models. This allows us to take full advantage of the fast communication and low latency inherent to nodes with multiple CPUs, and to stretch the limits of scalability with the number of nodes compared to a version which is based on the shared memory model. Test calculations on a local 20-node Beowulf cluster with dual CPUs showed an improved scalability by about 40%.
A fast ultrasonic simulation tool based on massively parallel implementations
NASA Astrophysics Data System (ADS)
Lambert, Jason; Rougeron, Gilles; Lacassagne, Lionel; Chatillon, Sylvain
2014-02-01
This paper presents a CIVA-optimized ultrasonic inspection simulation tool which takes benefit of the power of massively parallel architectures: graphics processing units (GPU) and multi-core general purpose processors (GPP). This tool is based on the classical approach used in CIVA: the interaction model is based on Kirchhoff, and the ultrasonic field around the defect is computed by the pencil method. The model has been adapted and parallelized for both architectures. At this stage, the configurations addressed by the tool are: multi- and mono-element probes, planar specimens made of simple isotropic materials, and planar rectangular defects or side-drilled holes of small diameter. Validations of the model accuracy and performance measurements are presented.
Ordered fast Fourier transforms on a massively parallel hypercube multiprocessor
NASA Technical Reports Server (NTRS)
Tong, Charles; Swarztrauber, Paul N.
1991-01-01
The present evaluation of alternative, massively parallel hypercube processor-applicable designs for ordered radix-2 decimation-in-frequency FFT algorithms gives attention to reducing the communication that dominates computation time. A combination of the order and computational phases of the FFT is accordingly employed, in conjunction with sequence-to-processor maps which reduce communication. Two orderings, 'standard' and 'cyclic', in which the order of the transform is the same as that of the input sequence, can be implemented with ease on the Connection Machine (where orderings are determined by geometries and priorities). A parallel method for trigonometric coefficient computation is presented which does not employ trigonometric functions or interprocessor communication.
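A short sketch of why ordering matters: a radix-2 decimation-in-frequency FFT leaves its output in bit-reversed order, so an "ordered" FFT must fold a permutation (done explicitly at the end here) into the computation and, on a hypercube, into the sequence-to-processor map:

```python
import cmath
import numpy as np

def dif_fft_ordered(x):
    """Radix-2 DIF FFT followed by an explicit bit-reversal reorder."""
    x, n = list(x), len(x)
    span = n
    while span > 1:                                   # log2(n) DIF stages
        half = span // 2
        for start in range(0, n, span):
            for i in range(half):
                w = cmath.exp(-2j * cmath.pi * i / span)
                a, b = x[start + i], x[start + i + half]
                x[start + i] = a + b
                x[start + i + half] = (a - b) * w     # twiddle the difference
        span = half
    bits = n.bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]

data = [1, 2, 3, 4, 0, 0, 0, 0]
assert np.allclose(dif_fft_ordered(data), np.fft.fft(data))
```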
Heterogeneous computing architecture for fast detection of SNP-SNP interactions.
Sluga, Davor; Curk, Tomaz; Zupan, Blaz; Lotric, Uros
2014-06-25
The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, the MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. The GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems.
Plasma Physics Calculations on a Parallel Macintosh Cluster
NASA Astrophysics Data System (ADS)
Decyk, Viktor; Dauger, Dean; Kokelaar, Pieter
2000-03-01
We have constructed a parallel cluster consisting of 16 Apple Macintosh G3 computers running the MacOS, and achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of 50-150 MFlops/node is possible, depending on the problem. This is fast enough that 3D calculations can be routinely done. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. Full details are available on our web site: http://exodus.physics.ucla.edu/appleseed/.
Large-Constraint-Length, Fast Viterbi Decoder
NASA Technical Reports Server (NTRS)
Collins, O.; Dolinar, S.; Hsu, In-Shek; Pollara, F.; Olson, E.; Statman, J.; Zimmerman, G.
1990-01-01
Scheme for efficient interconnection makes VLSI design feasible. Concept for fast Viterbi decoder provides for processing of convolutional codes of constraint length K up to 15 and rates of 1/2 to 1/6. Fully parallel (but bit-serial) architecture developed for decoder of K = 7 implemented in single dedicated VLSI circuit chip. Contains six major functional blocks. VLSI circuits perform branch metric computations, add-compare-select operations, and then store decisions in traceback memory. Traceback processor reads appropriate memory locations and puts out decoded bits. Used as building block for decoders of larger K.
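A plain software reference of the add-compare-select recursion and survivor bookkeeping that the chip parallelizes across all 2^(K-1) states follows; K = 3 and the standard (7,5) code are chosen for brevity, and this version uses register-exchange survivor lists rather than the chip's traceback memory:

```python
import itertools

G = (0b111, 0b101)                 # rate-1/2, K=3 generators (7,5 octal)
N_STATES = 4                       # 2^(K-1) trellis states

def encode(bits):
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state     # newest bit plus two memory bits
        out += [bin(reg & g).count("1") & 1 for g in G]
        state = reg >> 1
    return out

def viterbi_decode(symbols, n_bits):
    INF = float("inf")
    metric = [0] + [INF] * (N_STATES - 1)          # start in state 0
    paths = [[] for _ in range(N_STATES)]
    for t in range(n_bits):
        rx = symbols[2 * t: 2 * t + 2]
        new_metric = [INF] * N_STATES
        new_paths = [None] * N_STATES
        for state, b in itertools.product(range(N_STATES), (0, 1)):
            reg = (b << 2) | state
            nxt = reg >> 1
            expect = [bin(reg & g).count("1") & 1 for g in G]
            bm = sum(e != r for e, r in zip(expect, rx))   # branch metric
            m = metric[state] + bm                          # add
            if m < new_metric[nxt]:                         # compare
                new_metric[nxt] = m                         # select
                new_paths[nxt] = paths[state] + [b]
        metric, paths = new_metric, new_paths
    return paths[min(range(N_STATES), key=metric.__getitem__)]

bits = [1, 0, 1, 1, 0, 0, 1, 0]
noisy = encode(bits)
noisy[3] ^= 1                       # flip one channel bit
assert viterbi_decode(noisy, len(bits)) == bits
```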
Parallel community climate model: Description and user's guide
DOE Office of Scientific and Technical Information (OSTI.GOV)
Drake, J.B.; Flanery, R.E.; Semeraro, B.D.
This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user's guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.
Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D
2002-08-07
Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C. Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for our dedicated cluster (a shared-memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes), reducing the time for one core iteration from over 7 h to about 35 min.
Low complexity 1D IDCT for 16-bit parallel architectures
NASA Astrophysics Data System (ADS)
Bivolarski, Lazar
2007-09-01
This paper shows that using the Loeffler, Ligtenberg, and Moschytz factorization of the 8-point IDCT [2] one-dimensional (1-D) algorithm as a fast approximation of the Discrete Cosine Transform (DCT), and using only 16-bit numbers, it is possible to create an IEEE 1180-1990 compliant, multiplierless algorithm with low computational complexity. This algorithm, owing to its structure, is efficiently implemented on parallel high-performance architectures, and due to its low complexity it is also suitable for a wide range of other architectures. An additional constraint on this work was the requirement of compliance with the existing MPEG standards. Hardware implementation complexity and low resource use were also part of the design criteria for this algorithm. This implementation is also compliant with the precision requirements described in the MPEG IDCT precision specification ISO/IEC 23002-1. Complexity analysis is performed as an extension to the simple measure of shifts and adds for the multiplierless algorithm, as additional operations are included in the complexity measure to better describe the actual transform implementation complexity.
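The multiplierless idea can be illustrated in a few lines: a fixed-point constant multiply is replaced by shifts and adds, as in Loeffler-style butterflies. The constant below (cos(pi/4) in Q15) is an example, not the paper's exact factorization:

```python
SQRT2_Q15 = 23170          # round(cos(pi/4) * 2**15), a typical IDCT constant

def mul_const_q15(x):
    """x * 23170 by shifts and adds: 23170 = 2^14+2^12+2^11+2^9+2^7+2^1."""
    return ((x << 14) + (x << 12) + (x << 11) +
            (x << 9) + (x << 7) + (x << 1))

x = 1000
assert mul_const_q15(x) == x * SQRT2_Q15
print((mul_const_q15(x) + (1 << 14)) >> 15)   # rounded Q15 product: 707
```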
Notes on implementation of sparsely distributed memory
NASA Technical Reports Server (NTRS)
Keeler, J. D.; Denning, P. J.
1986-01-01
The Sparsely Distributed Memory (SDM) developed by Kanerva is an unconventional memory design with very interesting and desirable properties. The memory works in a manner that is closely related to modern theories of human memory. The SDM model is discussed in terms of its implementation in hardware. Two appendices discuss the unconventional approaches of the SDM: Appendix A treats a resistive circuit for fast, parallel address decoding; and Appendix B treats a systolic array for high throughput read and write operations.
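A toy software rendering of the SDM read/write cycle that the appendices implement in hardware (Hamming-distance address decoding, counter update on write, thresholded summation on read); the sizes and activation radius below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, RADIUS = 256, 1000, 111            # word size, locations, radius
hard = rng.integers(0, 2, (M, N))        # fixed random hard addresses
counters = np.zeros((M, N), dtype=int)

def active(addr):
    """Locations within Hamming RADIUS of addr (the parallel decode)."""
    return np.count_nonzero(hard != addr, axis=1) <= RADIUS

def write(addr, data):
    counters[active(addr)] += 2 * data - 1   # +1 for 1-bits, -1 for 0-bits

def read(addr):
    return (counters[active(addr)].sum(axis=0) > 0).astype(int)

word = rng.integers(0, 2, N)
write(word, word)                        # autoassociative store
noisy = word.copy()
noisy[:25] ^= 1                          # corrupt ~10% of the address bits
print(np.mean(read(noisy) == word))      # recalls most (often all) bits
```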
Liu, Peilu; Li, Xinghua; Li, Haopeng; Su, Zhikun; Zhang, Hongxu
2017-01-01
In order to improve the accuracy of ultrasonic phased array focusing time delay, analyzing the original interpolation Cascade-Integrator-Comb (CIC) filter, an 8× interpolation CIC filter parallel algorithm was proposed, so that interpolation and multichannel decomposition can be processed simultaneously. Moreover, we summarized the general formula of the arbitrary-multiple interpolation CIC filter parallel algorithm and established an ultrasonic phased array focusing time delay system based on the 8× interpolation CIC filter parallel algorithm. By improving the algorithmic structure, 12.5% of additions and 29.2% of multiplications were eliminated, while the speed of computation remains very fast. Considering the existing problems of the CIC filter, we compensated the CIC filter; the compensated CIC filter's pass band is flatter, the transition band becomes steeper, and the stop band attenuation increases. Finally, we verified the feasibility of this algorithm on a Field Programmable Gate Array (FPGA). With a 125 MHz system clock, after 8× interpolation filtering and decomposition, the time delay accuracy of the defect echo becomes 1 ns. Simulation and experimental results both show that the proposed algorithm is highly feasible. Because of its fast calculation, small computational load and high resolution, this algorithm is especially suitable for applications demanding high time delay accuracy and fast detection. PMID:29023385
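A reference sketch of the interpolate-by-R CIC structure this work parallelizes (comb sections at the low rate, zero-stuffing, integrator sections at the high rate); the paper's contribution is a polyphase-parallel reformulation of this structure, which the sketch does not show:

```python
import numpy as np

def cic_interpolate(x, r=8, stages=3):
    """Interpolate-by-r CIC: combs at low rate, zero-stuff, integrators."""
    y = np.asarray(x, dtype=float)
    for _ in range(stages):                     # comb sections: y[n]-y[n-1]
        y = y - np.concatenate(([0.0], y[:-1]))
    up = np.zeros(len(y) * r)
    up[::r] = y                                 # zero-stuff to the high rate
    for _ in range(stages):                     # integrator sections
        up = np.cumsum(up)
    return up / r ** (stages - 1)               # normalize DC gain r^(N-1)

t = np.arange(32)
x = np.sin(2 * np.pi * t / 16)
y = cic_interpolate(x)
print(len(y), round(float(y[100]), 3))          # 256 smoothed high-rate samples
```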
FastQuery: A Parallel Indexing System for Scientific Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chou, Jerry; Wu, Kesheng; Prabhat,
2011-07-29
Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies such as FastBit can significantly improve accesses to these datasets by augmenting the user data with indexes and other secondary information. However, a challenge is that the indexes assume the relational data model but the scientific data generally follows the array data model. To match the two data models, we design a generic mapping mechanism and implement an efficient input and output interface for reading and writing the data and their corresponding indexes. To take advantage of the emerging many-core architectures, we also develop a parallel strategy for indexing using threading technology. This approach complements our on-going MPI-based parallelization efforts. We demonstrate the flexibility of our software by applying it to two of the most commonly used scientific data formats, HDF5 and NetCDF. We present two case studies using data from a particle accelerator model and a global climate model. We also conducted a detailed performance study using these scientific datasets. The results show that FastQuery speeds up the query time by a factor of 2.5x to 50x, and it reduces the indexing time by a factor of 16 on 24 cores.
Wang, Zihao; Chen, Yu; Zhang, Jingrong; Li, Lun; Wan, Xiaohua; Liu, Zhiyong; Sun, Fei; Zhang, Fa
2018-03-01
Electron tomography (ET) is an important technique for studying the three-dimensional structures of the biological ultrastructure. Recently, ET has reached sub-nanometer resolution for investigating the native and conformational dynamics of macromolecular complexes by combining with the sub-tomogram averaging approach. Due to the limited sampling angles, ET reconstruction typically suffers from the "missing wedge" problem. Using a validation procedure, iterative compressed-sensing optimized nonuniform fast Fourier transform (NUFFT) reconstruction (ICON) demonstrates its power in restoring validated missing information for a low-signal-to-noise ratio biological ET dataset. However, the huge computational demand has become a bottleneck for the application of ICON. In this work, we implemented ICON-MIC, a parallel acceleration of ICON on many integrated core (MIC) Xeon Phi cards, to address its huge computational demand. During this step, we parallelize the element-wise matrix operations and use the efficient summation of a matrix to reduce the cost of matrix computation. We also developed parallel versions of NUFFT on MIC to achieve a high acceleration of ICON by using more efficient fast Fourier transform (FFT) calculation. We then proposed a hybrid task allocation strategy (two-level load balancing) to improve the overall performance of ICON-MIC by making full use of the idle resources on the Tianhe-2 supercomputer. Experimental results using two different datasets show that ICON-MIC has high accuracy in biological specimens under different noise levels and a significant acceleration, up to 13.3×, compared with the CPU version. Further, ICON-MIC has good scalability efficiency and overall performance on the Tianhe-2 supercomputer.
An implementation of a tree code on a SIMD, parallel computer
NASA Technical Reports Server (NTRS)
Olson, Kevin M.; Dorband, John E.
1994-01-01
We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally interacting disk galaxies using 65,536 particles. We also simulate the formation of structure in an expanding, model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 makes them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
Design of Belief Propagation Based on FPGA for the Multistereo CAFADIS Camera
Magdaleno, Eduardo; Lüke, Jonás Philipp; Rodríguez, Manuel; Rodríguez-Ramos, José Manuel
2010-01-01
In this paper we describe a fast, specialized hardware implementation of the belief propagation algorithm for the CAFADIS camera, a new plenoptic sensor patented by the University of La Laguna. This camera captures the lightfield of the scene and can be used to find out at which depth each pixel is in focus. The algorithm has been designed for FPGA devices using VHDL. We propose a parallel and pipeline architecture to implement the algorithm without external memory. Although the BRAM resources of the device increase considerably, we can maintain real-time restrictions by using extremely high-performance signal processing capability through parallelism and by accessing several memories simultaneously. The quantifying results with 16 bit precision have shown that performances are really close to the original Matlab programmed algorithm. PMID:22163404
fastBMA: scalable network inference and transitive reduction.
Hung, Ling-Hong; Shi, Kaiyuan; Wu, Migao; Young, William Chad; Raftery, Adrian E; Yeung, Ka Yee
2017-10-01
Inferring genetic networks from genome-wide expression data is extremely demanding computationally. We have developed fastBMA, a distributed, parallel, and scalable implementation of Bayesian model averaging (BMA) for this purpose. fastBMA also includes a computationally efficient module for eliminating redundant indirect edges in the network by mapping the transitive reduction to an easily solved shortest-path problem. We evaluated the performance of fastBMA on synthetic data and experimental genome-wide time series yeast and human datasets. When using a single CPU core, fastBMA is up to 100 times faster than the next fastest method, LASSO, with increased accuracy. It is a memory-efficient, parallel, and distributed application that scales to human genome-wide expression data. A 10,000-gene regulation network can be obtained in a matter of hours using a 32-core cloud cluster (2 nodes of 16 cores). fastBMA is a significant improvement over its predecessor ScanBMA. It is more accurate and orders of magnitude faster than other fast network inference methods such as the one based on LASSO. The improved scalability allows it to calculate networks from genome scale data in a reasonable time frame. The transitive reduction method can improve accuracy in denser networks. fastBMA is available as code (M.I.T. license) from GitHub (https://github.com/lhhunghimself/fastBMA), as part of the updated networkBMA Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/networkBMA.html) and as ready-to-deploy Docker images (https://hub.docker.com/r/biodepot/fastbma/). © The Authors 2017. Published by Oxford University Press.
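A small sketch of the mapping described: an edge u->v is redundant (indirect) if some other path from u to v is at least as short in the chosen edge cost. Plain Floyd-Warshall stands in here for fastBMA's optimized shortest-path machinery, and the cost values are illustrative:

```python
import math

def transitive_reduce(edges, n):
    """edges: dict {(u, v): cost}. Returns the edges kept as direct."""
    d = [[math.inf] * n for _ in range(n)]
    for (u, v), w in edges.items():
        d[u][v] = min(d[u][v], w)
    for k in range(n):                      # all-pairs shortest paths
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # keep an edge only if no indirect route is at least as short
    return {e: w for e, w in edges.items()
            if not any(d[e[0]][k] + d[k][e[1]] <= w
                       for k in range(n) if k not in e)}

edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.5}
print(transitive_reduce(edges, 3))          # drops the indirect (0, 2) edge
```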
A fast parallel clustering algorithm for molecular simulation trajectories.
Zhao, Yutong; Sheong, Fu Kit; Sun, Jian; Sander, Pedro; Huang, Xuhui
2013-01-15
We implemented a GPU-powered parallel k-centers algorithm to perform clustering on the conformations of molecular dynamics (MD) simulations. The algorithm is up to two orders of magnitude faster than the CPU implementation. We tested our algorithm on four protein MD simulation datasets ranging from the small Alanine Dipeptide to a 370-residue Maltose Binding Protein (MBP). It is capable of grouping 250,000 conformations of the MBP into 4000 clusters within 40 seconds. To achieve this, we effectively parallelized the code on the GPU and utilize the triangle inequality of metric spaces. Furthermore, the algorithm's running time is linear with respect to the number of cluster centers. In addition, we found the triangle inequality to be less effective in higher dimensions and provide a mathematical rationale. Finally, using Alanine Dipeptide as an example, we show a strong correlation between cluster populations resulting from the k-centers algorithm and the underlying density. © 2012 Wiley Periodicals, Inc.
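A CPU reference of the k-centers idea the GPU code parallelizes (Gonzalez-style greedy: the next center is the point farthest from all chosen centers); the paper's version additionally exploits the triangle inequality to skip distance evaluations, which this sketch omits:

```python
import numpy as np

def k_centers(points, k):
    """Greedy (Gonzalez) k-centers: next center = farthest point so far."""
    centers = [0]                                # arbitrary first center
    dist = np.linalg.norm(points - points[0], axis=1)
    labels = np.zeros(len(points), dtype=int)
    for c in range(1, k):
        nxt = int(np.argmax(dist))               # farthest from all centers
        centers.append(nxt)
        d_new = np.linalg.norm(points - points[nxt], axis=1)
        closer = d_new < dist
        labels[closer], dist[closer] = c, d_new[closer]
    return centers, labels

rng = np.random.default_rng(0)
pts = rng.normal(size=(10000, 3))                # stand-in "conformations"
centers, labels = k_centers(pts, 40)
print(len(centers), len(set(labels.tolist())))   # 40 centers, 40 clusters
```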
A Fast parallel tridiagonal algorithm for a class of CFD applications
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Sun, Xian-He
1996-01-01
The parallel diagonal dominant (PDD) algorithm is an efficient tridiagonal solver. This paper studies a variation of the PDD algorithm, the reduced PDD algorithm. The new algorithm maintains the minimum communication provided by the PDD algorithm, but has a reduced operation count. The PDD algorithm also has a smaller operation count than the conventional sequential algorithm for many applications. An accuracy analysis is provided for the reduced PDD algorithm for symmetric Toeplitz tridiagonal (STT) systems. Implementation results on Langley's Intel Paragon and IBM SP2 show that both the PDD and reduced PDD algorithms are efficient and scalable.
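For contrast, here is the conventional sequential tridiagonal (Thomas) algorithm that such parallel schemes compete against; PDD itself splits the system into per-processor blocks and patches the interfaces with a small reduced system, which this reference version does not show:

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub, b = diag, c = super, d = rhs."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                    # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# a symmetric Toeplitz example like the STT systems analyzed in the paper
n = 6
x = thomas([1.0] * n, [4.0] * n, [1.0] * n, [6.0] * n)
print([round(v, 3) for v in x])
```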
Hardware-efficient implementation of digital FIR filter using fast first-order moment algorithm
NASA Astrophysics Data System (ADS)
Cao, Li; Liu, Jianguo; Xiong, Jun; Zhang, Jing
2018-03-01
As the digital finite impulse response (FIR) filter can be transformed into the shift-add form of multiple small-sized first-order moments, based on the existing fast first-order moment algorithm, this paper presents a novel multiplier-less structure to calculate any number of sequential filtering results in parallel. The theoretical analysis of its hardware and time complexities reveals that by appropriately setting the degree of parallelism and the decomposition factor of a fixed word width, the proposed structure may achieve better area-time efficiency than the existing two-dimensional (2-D) memoryless-based filter. To evaluate the performance concretely, the proposed designs for different tap counts, along with the existing 2-D memoryless-based filters, are synthesized by Synopsys Design Compiler with a 0.18-μm SMIC library. The comparisons show that the proposed design has less area-time complexity and power consumption when the number of filter taps is larger than 48.
Hardware Implementation of Lossless Adaptive and Scalable Hyperspectral Data Compression for Space
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh; Keymeulen, Didier; Bakhshi, Alireza; Klimesh, Matthew
2009-01-01
On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. The technique also improves signature extraction, object recognition and feature classification capabilities by providing exact reconstructed data on constrained downlink resources. At JPL a novel, adaptive and predictive technique for lossless compression of hyperspectral data was recently developed. This technique uses an adaptive filtering method and achieves a combination of low complexity and compression effectiveness that far exceeds state-of-the-art techniques currently in use. The JPL-developed 'Fast Lossless' algorithm requires no training data or other specific information about the nature of the spectral bands for a fixed instrument dynamic range. It is of low computational complexity and thus well-suited for implementation in hardware. A modified form of the algorithm that is better suited for data from pushbroom instruments is generally appropriate for flight implementation. A scalable field programmable gate array (FPGA) hardware implementation was developed. The FPGA implementation achieves a throughput performance of 58 Msamples/sec, which can be increased to over 100 Msamples/sec in a parallel implementation that uses twice the hardware resources. This paper describes the hardware implementation of the 'Modified Fast Lossless' compression algorithm on an FPGA. The FPGA implementation targets the current state-of-the-art FPGAs (Xilinx Virtex IV and V families) and compresses one sample every clock cycle to provide a fast and practical real-time solution for space applications.
Massively parallel implementation of 3D-RISM calculation with volumetric 3D-FFT.
Maruyama, Yutaka; Yoshida, Norio; Tadano, Hiroto; Takahashi, Daisuke; Sato, Mitsuhisa; Hirata, Fumio
2014-07-05
A new three-dimensional reference interaction site model (3D-RISM) program for massively parallel machines combined with the volumetric 3D fast Fourier transform (3D-FFT) was developed, and tested on the RIKEN K supercomputer. The ordinary parallel 3D-RISM program has a limitation on the number of parallelizations because of the limitations of the slab-type 3D-FFT. The volumetric 3D-FFT relieves this limitation drastically. We tested the 3D-RISM calculation on a large and fine calculation cell (2048^3 grid points) on 16,384 nodes, each having eight CPU cores. The new 3D-RISM program achieved excellent parallel scalability, running on the RIKEN K supercomputer. As a benchmark application, we employed the program, combined with molecular dynamics simulation, to analyze the oligomerization process of a chymotrypsin inhibitor 2 mutant. The results demonstrate that the massively parallel 3D-RISM program is effective for analyzing the hydration properties of large biomolecular systems. Copyright © 2014 Wiley Periodicals, Inc.
Combined algorithmic and GPU acceleration for ultra-fast circular conebeam backprojection
NASA Astrophysics Data System (ADS)
Brokish, Jeffrey; Sack, Paul; Bresler, Yoram
2010-04-01
In this paper, we describe the first implementation and performance of a fast O(N^3 log N) hierarchical backprojection algorithm for cone beam CT with a circular trajectory [1], developed on a modern Graphics Processing Unit (GPU). The resulting tomographic backprojection system for 3D cone beam geometry combines speedup through algorithmic improvements provided by the hierarchical backprojection algorithm with speedup from a massively parallel hardware accelerator. For data parameters typical in diagnostic CT and using a mid-range GPU card, we report reconstruction speeds of up to 360 frames per second, and relative speedup of almost 6x compared to conventional backprojection on the same hardware. The significance of these results is twofold. First, they demonstrate that the reduction in operation counts demonstrated previously for the FHBP algorithm can be translated to a comparable run-time improvement in a massively parallel hardware implementation, while preserving stringent diagnostic image quality. Second, the dramatic speedup and throughput numbers achieved indicate the feasibility of systems based on this technology, which achieve real-time 3D reconstruction for state-of-the-art diagnostic CT scanners with small footprint, high reliability, and affordable cost.
Dijkstra, Hildebrand; Dorrius, Monique D; Wielema, Mirjam; Pijnappel, Ruud M; Oudkerk, Matthijs; Sijens, Paul E
2016-12-01
To assess if specificity can be increased when semiautomated breast lesion analysis of quantitative diffusion-weighted imaging (DWI) is implemented after dynamic contrast-enhanced (DCE-) magnetic resonance imaging (MRI) in the workup of BI-RADS 3 and 4 breast lesions larger than 1 cm. In all, 120 consecutive patients (mean age, 48 years; age range, 23-75 years) with 139 breast lesions (≥1 cm) were examined (2010-2014) with 1.5T DCE-MRI and DWI (b = 0, 50, 200, 500, 800, 1000 s/mm^2) and the BI-RADS classification and histopathology were obtained. For each lesion malignancy was excluded using voxelwise semiautomated breast lesion analysis based on previously defined thresholds for the apparent diffusion coefficient (ADC) and the three intravoxel incoherent motion (IVIM) parameters: molecular diffusion (D_slow), microperfusion (D_fast), and the fraction of D_fast (f_fast). The sensitivity (Se), specificity (Sp), and negative predictive value (NPV) based on only the IVIM parameters combined in parallel (D_slow, D_fast, and f_fast), or the ADC, or the BI-RADS classification by DCE-MRI were compared. Subsequently, the Se, Sp, and NPV of the combination of the BI-RADS classification by DCE-MRI followed by the IVIM parameters in parallel (or the ADC) were compared. In all, 23 of 139 breast lesions were benign. Se and Sp of DCE-MRI were 100% and 30.4% (NPV = 100%). Se and Sp of the IVIM parameters in parallel were 92.2% and 52.2% (NPV = 57.1%) and for the ADC 95.7% and 17.4%, respectively (NPV = 44.4%). In all, 26 of 139 lesions were classified as BI-RADS 3 (n = 7) or BI-RADS 4 (n = 19). DCE-MRI combined with the ADC (Se = 99.1%, Sp = 34.8%) or IVIM (Se = 99.1%, Sp = 56.5%) did significantly improve (P = 0.016) the Sp of DCE-MRI alone for workup of BI-RADS 3 and 4 lesions (NPV = 92.9%). Quantitative DWI has a lower NPV compared to DCE-MRI for evaluation of breast lesions and may therefore not be able to replace DCE-MRI; when implemented after DCE-MRI as a problem solver for BI-RADS 3 and 4 lesions, the combined specificity improves significantly. J. Magn. Reson. Imaging 2016;44:1642-1649. © 2016 International Society for Magnetic Resonance in Medicine.
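As background for where these parameters come from, here is a hedged sketch of the common segmented IVIM fit (D_slow from a log-linear fit to the high-b points where microperfusion has decayed, then f from the zero-b intercept); this is the textbook recipe on simulated data, not necessarily the authors' exact voxelwise pipeline:

```python
import numpy as np

b = np.array([0, 50, 200, 500, 800, 1000], dtype=float)   # s/mm^2
d_slow, d_fast, f = 1.0e-3, 10e-3, 0.10                   # simulated truth
s = f * np.exp(-b * d_fast) + (1 - f) * np.exp(-b * d_slow)

hi = b >= 500                           # regime where microperfusion is gone
slope, intercept = np.polyfit(b[hi], np.log(s[hi]), 1)
d_slow_hat = -slope                     # mm^2/s
f_hat = 1 - np.exp(intercept)           # perfusion fraction from intercept
print(round(d_slow_hat * 1e3, 3), round(f_hat, 3))   # ~1.0 and ~0.10
```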
Glover, William A; Atienza, Ederlyn E; Nesbitt, Shannon; Kim, Woo J; Castor, Jared; Cook, Linda; Jerome, Keith R
2016-01-01
Quantitative DNA detection of cytomegalovirus (CMV) and BK virus (BKV) is critical in the management of transplant patients. Quantitative laboratory-developed procedures for CMV and BKV have been described in which much of the processing is automated, resulting in rapid, reproducible, and high-throughput testing of transplant patients. To increase the efficiency of such assays, the performance and stability of four commercial preassembled frozen fast qPCR master mixes (Roche FastStart Universal Probe Master Mix with Rox, Bio-Rad SsoFast Probes Supermix with Rox, Life Technologies TaqMan FastAdvanced Master Mix, and Life Technologies Fast Universal PCR Master Mix), in combination with in-house designed primers and probes, was evaluated using controls and standards from standard CMV and BK assays. A subsequent parallel evaluation using patient samples was performed comparing the performance of freshly prepared assay mixes versus aliquoted frozen master mixes made with two of the fast qPCR mixes (Life Technologies TaqMan FastAdvanced Master Mix, and Bio-Rad SsoFast Probes Supermix with Rox), chosen based on their performance and compatibility with existing PCR cycling conditions. The data demonstrate that the frozen master mixes retain excellent performance over a period of at least 10 weeks. During the parallel testing using clinical specimens, no difference in quantitative results was observed between the preassembled frozen master mixes and freshly prepared master mixes. Preassembled fast real-time qPCR frozen master mixes perform well and represent an additional strategy laboratories can implement to reduce assay preparation times, and to minimize technical errors and effort necessary to perform clinical PCR. © 2015 Wiley Periodicals, Inc.
GPU-based acceleration of computations in nonlinear finite element deformation analysis.
Mafi, Ramin; Sirouspour, Shahin
2014-03-01
The physics of deformation for biological soft-tissue is best described by nonlinear continuum mechanics-based models, which can then be discretized by the FEM for a numerical solution. However, the computational complexity of such models has limited their use in applications requiring real-time or fast response. In this work, we propose a graphics processing unit-based implementation of the FEM using implicit time integration for dynamic nonlinear deformation analysis. This is the most general formulation of the deformation analysis. It is valid for large deformations and strains and can account for material nonlinearities. The data-parallel nature and the intense arithmetic computations of nonlinear FEM equations make it particularly suitable for implementation on a parallel computing platform such as a graphics processing unit. In this work, we present and compare two different designs based on the matrix-free and conventional preconditioned conjugate gradients algorithms for solving the FEM equations arising in deformation analysis. The speedup achieved with the proposed parallel implementations of the algorithms will be instrumental in the development of advanced surgical simulators and medical image registration methods involving soft-tissue deformation. Copyright © 2013 John Wiley & Sons, Ltd.
Fast Face-Recognition Optical Parallel Correlator Using High Accuracy Correlation Filter
NASA Astrophysics Data System (ADS)
Watanabe, Eriko; Kodate, Kashiko
2005-11-01
We designed and fabricated a fully automatic fast face-recognition optical parallel correlator [E. Watanabe and K. Kodate: Appl. Opt. 44 (2005) 5666] based on the VanderLugt principle. The implementation of an as-yet unattained ultra-high-speed system was aided by reconfiguring the system to make it suitable for easier parallel processing, as well as by composing a higher-accuracy correlation filter and a high-speed ferroelectric liquid crystal spatial light modulator (FLC-SLM). In running trial experiments using this system (dubbed FARCO), we succeeded in acquiring remarkably low error rates of 1.3% for the false match rate (FMR) and 2.6% for the false non-match rate (FNMR). Given the results of our experiments, the aim of this paper is to examine methods of designing correlation filters and arranging database image arrays for even faster parallel correlation, underlining the issues of calculation technique, quantization bit rate, pixel size, and shift from the optical axis. The correlation filter has proved its excellent performance and higher precision than classical correlation and the joint transform correlator (JTC). Moreover, the arrangement of multi-object reference images leads to 10-channel correlation signals as sharply marked as those of a single channel. These experimental results demonstrate great potential for achieving a processing speed of 10,000 faces/s.
Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers
NASA Technical Reports Server (NTRS)
Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak
1996-01-01
This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamic and aeroacoustic properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades, and the Kirchhoff acoustics code computes the noise in the far field using the TURNS solution as input. The overall parallel strategy adds MPI message-passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP-2. The resultant far-field acoustic field is analyzed with state-of-the-art audio and video rendering of the propagating acoustic signals.
Hybrid massively parallel fast sweeping method for static Hamilton-Jacobi equations
NASA Astrophysics Data System (ADS)
Detrixhe, Miles; Gibou, Frédéric
2016-10-01
The fast sweeping method is a popular algorithm for solving a variety of static Hamilton-Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
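For readers unfamiliar with the serial kernel being parallelized, here is a minimal Python sketch of fast sweeping for the isotropic eikonal equation |grad T| = 1/speed on a 2-D grid: Gauss-Seidel passes in the four axis orderings, each applying the standard two-neighbor upwind update. The hybrid algorithm in the paper reorganizes exactly these sweeps across coarse- and fine-grained levels; the grid setup below is an illustrative assumption.

    import numpy as np

    def fast_sweep_eikonal(speed, h, sources, n_sweeps=8):
        # Gauss-Seidel sweeps in the four axis orderings; each point gets
        # the standard two-neighbor upwind update.
        ny, nx = speed.shape
        T = np.full((ny, nx), np.inf)
        for i, j in sources:
            T[i, j] = 0.0
        for s in range(n_sweeps):
            di, dj = [(1, 1), (1, -1), (-1, 1), (-1, -1)][s % 4]
            for i in range(ny)[::di]:
                for j in range(nx)[::dj]:
                    a = min(T[max(i-1, 0), j], T[min(i+1, ny-1), j])
                    b = min(T[i, max(j-1, 0)], T[i, min(j+1, nx-1)])
                    if np.isinf(a) and np.isinf(b):
                        continue
                    f = h / speed[i, j]              # local slowness times h
                    if abs(a - b) >= f:
                        t = min(a, b) + f
                    else:
                        t = 0.5 * (a + b + np.sqrt(2*f*f - (a - b)**2))
                    T[i, j] = min(T[i, j], t)
        return T

    # Example: distance field from the grid center on a uniform-speed grid.
    T = fast_sweep_eikonal(np.ones((64, 64)), 1.0, [(32, 32)])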
A Fast Iterative Method for Solving the Eikonal Equation on Triangulated Surfaces
Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M.; Whitaker, Ross T.
2012-01-01
This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton–Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512–2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers.
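A compact Python sketch of the active-list idea at the heart of the FIM, under the simplifying assumption of a regular 2-D grid (the paper's contribution is the extension to triangulated surfaces, whose local solver is more involved). Nodes on the list are updated, retired once their value stops changing, and retired nodes re-activate any neighbor whose value they can still improve; all updates within one pass are mutually independent, which is what maps onto SIMD hardware.

    import numpy as np

    def fim_eikonal(speed, h, sources, tol=1e-9):
        ny, nx = speed.shape
        T = np.full((ny, nx), np.inf)
        for s in sources:
            T[s] = 0.0
        nbrs = lambda i, j: [(i+d, j+e) for d, e in ((-1,0),(1,0),(0,-1),(0,1))
                             if 0 <= i+d < ny and 0 <= j+e < nx]
        def local_update(i, j):
            # Two-neighbor upwind update; inf-valued neighbors are ignored.
            a = min([T[n] for n in nbrs(i, j) if n[0] != i] + [np.inf])
            b = min([T[n] for n in nbrs(i, j) if n[1] != j] + [np.inf])
            if np.isinf(min(a, b)):
                return T[i, j]
            f = h / speed[i, j]
            if abs(a - b) >= f:
                return min(a, b) + f
            return 0.5 * (a + b + np.sqrt(2*f*f - (a - b)**2))
        active = {n for s in sources for n in nbrs(*s)}
        while active:
            nxt = set()
            for (i, j) in active:
                t = local_update(i, j)
                if abs(T[i, j] - t) > tol:        # still changing: stay active
                    T[i, j] = t
                    nxt.add((i, j))
                else:                             # converged: wake neighbors
                    nxt |= {n for n in nbrs(i, j) if local_update(*n) < T[n]}
            active = nxt
        return T

    # T = fim_eikonal(np.ones((64, 64)), 1.0, [(0, 0)])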
Parallel Fast Multipole Method For Molecular Dynamics
2007-06-01
Parallel Fast Multipole Method for Molecular Dynamics. Thesis, Reid G. Ormseth, Captain, USAF, AFIT/GAP/ENP/07-J02, Department of the Air Force, Air Force Institute of Technology. Background material is drawn from 'The Art of Molecular Dynamics Simulation' by Dennis Rapaport, described as the clearest treatment of the Fast Multipole Method.
A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction
Kumar, B.; Huang, C.-H.; Sadayappan, P.; ...
1995-01-01
In this article, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
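For reference, a minimal recursive Python sketch of the Strassen scheme that the tensor-product formulation unrolls into nonrecursive, BLAS-friendly loops: seven half-size products instead of eight. The power-of-two size restriction and the cutoff value are illustrative assumptions.

    import numpy as np

    def strassen(A, B, cutoff=64):
        # A, B must be 2^n x 2^n. Below the cutoff, fall back to the
        # ordinary product.
        n = A.shape[0]
        if n <= cutoff:
            return A @ B
        m = n // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:m, :m] = M1 + M4 - M5 + M7
        C[:m, m:] = M3 + M5
        C[m:, :m] = M2 + M4
        C[m:, m:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(256, 256); B = np.random.rand(256, 256)
    assert np.allclose(strassen(A, B), A @ B)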
Symplectic molecular dynamics simulations on specially designed parallel computers.
Borstnik, Urban; Janezic, Dusanka
2005-01-01
We have developed a computer program for molecular dynamics (MD) simulation that implements the Split Integration Symplectic Method (SISM) and is designed to run on specialized parallel computers. The MD integration is performed by the SISM, which analytically treats high-frequency vibrational motion and thus enables the use of longer simulation time steps. The low-frequency motion is treated numerically on specially designed parallel computers, which decreases the computational time of each simulation time step. The combination of these approaches means that fewer, cheaper time steps are needed, which enables fast MD simulations. We study the computational performance of MD simulation of molecular systems on specialized computers and provide a comparison to standard personal computers. The combination of the SISM with two specialized parallel computers is an effective way to increase the speed of MD simulations up to 16-fold over a single PC processor.
Multi-jagged: A scalable parallel spatial partitioning algorithm
Deveci, Mehmet; Rajamanickam, Sivasankaran; Devine, Karen D.; ...
2015-03-18
Geometric partitioning is fast and effective for load-balancing dynamic applications, particularly those requiring geometric locality of data (particle methods, crash simulations). We present, to our knowledge, the first parallel implementation of a multidimensional-jagged geometric partitioner. In contrast to the traditional recursive coordinate bisection algorithm (RCB), which recursively bisects subdomains perpendicular to their longest dimension until the desired number of parts is obtained, our algorithm does recursive multi-section with a given number of parts in each dimension. By computing multiple cut lines concurrently and intelligently deciding when to migrate data while computing the partition, we minimize data movement compared to efficient implementations of recursive bisection. We demonstrate the algorithm's scalability and quality relative to the RCB implementation in Zoltan on both real and synthetic datasets. Our experiments show that the proposed algorithm performs and scales better than RCB in terms of run-time without degrading the load balance. Lastly, our implementation partitions 24 billion points into 65,536 parts within a few seconds and exhibits near perfect weak scaling up to 6K cores.
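A toy serial Python sketch of one level of the multi-section idea: cut the x-axis into several slabs at once using quantiles, then cut each slab independently along y. This is only the geometric skeleton; the paper's contribution is the parallel, weighted cut-line computation and the data-migration policy, none of which appear here. Function and parameter names are illustrative.

    import numpy as np

    def multijagged_2d(points, parts_x, parts_y):
        # Multi-section instead of recursive bisection: parts_x cuts along
        # x, then parts_y cuts along y within each slab. Assumes enough
        # points that no slab is empty. Returns a part id per point.
        x, y = points[:, 0], points[:, 1]
        x_cuts = np.quantile(x, np.linspace(0, 1, parts_x + 1)[1:-1])
        slab = np.searchsorted(x_cuts, x)
        part = np.empty(len(points), dtype=int)
        for s in range(parts_x):
            mask = slab == s
            y_cuts = np.quantile(y[mask], np.linspace(0, 1, parts_y + 1)[1:-1])
            part[mask] = s * parts_y + np.searchsorted(y_cuts, y[mask])
        return part

    part = multijagged_2d(np.random.rand(10000, 2), 4, 4)  # 16 balanced parts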
Zhou, Lili; Clifford Chao, K S; Chang, Jenghwa
2012-11-01
Simulated projection images of digital phantoms constructed from CT scans have been widely used for clinical and research applications, but their quality and computation speed are not optimal for real-time comparison with radiographs acquired with an x-ray source of different energies. In this paper, the authors performed polyenergetic forward projections using the Open Computing Language (OpenCL) in a parallel computing ecosystem consisting of a CPU and a general-purpose graphics processing unit (GPGPU) for fast and realistic image formation. The proposed polyenergetic forward projection uses a lookup table containing the NIST-published mass attenuation coefficients (μ∕ρ) for different tissue types and photon energies ranging from 1 keV to 20 MeV. The CT images of the sites of interest are first segmented into different tissue types based on the CT numbers and converted to a three-dimensional attenuation phantom by linking each voxel to the corresponding tissue type in the lookup table. The x-ray source can be a radioisotope or an x-ray generator with a known spectrum described as weight w(n) for energy bin E(n). The Siddon method is used to compute the x-ray transmission line integral for E(n), and the x-ray fluence is the weighted sum of the exponentials of the line integrals for all energy bins, with added Poisson noise. To validate this method, a digital head-and-neck phantom constructed from the CT scan of a Rando head phantom was segmented into three regions (air, gray∕white matter, and bone) for calculating the polyenergetic projection images for the Mohan 4 MV energy spectrum. To accelerate the calculation, the authors partitioned the workloads using task parallelism and data parallelism and scheduled them in a parallel computing ecosystem consisting of a CPU and a GPGPU (NVIDIA Tesla C2050) using OpenCL only. The authors explored a task-overlapping strategy and a sequential method for generating the first and subsequent digitally reconstructed radiographs (DRRs). A dispatcher was designed to drive the high degree of parallelism of the task-overlapping strategy. Numerical experiments were conducted to compare the performance of the OpenCL∕GPGPU-based implementation with the CPU-based implementation. The projection images were similar to typical portal images obtained with a 4 or 6 MV x-ray source. For a phantom size of 512 × 512 × 223, the time for calculating the line integrals for a 512 × 512 image panel was 16.2 ms on the GPGPU for one energy bin, in comparison to 8.83 s on the CPU. The total computation time for generating one polyenergetic projection image of 512 × 512 was 0.3 s (141 s for the CPU). The relative difference between the projection images obtained with the CPU-based and OpenCL∕GPGPU-based implementations was on the order of 10^-6, and the images were virtually indistinguishable. The task-overlapping strategy was 5.84 and 1.16 times faster than the sequential method for the first and subsequent DRRs, respectively. The authors have successfully built digital phantoms using anatomic CT images and NIST μ∕ρ tables for simulating realistic polyenergetic projection images and optimized the processing speed with parallel computing using a GPGPU∕OpenCL-based implementation. The computation time is fast enough (0.3 s per projection image) for real-time IGRT (image-guided radiotherapy) applications.
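A minimal NumPy sketch of the fluence model stated above for a single detector ray: the spectrum-weighted sum of exponentials of per-energy-bin line integrals, with Poisson counting noise. The Siddon ray traversal that produces the per-tissue path lengths is not shown, and the incident photon count n0 is a hypothetical parameter.

    import numpy as np

    rng = np.random.default_rng(0)

    def polyenergetic_fluence(path_lengths, mu_table, weights, n0=1e5):
        # path_lengths[t]: intersection length of the ray with tissue type t
        # (from a Siddon-style traversal, not shown).
        # mu_table[t, n]: linear attenuation of tissue t at energy bin n.
        # weights[n]: source spectrum weight w(n) for energy bin E(n).
        line_integrals = path_lengths @ mu_table     # one value per energy bin
        expected = n0 * np.sum(weights * np.exp(-line_integrals))
        return rng.poisson(expected)                 # Poisson counting noise

    # Toy usage: 2 tissue types, 3 energy bins.
    mu = np.array([[0.2, 0.15, 0.1],    # tissue 0 attenuation per bin (1/cm)
                   [0.5, 0.40, 0.3]])   # tissue 1
    counts = polyenergetic_fluence(np.array([4.0, 1.0]), mu,
                                   np.array([0.2, 0.5, 0.3]))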
A transient-enhanced NMOS low dropout voltage regulator with parallel feedback compensation
NASA Astrophysics Data System (ADS)
Han, Wang; Lin, Tan
2016-02-01
This paper presents a transient-enhanced NMOS low-dropout regulator (LDO) for portable applications with parallel feedback compensation. The parallel feedback structure adds a dynamic zero to obtain an adequate phase margin over a load-current variation from 0 to 1 A. A class-AB error amplifier and a fast charging/discharging unit are adopted to enhance the transient performance. The proposed LDO has been implemented in a 0.35 μm BCD process. From experimental results, the regulator can operate with a minimum dropout voltage of 150 mV at a maximum 1 A load and a quiescent current (I_Q) of 165 μA. Under a full-range load-current step, the voltage undershoot and overshoot of the proposed LDO are reduced to 38 mV and 27 mV, respectively.
Crustal origin of trench-parallel shear-wave fast polarizations in the Central Andes
NASA Astrophysics Data System (ADS)
Wölbern, I.; Löbl, U.; Rümpker, G.
2014-04-01
In this study, SKS and local S phases are analyzed to investigate variations of shear-wave splitting parameters along two dense seismic profiles across the central Andean Altiplano and Puna plateaus. In contrast to previous observations, the vast majority of the measurements reveal fast polarizations sub-parallel to the subduction direction of the Nazca plate with delay times between 0.3 and 1.2 s. Local phases show larger variations of fast polarizations and exhibit delay times ranging between 0.1 and 1.1 s. Two 70 km and 100 km wide sections along the Altiplano profile exhibit larger delay times and are characterized by fast polarizations oriented sub-parallel to major fault zones. Based on finite-difference wavefield calculations for anisotropic subduction zone models we demonstrate that the observations are best explained by fossil slab anisotropy with fast symmetry axes oriented sub-parallel to the slab movement in combination with a significant component of crustal anisotropy of nearly trench-parallel fast-axis orientation. From the modeling we exclude a sub-lithospheric origin of the observed strong anomalies due to the short-scale variations of the fast polarizations. Instead, our results indicate that anisotropy in the Central Andes generally reflects the direction of plate motion while the observed trench-parallel fast polarizations likely originate in the continental crust above the subducting slab.
Parallel VLSI architecture emulation and the organization of APSA/MPP
NASA Technical Reports Server (NTRS)
Odonnell, John T.
1987-01-01
The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a VAX. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.
Hossain, Md Selim; Saeedi, Ehsan; Kong, Yinan
2017-01-01
In this paper, we propose a novel parallel architecture for fast hardware implementation of elliptic curve point multiplication (ECPM), which is the key operation of an elliptic curve cryptography processor. The point multiplication over binary fields is synthesized on both FPGA and ASIC technology by designing fast elliptic curve group operations in Jacobian projective coordinates. A novel combined point doubling and point addition (PDPA) architecture is proposed for group operations to achieve high speed and low hardware requirements for ECPM. It has been implemented over the binary field which is recommended by the National Institute of Standards and Technology (NIST). The proposed ECPM supports two Koblitz and random curves for the key sizes 233 and 163 bits. For group operations, a finite-field arithmetic operation, e.g. multiplication, is designed on a polynomial basis. The delay of a 233-bit point multiplication is only 3.05 and 3.56 μs, in a Xilinx Virtex-7 FPGA, for Koblitz and random curves, respectively, and 0.81 μs in an ASIC 65-nm technology, which are the fastest hardware implementation results reported in the literature to date. In addition, a 163-bit point multiplication is also implemented in FPGA and ASIC for fair comparison, which takes around 0.33 and 0.46 μs, respectively. The area-time product of the proposed point multiplication is very low compared to similar designs. The performance (1/(Area × Time) = 1/AT) and Area × Time × Energy (ATE) product of the proposed design are far better than those of the most significant studies found in the literature.
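The group-operation layer described above ultimately serves one loop: point multiplication by repeated doubling and adding. Below is a minimal Python sketch of that loop, using affine coordinates over a small prime field for readability rather than the paper's Jacobian projective coordinates over binary fields; the combined PDPA datapath fuses the double and the conditional add of each iteration into one hardware unit. The curve constants are toy values, not NIST parameters.

    def ec_add(Pt, Qt, a, p):
        # Affine addition on y^2 = x^3 + a*x + b (mod p); None is the
        # point at infinity.
        if Pt is None: return Qt
        if Qt is None: return Pt
        (x1, y1), (x2, y2) = Pt, Qt
        if x1 == x2 and (y1 + y2) % p == 0:
            return None
        if Pt == Qt:
            lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
        else:
            lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
        x3 = (lam * lam - x1 - x2) % p
        return (x3, (lam * (x1 - x3) - y1) % p)

    def ec_mul(k, Pt, a, p):
        # Left-to-right double-and-add scalar multiplication.
        R = None
        for bit in bin(k)[2:]:
            R = ec_add(R, R, a, p)           # point doubling
            if bit == "1":
                R = ec_add(R, Pt, a, p)      # conditional point addition
        return R

    # Toy usage: G = (3, 6) lies on y^2 = x^3 + 2x + 3 over F_97.
    print(ec_mul(5, (3, 6), 2, 97))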
Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank
NASA Astrophysics Data System (ADS)
Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.
2014-05-01
Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes including the well-known true time-delay and the phased array beamformers have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of a high computational complexity and frequency-dependent far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency independent RF beams at an order of magnitude lower multiplier complexity compared to FFT or FIR filter based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth and fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize high-precision recursive filter structures necessary for real-time beamforming, at RF radio bandwidths, are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There exists native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold (B = N·F_clk/2) bandwidth compared to the maximum clock frequency F_clk Hz of the digital VLSI platform under consideration. Such an increase in bandwidth is achieved without the use of polyphase signal processing or time-interleaved ADC methods. That is, all digital processors operate at the same F_clk clock frequency without phasing, while wideband operation is achieved by sub-sampling of narrower sub-bands at the RF channelizer outputs.
Parallel and pipeline computation of fast unitary transforms
NASA Technical Reports Server (NTRS)
Fino, B. J.; Algazi, V. R.
1975-01-01
The letter discusses the parallel and pipeline organization of fast-unitary-transform algorithms such as the fast Fourier transform, and points out the efficiency of a combined parallel-pipeline processor for a transform such as the Haar transform, in which 2^n − 1 hardware 'butterflies' generate a transform of order 2^n every computation cycle.
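A minimal Python sketch of the Haar transform whose butterfly structure is discussed above: each stage is one rank of sum/difference butterflies, and only the running sums feed the next stage, which is why a fixed pipeline of 2^n − 1 butterflies can emit a full transform of order 2^n every cycle. The power-of-two length restriction is assumed.

    import numpy as np

    def haar(x):
        # Ordered, orthonormal Haar transform of a length-2^n signal.
        x = np.asarray(x, dtype=float)
        out = []
        while len(x) > 1:
            s = (x[0::2] + x[1::2]) / np.sqrt(2)   # smooth half (passed on)
            d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail half (emitted)
            out.append(d)
            x = s
        out.append(x)
        return np.concatenate(out[::-1])

    # haar([4, 6, 10, 12]) -> [16, -6, -1.414..., -1.414...]; energy is preserved.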
Large-scale virtual screening on public cloud resources with Apache Spark.
Capuccini, Marco; Ahmed, Laeeq; Schaal, Wesley; Laure, Erwin; Spjuth, Ola
2017-01-01
Structure-based virtual screening is an in-silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it constitutes a trivially parallelizable task. Most of the available parallel implementations are based on the Message Passing Interface (MPI), relying on low-failure-rate hardware and fast network connections. Google's MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, providing transparent scalability and fault tolerance at the software level. Open source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark. We developed a method to run existing docking-based screening software on distributed cloud resources, utilizing the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, docking a publicly available target receptor against ~2.2 M compounds. The performance experiments show a good parallel efficiency (87%) when running in a public cloud environment. Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve allows for trying out our method on relatively small libraries first and then scaling to larger libraries. Our implementation is named Spark-VS and it is freely available as open source from GitHub (https://github.com/mcapuccini/spark-vs).
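A hedged PySpark sketch of the embarrassingly parallel pattern described above, not Spark-VS's actual pipeline: the library is a distributed dataset, docking is a map over it, and the best scores are gathered back. The file name is hypothetical and the docking function is a stub; the real pipeline wraps an existing docking program.

    from pyspark import SparkContext

    def dock(line):
        # Placeholder scoring stub: a real pipeline shells out to a docking
        # binary for the compound on this line and parses its score.
        return (line, float(len(line)))

    sc = SparkContext(appName="spark-vs-sketch")
    library = sc.textFile("library.smi")                  # hypothetical library file
    top_hits = (library
                .map(dock)                                # trivially parallel docking map
                .takeOrdered(100, key=lambda kv: kv[1]))  # keep the 100 best scores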
NASA Technical Reports Server (NTRS)
Fijany, A.; Roberts, J. A.; Jain, A.; Man, G. K.
1993-01-01
Part 1 of this paper presented the requirements for the real-time simulation of the Cassini spacecraft along with some discussion of the DARTS algorithm. Here, in Part 2, we discuss the development and implementation of the parallel/vectorized DARTS algorithm and architecture for real-time simulation. Development of fast algorithms and architectures for real-time hardware-in-the-loop simulation of spacecraft dynamics is motivated by the fact that it represents a hard real-time problem, in the sense that the correctness of the simulation depends on both the numerical accuracy and the exact timing of the computation. For a given model fidelity, the computation must complete within a predefined time period. Further reduction in computation time allows increasing the fidelity of the model (i.e., inclusion of more flexible modes) and of the integration routine.
Model-based spectral estimation of Doppler signals using parallel genetic algorithms.
Solano González, J; Rodríguez Vázquez, K; García Nocetti, D F
2000-05-01
Conventional spectral analysis methods use a fast Fourier transform (FFT) on consecutive or overlapping windowed data segments. For Doppler ultrasound signals, this approach suffers from inadequate frequency resolution due to the time-segment duration and the non-stationary characteristics of the signals. Parametric or model-based estimators can give significant improvements in time-frequency resolution at the expense of higher computational complexity. This work describes an approach that implements, in real time, a parametric spectral estimation method using genetic algorithms (GAs) to find the optimum set of parameters for the adaptive filter that minimises the error function. The aim is to reduce the computational complexity of the conventional algorithm by using the simplicity associated with GAs and exploiting their parallel characteristics. This will allow the implementation of higher-order filters, increasing the spectrum resolution, and opening greater scope for using more complex methods.
Huang, Shouren; Bergström, Niklas; Yamakawa, Yuji; Senoo, Taku; Ishikawa, Masatoshi
2016-01-01
It is traditionally difficult to implement fast and accurate position regulation on an industrial robot in the presence of uncertainties. The uncertain factors can be attributed either to the industrial robot itself (e.g., a mismatch of dynamics, mechanical defects such as backlash, etc.) or to the external environment (e.g., calibration errors, misalignment or perturbations of a workpiece, etc.). This paper proposes a systematic approach to implement high-performance position regulation under uncertainties on a general industrial robot (referred to as the main robot) with minimal or no manual teaching. The method is based on a coarse-to-fine strategy that involves configuring an add-on module for the main robot's end effector. The add-on module consists of a 1000 Hz vision sensor and a high-speed actuator to compensate for accumulated uncertainties. The main robot only focuses on fast and coarse motion, with its trajectories automatically planned by image information from a static low-cost camera. Fast and accurate peg-and-hole alignment in one dimension was implemented as an application scenario by using a commercial parallel-link robot and an add-on compensation module with one degree of freedom (DoF). Experimental results yielded an almost 100% success rate for fast peg-in-hole manipulation (with regulation accuracy at about 0.1 mm) when the workpiece was randomly placed.
Novel Scalable 3-D MT Inverse Solver
NASA Astrophysics Data System (ADS)
Kuvshinov, A. V.; Kruglyakov, M.; Geraskin, A.
2016-12-01
We present a new, robust and fast, three-dimensional (3-D) magnetotelluric (MT) inverse solver. As the forward modelling engine, the highly scalable solver extrEMe [1] is used. The (regularized) inversion is based on an iterative gradient-type optimization (quasi-Newton method) and exploits an adjoint-sources approach for fast calculation of the gradient of the misfit. The inverse solver is able to deal with highly detailed and contrasting models, allows for working (separately or jointly) with any type of MT (single-site and/or inter-site) responses, and supports massive parallelization. Different parallelization strategies implemented in the code allow for optimal usage of available computational resources for a given problem setup. To parameterize the inverse domain, a mask approach is implemented, which means that one can merge any subset of forward modelling cells in order to account for the (usually) irregular distribution of observation sites. We report results of 3-D numerical experiments aimed at analysing the robustness, performance and scalability of the code. In particular, our computational experiments, carried out on different platforms ranging from modern laptops to high-performance clusters, demonstrate practically linear scalability of the code up to thousands of nodes. 1. Kruglyakov, M., A. Geraskin, A. Kuvshinov, 2016. Novel accurate and scalable 3-D MT forward solver based on a contracting integral equation method, Computers and Geosciences, in press.
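A minimal sketch of the optimization loop described above: a quasi-Newton (L-BFGS) driver that receives the misfit and its gradient from one callback, exactly the interface an adjoint computation provides. The forward operator here is a toy linear model standing in for the extrEMe engine; all names are illustrative.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    G = rng.standard_normal((50, 20))        # toy linear forward operator
    d_obs = G @ rng.standard_normal(20)      # synthetic observed responses

    def misfit_and_gradient(m):
        # In the real solver the responses come from the 3-D forward engine
        # and the gradient from adjoint sources; a linear model stands in
        # here so the driver actually runs.
        r = G @ m - d_obs
        return 0.5 * r @ r, G.T @ r          # misfit and its (adjoint) gradient

    res = minimize(misfit_and_gradient, np.zeros(20), jac=True, method="L-BFGS-B")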
A note on parallel and pipeline computation of fast unitary transforms
NASA Technical Reports Server (NTRS)
Fino, B. J.; Algazi, V. R.
1974-01-01
The parallel and pipeline organization of fast unitary transform algorithms such as the fast Fourier transform is discussed. The efficiency of a combined parallel-pipeline processor is pointed out for a transform such as the Haar transform, in which 2^n − 1 hardware butterflies generate a transform of order 2^n every computation cycle.
A Parallel Fast Sweeping Method for the Eikonal Equation
NASA Astrophysics Data System (ADS)
Baker, B.
2017-12-01
Recently, there has been an exciting emergence of probabilistic methods for travel time tomography. Unlike gradient-based optimization strategies, probabilistic tomographic methods are resistant to becoming trapped in a local minimum and provide a much better quantification of parameter resolution than, say, appealing to ray density or performing checkerboard reconstruction tests. The benefits associated with random sampling methods, however, are only realized by successive computation of predicted travel times in potentially strongly heterogeneous media. To this end, this abstract is concerned with expediting the solution of the Eikonal equation. While many Eikonal solvers use a fast marching method, the proposed solver uses the iterative fast sweeping method because the eight fixed sweep orderings in each iteration are natural targets for parallelization. To reduce the number of iterations and grid points required, the high-accuracy finite-difference stencil of Nobel et al., 2014 is implemented. A directed acyclic graph (DAG) is created with a priori knowledge of the sweep ordering and finite-difference stencil. By performing a topological sort of the DAG, sets of independent nodes are identified as candidates for concurrent updating. Additionally, the proposed solver will also address scalability during earthquake relocation, a necessary step in local and regional earthquake tomography and a barrier to extending probabilistic methods from active-source to passive-source applications, by introducing an asynchronous parallel forward-solve phase for all receivers in the network. Synthetic examples using the SEG over-thrust model will be presented.
Parallel, stochastic measurement of molecular surface area.
Juba, Derek; Varshney, Amitabh
2008-08-01
Biochemists often wish to compute surface areas of proteins. A variety of algorithms have been developed for this task, but they are designed for traditional single-processor architectures. The current trend in computer hardware is towards increasingly parallel architectures for which these algorithms are not well suited. We describe a parallel, stochastic algorithm for molecular surface area computation that maps well to the emerging multi-core architectures. Our algorithm is also progressive, providing a rough estimate of surface area immediately and refining this estimate as time goes on. Furthermore, the algorithm generates points on the molecular surface which can be used for point-based rendering. We demonstrate a GPU implementation of our algorithm and show that it compares favorably with several existing molecular surface computation programs, giving fast estimates of the molecular surface area with good accuracy.
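A minimal Python sketch of the stochastic estimate described above: throw random points on each atom's sphere and keep the fraction not buried inside any other atom. More samples tighten the estimate (the progressive refinement the paper exploits), and each atom is independent, which is what maps onto GPU threads. The O(n^2) neighbor check and sample count are illustrative simplifications.

    import numpy as np

    rng = np.random.default_rng(2)

    def stochastic_surface_area(centers, radii, samples_per_atom=1000):
        centers = np.asarray(centers, dtype=float)
        area = 0.0
        for i, (c, r) in enumerate(zip(centers, radii)):
            # Uniform points on atom i's sphere via normalized Gaussians.
            v = rng.standard_normal((samples_per_atom, 3))
            pts = c + r * v / np.linalg.norm(v, axis=1, keepdims=True)
            buried = np.zeros(samples_per_atom, dtype=bool)
            for j, (cj, rj) in enumerate(zip(centers, radii)):
                if j != i:
                    buried |= np.linalg.norm(pts - cj, axis=1) < rj
            # Exposed fraction of this sphere times its full area.
            area += 4 * np.pi * r * r * np.mean(~buried)
        return area

    # Two overlapping unit spheres: estimate is below 2 * 4*pi.
    print(stochastic_surface_area([[0.0, 0, 0], [1.5, 0, 0]], [1.0, 1.0]))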
NASA Astrophysics Data System (ADS)
Xie, Lizhe; Hu, Yining; Chen, Yang; Shi, Luyao
2015-03-01
Projection and back-projection are the most computationally consuming parts of Computed Tomography (CT) reconstruction. Parallelization strategies using GPU computing techniques have been introduced. In this paper we present a new parallelization scheme for both projection and back-projection. The proposed method is based on CUDA technology developed by NVIDIA Corporation. Instead of building a complex model, we aimed at optimizing the existing algorithm and making it suitable for CUDA implementation so as to gain fast computation speed. Besides making use of the texture fetching operation, which helps gain faster interpolation speed, we fixed the sampling numbers in the computation of projection to ensure the synchronization of blocks and threads, thus preventing the latency caused by inconsistent computation complexity. Experimental results demonstrate the computational efficiency and imaging quality of the proposed method.
Highly parallel implementation of non-adiabatic Ehrenfest molecular dynamics
NASA Astrophysics Data System (ADS)
Kanai, Yosuke; Schleife, Andre; Draeger, Erik; Anisimov, Victor; Correa, Alfredo
2014-03-01
While the adiabatic Born-Oppenheimer approximation tremendously lowers computational effort, many questions in modern physics, chemistry, and materials science require an explicit description of coupled non-adiabatic electron-ion dynamics. Electronic stopping, i.e. the energy transfer of a fast projectile atom to the electronic system of the target material, is a notorious example. We recently implemented real-time time-dependent density functional theory based on the plane-wave pseudopotential formalism in the Qbox/qb@ll codes. We demonstrate that explicit integration using a fourth-order Runge-Kutta scheme is very suitable for modern highly parallelized supercomputers. Applying the new implementation to systems with hundreds of atoms and thousands of electrons, we achieved excellent performance and scalability on a large number of nodes, both on the BlueGene-based "Sequoia" system at LLNL and on the Cray architecture of "Blue Waters" at NCSA. As an example, we discuss our work on computing the electronic stopping power of aluminum and gold for hydrogen projectiles, showing excellent agreement with experiment. These first-principles calculations allow us to gain important insight into the fundamental physics of electronic stopping.
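For reference, a minimal Python sketch of the explicit fourth-order Runge-Kutta step named above, applied to a generic complex-valued ODE rather than the actual plane-wave time-dependent Kohn-Sham equations; the two-level Hamiltonian and step size are toy values.

    import numpy as np

    def rk4_step(f, t, y, dt):
        # One explicit RK4 step for dy/dt = f(t, y).
        k1 = f(t, y)
        k2 = f(t + dt/2, y + dt/2 * k1)
        k3 = f(t + dt/2, y + dt/2 * k2)
        k4 = f(t + dt, y + dt * k3)
        return y + dt/6 * (k1 + 2*k2 + 2*k3 + k4)

    # Toy example: a two-level state under i dpsi/dt = H psi.
    H = np.array([[1.0, 0.2], [0.2, -1.0]])
    f = lambda t, psi: -1j * (H @ psi)
    psi = np.array([1.0 + 0j, 0.0])
    for _ in range(1000):
        psi = rk4_step(f, 0.0, psi, 1e-3)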
A domain-specific compiler for a parallel multiresolution adaptive numerical simulation environment
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rajbhandari, Samyam; Kim, Jinsung; Krishnamoorthy, Sriram
This paper describes the design and implementation of a layered domain-specific compiler to support MADNESS---Multiresolution ADaptive Numerical Environment for Scientific Simulation. MADNESS is a high-level software environment for the solution of integral and differential equations in many dimensions, using adaptive and fast harmonic analysis methods with guaranteed precision. MADNESS uses k-d trees to represent spatial functions and implements operators like addition, multiplication, differentiation, and integration on the numerical representation of functions. The MADNESS runtime system provides global namespace support and a task-based execution model including futures. MADNESS is currently deployed on massively parallel supercomputers and has enabled many science advances. Due to the highly irregular and statically unpredictable structure of the k-d trees representing the spatial functions encountered in MADNESS applications, only purely runtime approaches to optimization have previously been implemented in the MADNESS framework. This paper describes a layered domain-specific compiler developed to address some performance bottlenecks in MADNESS. The newly developed static compile-time optimizations, in conjunction with the MADNESS runtime support, enable significant performance improvement for the MADNESS framework.
Parallel Computer System for 3D Visualization Stereo on GPU
NASA Astrophysics Data System (ADS)
Al-Oraiqat, Anas M.; Zori, Sergii A.
2018-03-01
This paper proposes the organization of a parallel computer system based on graphics processing units (GPUs) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of tracing-ray intersections with scene objects. The system allows a significant increase in productivity for 3D stereo synthesis of photorealistic quality. A generalized procedure for 3D stereo image synthesis on Graphics Processing Units/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration of the multi-threaded implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of the size and configuration of the computational Compute Unified Device Architecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimates can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as an optimized configuration of the computing CUDA network.
Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois-Henry; ...
2016-10-27
Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.
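The randomized sampling step mentioned above can be illustrated in a few lines of NumPy. This is a generic randomized range finder (sketch, orthogonalize, compress), not STRUMPACK's actual HSS constructor, which works block-wise and matrix-free; the rank and oversampling values are illustrative.

    import numpy as np

    def randomized_lowrank(A, rank, oversample=10):
        # Sketch A with a Gaussian test matrix, orthogonalize the sample,
        # and compress: returns Q, B with A ~= Q @ B.
        rng = np.random.default_rng(3)
        omega = rng.standard_normal((A.shape[1], rank + oversample))
        Q, _ = np.linalg.qr(A @ omega)       # approximate column range of A
        return Q, Q.T @ A

    # Toy usage on an exactly rank-50 matrix.
    r0, r1 = np.random.default_rng(0), np.random.default_rng(1)
    A = r0.standard_normal((200, 50)) @ r1.standard_normal((50, 300))
    Q, B = randomized_lowrank(A, rank=50)
    assert np.linalg.norm(A - Q @ B) / np.linalg.norm(A) < 1e-8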
Block-Parallel Data Analysis with DIY2
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morozov, Dmitriy; Peterka, Tom
DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial, parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.
NASA Astrophysics Data System (ADS)
Palmesi, P.; Exl, L.; Bruckner, F.; Abert, C.; Suess, D.
2017-11-01
The long-range magnetic field is the most time-consuming part of micromagnetic simulations. Computational improvements can relieve problems related to this bottleneck. This work presents an efficient implementation of the Fast Multipole Method (FMM) for the magnetic scalar potential as used in micromagnetics. The novelty lies in extending the FMM to linearly magnetized tetrahedral sources, making it interesting also for other areas of computational physics. We treat the near field directly and use (exact) numerical integration of the multipole expansion in the far field. This approach tackles important issues like the vectorial and continuous nature of the magnetic field. By using the FMM, the calculations scale linearly in time and memory.
Parallel ICA and its hardware implementation in hyperspectral image analysis
NASA Astrophysics Data System (ADS)
Du, Hongtao; Qi, Hairong; Peterson, Gregory D.
2004-04-01
Advances in hyperspectral imaging have dramatically boosted remote sensing applications by providing abundant information using hundreds of contiguous spectral bands. However, the high volume of information also results in an excessive computational burden. Since most materials have specific characteristics only at certain bands, much of this information is redundant. This property of hyperspectral images has motivated many researchers to study various dimensionality reduction algorithms, including Projection Pursuit (PP), Principal Component Analysis (PCA), the wavelet transform, and Independent Component Analysis (ICA), where ICA is one of the most popular techniques. It searches for a linear or nonlinear transformation that minimizes the statistical dependence between spectral bands. Through this process, ICA can eliminate superfluous information while retaining practical information, given only the observations of hyperspectral images. One hurdle of applying ICA in hyperspectral image (HSI) analysis, however, is its long computation time, especially for high-volume hyperspectral data sets. Even the most efficient method, FastICA, is a very time-consuming process. In this paper, we present a parallel ICA (pICA) algorithm derived from FastICA. During the unmixing process, pICA divides the estimation of the weight matrix into sub-processes which can be conducted in parallel on multiple processors. The decorrelation process is decomposed into internal decorrelation and external decorrelation, which perform weight-vector decorrelations within individual processors and between cooperative processors, respectively. In order to further improve the performance of pICA, we seek hardware solutions in the implementation of pICA. To date, there have been very few hardware designs for ICA-related processes due to the complicated and iterative computation. This paper discusses the capacity limitations of FPGA implementations of pICA in HSI analysis. An Application-Specific Integrated Circuit (ASIC) synthesis is designed for pICA-based dimensionality reduction in HSI analysis. The pICA design is implemented using standard-height cells and targets the TSMC 0.18 micron process. During the synthesis procedure, three ICA-related reconfigurable components are developed for reuse and retargeting purposes. Preliminary results show that the standard-height-cell-based ASIC synthesis provides an effective solution for pICA and ICA-related processes in HSI analysis.
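A serial Python sketch of the FastICA kernel that pICA distributes: the one-unit fixed-point update with a tanh contrast, followed by Gram-Schmidt decorrelation against previously found components. pICA's contribution, not shown here, is splitting the update across processors and dividing the decorrelation into internal and external phases. Data are assumed whitened; all parameter values are illustrative.

    import numpy as np

    def fastica(X, n_components, n_iter=200, tol=1e-6):
        # X: whitened observations (channels x samples).
        rng = np.random.default_rng(4)
        W = np.zeros((n_components, X.shape[0]))
        for k in range(n_components):
            w = rng.standard_normal(X.shape[0])
            w /= np.linalg.norm(w)
            for _ in range(n_iter):
                wx = w @ X
                g, dg = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
                w_new = (X * g).mean(axis=1) - dg.mean() * w   # fixed-point step
                w_new -= W[:k].T @ (W[:k] @ w_new)   # decorrelate vs found rows
                w_new /= np.linalg.norm(w_new)
                done = abs(abs(w_new @ w) - 1.0) < tol
                w = w_new
                if done:
                    break
            W[k] = w
        return W                                      # rows of the unmixing matrix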
Hesford, Andrew J; Tillett, Jason C; Astheimer, Jeffrey P; Waag, Robert C
2014-08-01
Accurate and efficient modeling of ultrasound propagation through realistic tissue models is important to many aspects of clinical ultrasound imaging. Simplified problems with known solutions are often used to study and validate numerical methods. Greater confidence in a time-domain k-space method and a frequency-domain fast multipole method is established in this paper by analyzing results for realistic models of the human breast. Models of breast tissue were produced by segmenting magnetic resonance images of ex vivo specimens into seven distinct tissue types. After confirming with histologic analysis by pathologists that the model structures mimicked in vivo breast, the tissue types were mapped to variations in sound speed and acoustic absorption. Calculations of acoustic scattering by the resulting model were performed on massively parallel supercomputer clusters using parallel implementations of the k-space method and the fast multipole method. The efficient use of these resources was confirmed by parallel efficiency and scalability studies using large-scale, realistic tissue models. Comparisons between the temporal and spectral results were performed in representative planes by Fourier transforming the temporal results. An RMS field error less than 3% throughout the model volume confirms the accuracy of the methods for modeling ultrasound propagation through human breast.
NASA Astrophysics Data System (ADS)
Wang, Tai-Han; Huang, Da-Nian; Ma, Guo-Qing; Meng, Zhao-Hai; Li, Ye
2017-06-01
With the continuous development of full tensor gradiometer (FTG) measurement techniques, three-dimensional (3D) inversion of FTG data is increasingly used in oil and gas exploration. In the fast processing and interpretation of large-scale high-precision data, the use of the graphics processing unit (GPU) and of preconditioning methods is very important in the data inversion. In this paper, an improved preconditioned conjugate gradient algorithm is proposed by combining the symmetric successive over-relaxation (SSOR) technique and the incomplete Cholesky decomposition conjugate gradient algorithm (ICCG). Since preparing the preconditioner requires extra time, a parallel implementation based on the GPU is proposed. The improved method is then applied to the inversion of noise-contaminated synthetic data to prove its adaptability in the inversion of 3D FTG data. Results show that the parallel SSOR-ICCG algorithm based on an NVIDIA Tesla C2050 GPU achieves a speedup of approximately 25 times that of a serial program using a 2.0 GHz Central Processing Unit (CPU). Real airborne gravity-gradiometry data from the Vinton salt dome (southwest Louisiana, USA) are also considered. Good results are obtained, which verifies the efficiency and feasibility of the proposed parallel method for fast inversion of 3D FTG data.
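A minimal SciPy sketch of an SSOR-preconditioned conjugate gradient solve, the serial counterpart of the kernel the paper moves to the GPU (the incomplete-Cholesky variant is analogous). For a symmetric positive-definite A = L + D + L^T, the SSOR preconditioner is M = ω/(2−ω) (D/ω + L) D^{-1} (D/ω + L)^T, applied by one forward and one backward triangular solve. The relaxation factor and test matrix are illustrative.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator, cg, spsolve_triangular

    def ssor_preconditioner(A, omega=1.2):
        # Returns M^{-1} as a LinearOperator for a sparse SPD matrix A.
        D = sp.diags(A.diagonal())
        lower = sp.tril(A, k=-1).tocsr() + (D / omega).tocsr()  # D/w + L
        upper = lower.T.tocsr()                                 # D/w + L^T
        d = A.diagonal()
        def apply(r):
            y = spsolve_triangular(lower, r, lower=True)   # forward solve
            z = spsolve_triangular(upper, d * y, lower=False)  # backward solve
            return (2 - omega) / omega * z
        return LinearOperator(A.shape, matvec=apply)

    # Toy usage on a 1-D Laplacian.
    n = 200
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    x, info = cg(A, b, M=ssor_preconditioner(A))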
LDRD final report on massively-parallel linear programming: the parPCx system.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Parekh, Ojas; Phillips, Cynthia Ann; Boman, Erik Gunnar
2005-02-01
This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project ''Massively-Parallel Linear Programming''. We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runs on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods, including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Integer and Combinatorial Optimizer). We conclude with directions for long-term future algorithmic research and for near-term development that could improve the performance of parPCx.
Moving magnets in a micromagnetic finite-difference framework
NASA Astrophysics Data System (ADS)
Rissanen, Ilari; Laurson, Lasse
2018-05-01
We present a method and an implementation for smooth linear motion in a finite-difference-based micromagnetic simulation code, to be used in simulating magnetic friction and other phenomena involving moving microscale magnets. Our aim is to accurately simulate the magnetization dynamics and relative motion of magnets while retaining high computational speed. To this end, we combine techniques for fast scalar potential calculation and cubic b-spline interpolation, parallelizing them on a graphics processing unit (GPU). The implementation also includes the possibility of explicitly simulating eddy currents in the case of conducting magnets. We test our implementation by providing numerical examples of stick-slip motion of thin films pulled by a spring and the effect of eddy currents on the switching time of magnetic nanocubes.
An Artificial Neural Networks Method for Solving Partial Differential Equations
NASA Astrophysics Data System (ADS)
Alharbi, Abir
2010-09-01
While many analytical and numerical techniques already exist for solving PDEs, this paper introduces an approach using artificial neural networks. The approach consists of a technique developed by combining the standard numerical method, finite differences, with the Hopfield neural network. The method is denoted Hopfield-finite-difference (HFD). The architecture of the nets, the energy function, the updating equations, and the algorithms are developed for the method. The HFD method has been used successfully to approximate the solution of classical PDEs, such as the Wave, Heat, Poisson and Diffusion equations, and of a system of PDEs. The software Matlab is used to obtain the results in both tabular and graphical form. The results are similar in terms of accuracy to those obtained by standard numerical methods. In terms of speed, the parallel nature of the Hopfield nets makes the method easier to implement on fast parallel computers, while some numerical methods need extra effort for parallelization.
Density-based parallel skin lesion border detection with webCL.
Lemon, James; Kockara, Sinan; Halic, Tansel; Mete, Mutlu
2015-01-01
Dermoscopy is a highly effective and noninvasive imaging technique used in diagnosis of melanoma and other pigmented skin lesions. Many aspects of the lesion under consideration are defined in relation to the lesion border. This makes border detection one of the most important steps in dermoscopic image analysis. In current practice, dermatologists often delineate borders through a hand drawn representation based upon visual inspection. Due to the subjective nature of this technique, intra- and inter-observer variations are common. Because of this, the automated assessment of lesion borders in dermoscopic images has become an important area of study. Fast density based skin lesion border detection method has been implemented in parallel with a new parallel technology called WebCL. WebCL utilizes client side computing capabilities to use available hardware resources such as multi cores and GPUs. Developed WebCL-parallel density based skin lesion border detection method runs efficiently from internet browsers. Previous research indicates that one of the highest accuracy rates can be achieved using density based clustering techniques for skin lesion border detection. While these algorithms do have unfavorable time complexities, this effect could be mitigated when implemented in parallel. In this study, density based clustering technique for skin lesion border detection is parallelized and redesigned to run very efficiently on the heterogeneous platforms (e.g. tablets, SmartPhones, multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units) by transforming the technique into a series of independent concurrent operations. Heterogeneous computing is adopted to support accessibility, portability and multi-device use in the clinical settings. For this, we used WebCL, an emerging technology that enables a HTML5 Web browser to execute code in parallel for heterogeneous platforms. We depicted WebCL and our parallel algorithm design. In addition, we tested parallel code on 100 dermoscopy images and showed the execution speedups with respect to the serial version. Results indicate that parallel (WebCL) version and serial version of density based lesion border detection methods generate the same accuracy rates for 100 dermoscopy images, in which mean of border error is 6.94%, mean of recall is 76.66%, and mean of precision is 99.29% respectively. Moreover, WebCL version's speedup factor for 100 dermoscopy images' lesion border detection averages around ~491.2. When large amount of high resolution dermoscopy images considered in a usual clinical setting along with the critical importance of early detection and diagnosis of melanoma before metastasis, the importance of fast processing dermoscopy images become obvious. In this paper, we introduce WebCL and the use of it for biomedical image processing applications. WebCL is a javascript binding of OpenCL, which takes advantage of GPU computing from a web browser. Therefore, WebCL parallel version of density based skin lesion border detection introduced in this study can supplement expert dermatologist, and aid them in early diagnosis of skin lesions. While WebCL is currently an emerging technology, a full adoption of WebCL into the HTML5 standard would allow for this implementation to run on a very large set of hardware and software systems. WebCL takes full advantage of parallel computational resources including multi-cores and GPUs on a local machine, and allows for compiled code to run directly from the Web Browser.
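A minimal sketch (an assumption for illustration, not the paper's WebCL kernels) of the data-parallel core of density-based clustering: counting, for every pixel, how many neighbours fall within a radius eps in feature space. Each pixel's count is independent of the others, which is what makes the GPU mapping natural.

    import numpy as np

    def density_counts(features, eps):
        # features: (n_pixels, n_dims), e.g. intensity plus x, y position.
        # Fully vectorized all-pairs distances; each row could equally be
        # computed by one GPU work-item.
        d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
        return (d2 <= eps * eps).sum(axis=1)

    pixels = np.random.rand(1000, 3)           # toy stand-in for image features
    core = density_counts(pixels, 0.05) >= 8   # DBSCAN-style core-point test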
The Anisotropic Structure of South China Sea: Using OBS Data to Constrain Mantle Flow
NASA Astrophysics Data System (ADS)
Li, L.; Xue, M.; Yang, T.; Liu, C.; Hua, Q.; Xia, S.; Huang, H.; Le, B. M.; Huo, D.; Pan, M.
2015-12-01
The dynamic mechanism of the formation of the South China Sea (SCS) has been debated for decades. The anisotropic structure can provide useful insight into the complex evolution of the SCS by indicating its mantle flow direction and strength. In this study, we apply shear wave splitting methods to two half-year seismic datasets collected from 10 and 6 passive-source Ocean Bottom Seismometers (OBSs), respectively. These OBSs were deployed along both sides of the extinct ridge in the central basin of the SCS by Tongji University in 2012 and 2013, and were successfully recovered in 2013 and 2015, respectively. Through processing and inspecting the global and regional earthquakes of the 2012 dataset (with local events still being processed), measurements were made for 2 global events and 24 regional events at 5 OBSs using the tangential energy minimization, smallest eigenvalue minimization, and correlation methods. We also applied cluster analysis to the splitting results obtained for different time windows and for different frequency bands. For teleseismic core phases such as SKS and PKS, we find the fast polarization direction beneath the central basin is approximately NE-SW, nearly parallel to the extinct ridge in the central basin of the SCS. For regional events, the splitting analysis on S, PS and ScS phases shows much more complicated fast directions, as the ray path varies for different phases. The observed fast directions can be divided into three groups: (1) for events from the Eurasian plate, a gradual rotation of the fast polarization direction from NNE-SSW to NEE-SWW along the path from the inner Eurasian plate to the central SCS is observed, implying the mantle flow is controlled by the India-Eurasia collision; (2) for events located at the junction of the Pacific and Philippine plates, the dominant fast direction is NW-SE, almost perpendicular to the Ryukyu Trench and sub-parallel to the absolute motion direction of the Philippine plate; (3) for events that occurred to the SE near the Philippine Fault zone, the observed NE-SW fast direction is sub-parallel to the subduction direction of the Philippine plate.
GPU-based ultra-fast dose calculation using a finite size pencil beam model.
Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B
2009-10-21
Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposition coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation on a water phantom and on a prostate cancer patient case with varying beamlet and voxel sizes. All testing scenarios achieved a speedup ranging from 200 to 400 times when using an NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy.
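A minimal sketch (toy sizes, not the authors' GPU engine) of why dose calculation is so data-parallel: once the dose deposition coefficient matrix D is known, the dose in every voxel is an independent weighted sum over beamlets, d = D @ w.

    import numpy as np

    n_voxels, n_beamlets = 10_000, 200
    D = np.random.rand(n_voxels, n_beamlets)   # toy deposition coefficients
    w = np.random.rand(n_beamlets)             # beamlet weights from the optimizer
    dose = D @ w                               # each voxel computed independently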
Accelerating a three-dimensional eco-hydrological cellular automaton on GPGPU with OpenCL
NASA Astrophysics Data System (ADS)
Senatore, Alfonso; D'Ambrosio, Donato; De Rango, Alessio; Rongo, Rocco; Spataro, William; Straface, Salvatore; Mendicino, Giuseppe
2016-10-01
This work presents an effective implementation of a numerical model for complete eco-hydrological Cellular Automata modeling on Graphics Processing Units (GPUs) with OpenCL (Open Computing Language) for heterogeneous computation (i.e., on CPUs and/or GPUs). Different types of parallel implementations were carried out (e.g., use of fast local memory, loop unrolling, etc.), showing increasing performance improvements in terms of speedup, and some original optimization strategies were also adopted. Moreover, numerical analysis of the results (i.e., comparison of CPU and GPU outcomes in terms of rounding errors) has proven satisfactory. Experiments were carried out on a workstation with two CPUs (Intel Xeon E5440 at 2.83 GHz), one AMD R9 280X GPU and one NVIDIA Tesla K20c GPU. Results have been extremely positive, but further testing should be performed to assess the functionality of the adopted strategies on other complete models and their ability to fruitfully exploit the resources of parallel systems.
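A minimal sketch (an assumption for illustration, not the paper's OpenCL kernels) of the data-parallel core of a cellular-automaton step: every cell's new state depends only on its neighbourhood, so all cells can update synchronously, which is exactly the structure a GPU exploits.

    import numpy as np

    def ca_step(m, k, dt):
        # m: 2-D grid of a cell state (e.g. moisture); k: exchange coefficient.
        # Periodic neighbours via np.roll; one synchronous, fully parallel update.
        nbr_sum = (np.roll(m, 1, 0) + np.roll(m, -1, 0) +
                   np.roll(m, 1, 1) + np.roll(m, -1, 1))
        return m + dt * k * (nbr_sum - 4 * m)

    m = np.random.rand(256, 256)
    for _ in range(100):
        m = ca_step(m, k=0.2, dt=1.0)   # k*dt <= 0.25 keeps the update stable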
Hielscher, Andreas H; Bartel, Sebastian
2004-02-01
Optical tomography (OT) is a fast-developing novel imaging modality that uses near-infrared (NIR) light to obtain cross-sectional views of the optical properties inside the human body. A major challenge remains the time-consuming, computationally intensive image reconstruction problem that converts NIR transmission measurements into cross-sectional images. To increase the speed of the iterative image reconstruction schemes that are commonly applied for OT, we have developed and implemented several parallel algorithms on a cluster of workstations. Static process distribution as well as dynamic load balancing schemes suitable for heterogeneous clusters and varying machine performance are introduced and tested. The resulting algorithms are shown to accelerate the reconstruction process to various degrees, substantially reducing the computation times for clinically relevant problems.
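A minimal sketch (hypothetical task set, not the authors' cluster code) of the dynamic load balancing idea: a pool of workers pulls forward-solve tasks as they free up, so faster machines naturally take on more work than slower ones.

    from multiprocessing import Pool
    import numpy as np

    def forward_solve(source_id):
        # stand-in for one NIR forward simulation for one source position
        rng = np.random.default_rng(source_id)
        return source_id, rng.random(64)     # simulated detector readings

    if __name__ == "__main__":
        with Pool(processes=8) as pool:
            # imap_unordered hands out tasks as workers finish (dynamic balancing)
            results = dict(pool.imap_unordered(forward_solve, range(100)))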
Symmetric nonnegative matrix factorization: algorithms and applications to probabilistic clustering.
He, Zhaoshui; Xie, Shengli; Zdunek, Rafal; Zhou, Guoxu; Cichocki, Andrzej
2011-12-01
Nonnegative matrix factorization (NMF) is an unsupervised learning method useful in various applications including image processing and semantic analysis of documents. This paper focuses on symmetric NMF (SNMF), which is a special case of NMF decomposition. Three parallel multiplicative update algorithms that directly use level-3 basic linear algebra subprograms are developed for this problem. First, by minimizing the Euclidean distance, a multiplicative update algorithm is proposed, and its convergence under mild conditions is proved. Based on it, we further propose two more fast parallel methods: the α-SNMF and β-SNMF algorithms. All of them are easy to implement. These algorithms are applied to probabilistic clustering. We demonstrate their effectiveness for facial image clustering, document categorization, and pattern clustering in gene expression.
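A minimal sketch of symmetric NMF by multiplicative updates: factor a nonnegative similarity matrix A ≈ H H^T. The damped update below is a standard Euclidean-distance rule from the SNMF literature; the paper's α-SNMF and β-SNMF variants differ in detail, so treat this as illustrative only.

    import numpy as np

    def snmf(A, r, iters=500, eps=1e-9):
        n = A.shape[0]
        H = np.random.default_rng(0).random((n, r))
        for _ in range(iters):
            AH = A @ H
            HHtH = H @ (H.T @ H)              # grouping that maps to level-3 BLAS
            H *= 0.5 + 0.5 * AH / (HHtH + eps)
        return H

    A = np.random.rand(50, 50)
    A = (A + A.T) / 2                         # symmetric nonnegative similarity
    H = snmf(A, r=4)
    labels = H.argmax(axis=1)                 # probabilistic-clustering readout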
Computational algorithms for simulations in atmospheric optics.
Konyaev, P A; Lukin, V P
2016-04-20
A computer simulation technique for atmospheric and adaptive optics based on parallel programming is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors.
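A minimal sketch (toy parameters, not the authors' MKL/CUDA code) of the FFT-based split-step core of most atmospheric-optics propagation simulations: apply a random phase screen in the spatial domain, then the free-space Fresnel transfer function in the spectral domain.

    import numpy as np

    n, L, wl, dz = 256, 0.1, 1.55e-6, 100.0    # grid, aperture (m), wavelength (m), step (m)
    x = np.linspace(-L / 2, L / 2, n)
    X, Y = np.meshgrid(x, x)
    u = np.exp(-(X**2 + Y**2) / 0.01**2)       # Gaussian beam amplitude

    fx = np.fft.fftfreq(n, d=L / n)
    FX, FY = np.meshgrid(fx, fx)
    Hfree = np.exp(-1j * np.pi * wl * dz * (FX**2 + FY**2))   # Fresnel transfer fn

    screen = np.random.default_rng(1).normal(0, 0.5, (n, n))  # toy phase screen (rad)
    u = np.fft.ifft2(np.fft.fft2(u * np.exp(1j * screen)) * Hfree)  # one split step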
FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks
Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
2015-01-01
Gene co-expression networks comprise one type of valuable biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphics Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw-data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and R implementations, and that the GPU implementation outperforms the multi-core CPU implementation when the number of genes increases. With a test dataset containing 16,000 genes and 590 individuals, we achieve greater than 63 times the speed using the GPU implementation compared with a single-thread R implementation when 50 percent of the genes were filtered out, and about 80 times the speed when no genes were filtered out. PMID:25602758
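A minimal sketch (CPU reference logic, not the FastGCN GPU kernels) of the pipeline the abstract describes: all-pairs Pearson correlations, significance tests, and Benjamini-Hochberg FDR control to select co-expression edges (the entropy filtering step is omitted for brevity).

    import numpy as np
    from scipy import stats

    def bh_fdr(p, q=0.05):
        # Benjamini-Hochberg: keep the largest k with p_(k) <= q*k/m
        order = np.argsort(p)
        m = len(p)
        passed = p[order] <= q * np.arange(1, m + 1) / m
        keep = np.zeros(m, bool)
        if passed.any():
            keep[order[: passed.nonzero()[0].max() + 1]] = True
        return keep

    expr = np.random.rand(500, 60)                   # genes x samples (toy data)
    r = np.corrcoef(expr)                            # all-pairs Pearson (GPU-friendly)
    n = expr.shape[1]
    t = r * np.sqrt((n - 2) / np.maximum(1e-12, 1 - r**2))
    iu = np.triu_indices_from(r, k=1)                # unique gene pairs
    p = 2 * stats.t.sf(np.abs(t[iu]), df=n - 2)      # two-sided p-values
    edges = bh_fdr(p)                                # significant co-expression edges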
Analysis techniques for diagnosing runaway ion distributions in the reversed field pinch
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, J., E-mail: jkim536@wisc.edu; Anderson, J. K.; Capecchi, W.
2016-11-15
An advanced neutral particle analyzer (ANPA) on the Madison Symmetric Torus measures deuterium ions over the energy range 8-45 keV with an energy resolution of 2-4 keV and a time resolution of 10 μs. Three different experimental configurations measure distinct portions of the naturally occurring fast ion distributions: fast ions moving parallel, anti-parallel, or perpendicular to the plasma current. On a radial-facing port, fast ions moving perpendicular to the current have the necessary pitch to be measured by the ANPA. With the diagnostic positioned on a tangent line through the plasma core, a chord integration over fast ion density, background neutral density, and locally appropriate pitch defines the measured sample. The plasma current can be reversed to measure anti-parallel fast ions in the same configuration. Comparisons of energy distributions for the three configurations show an anisotropic fast ion distribution favoring high-pitch ions.
Fast segmentation of satellite images using SLIC, WebGL and Google Earth Engine
NASA Astrophysics Data System (ADS)
Donchyts, Gennadii; Baart, Fedor; Gorelick, Noel; Eisemann, Elmar; van de Giesen, Nick
2017-04-01
Google Earth Engine (GEE) is a parallel geospatial processing platform which harmonizes access to petabytes of freely available satellite images. It provides a very rich API, allowing development of dedicated algorithms to extract useful geospatial information from these images. At the same time, modern GPUs provide thousands of computing cores, which are mostly not utilized in this context. In recent years, WebGL has become a popular and well-supported API allowing fast image processing directly in web browsers. In this work, we evaluate the applicability of WebGL for fast segmentation of satellite images. A new implementation of the Simple Linear Iterative Clustering (SLIC) algorithm using GPU shaders is presented. SLIC is a simple and efficient method to decompose an image into visually homogeneous regions; it adapts a k-means clustering approach to generate superpixels efficiently. While this approach will be hard to scale, due to the significant amount of data to be transferred to the client, it should significantly improve exploratory possibilities and simplify the development of dedicated algorithms for geoscience applications. Our prototype implementation will be used to improve surface water detection of reservoirs using multispectral satellite imagery.
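A minimal sketch (CPU reference, not the WebGL shaders) of SLIC's assignment step: each pixel is labelled with the nearest cluster centre under a distance that mixes colour and spatial terms, D = sqrt(d_colour^2 + (d_xy/S)^2 * m^2). For brevity this version compares every pixel against every centre; real SLIC restricts the search to a 2S x 2S window around each centre. Every pixel's assignment is independent, which is what the GPU version exploits.

    import numpy as np

    def slic_distance(pix_lab, pix_xy, c_lab, c_xy, S, m=10.0):
        d_lab = ((pix_lab - c_lab) ** 2).sum(-1)       # squared colour distance
        d_xy = ((pix_xy - c_xy) ** 2).sum(-1)          # squared spatial distance
        return np.sqrt(d_lab + (d_xy / S**2) * m**2)

    h, w, S = 200, 300, 20
    lab = np.random.rand(h, w, 3)                      # toy Lab image
    yy, xx = np.mgrid[0:h, 0:w]
    xy = np.stack([yy, xx], axis=-1).astype(float)
    cy, cx = np.mgrid[S // 2:h:S, S // 2:w:S]
    centres = np.stack([cy.ravel(), cx.ravel()], -1).astype(float)

    best = np.full((h, w), np.inf)
    label = np.zeros((h, w), int)
    for k, (cyk, cxk) in enumerate(centres):
        c_lab = lab[int(cyk), int(cxk)]                # seed colour at the centre
        d = slic_distance(lab, xy, c_lab, np.array([cyk, cxk]), S)
        upd = d < best
        label[upd] = k
        best[upd] = d[upd]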
2008-03-11
[OCR fragment of a report abstract; only pieces are recoverable: a joint transform correlator (JTC) based on a dynamic material is said to answer the challenge of fast correlation with large databases; images retrieved from the SPHRAM are used as inputs; joint transform (JTC) and matched spatial filter or VanderLugt (VLC) correlators can each be implemented in real time by degenerate four-wave mixing; the proposed system consists of the SPHROM coupled with a shift-invariant real-time VLC, with the correlation performed in the VLC architecture.]
1996-01-01
[OCR fragment mixing table-of-contents and figure-list residue with abstract text; recoverable content: the report includes a test-bench pseudo-code example (Figure 3-1) with two application nodes, in which the outer test-bench wrapper consists of three functions (pipeline_init, pipeline and an exit function) and the application wrapper is contained in the pipeline routine; Figure 3-2 concerns fast convolution.]
Incompressible SPH (ISPH) with fast Poisson solver on a GPU
NASA Astrophysics Data System (ADS)
Chow, Alex D.; Rogers, Benedict D.; Lind, Steven J.; Stansby, Peter K.
2018-05-01
This paper presents a fast incompressible SPH (ISPH) solver implemented to run entirely on a graphics processing unit (GPU), capable of simulating several million particles in three dimensions on a single GPU. The ISPH algorithm is implemented by converting the highly optimised open-source weakly-compressible SPH (WCSPH) code DualSPHysics to run ISPH on the GPU, combining it with the open-source linear algebra library ViennaCL for fast solutions of the pressure Poisson equation (PPE). Several challenges are addressed by this research: constructing a PPE matrix every timestep on the GPU for moving particles, optimising the limited GPU memory, and exploiting fast matrix solvers. The ISPH pressure projection algorithm is implemented as four separate stages, each with a particle sweep, including an algorithm for the population of the PPE matrix suitable for the GPU, and mixed-precision storage methods. An accurate and robust ISPH boundary condition ideal for parallel processing is also established by adapting an existing WCSPH boundary condition for ISPH. A variety of validation cases are presented: an impulsively started plate, incompressible flow around a moving square in a box, and dam breaks (2-D and 3-D), which demonstrate the accuracy, flexibility, and speed of the methodology. Fragmentation of the free surface is shown to influence the performance of matrix preconditioners and therefore the PPE matrix solution time. The Jacobi preconditioner demonstrates robustness and reliability in the presence of fragmented flows. For a dam-break simulation, GPU speedups of up to 10-18 times and 1.1-4.5 times are demonstrated compared to single-threaded and 16-threaded CPU run times, respectively.
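A minimal sketch (a toy grid Laplacian, not the DualSPHysics/ViennaCL code) of the pressure Poisson equation solve at the heart of ISPH: assemble a sparse operator, apply Jacobi (diagonal) preconditioning, the choice the paper found robust for fragmented flows, and solve with conjugate gradients.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def lap1d(n):
        # 1-D Dirichlet Laplacian stencil [-1, 2, -1]
        return sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

    n = 64
    L = sp.kron(sp.eye(n), lap1d(n)) + sp.kron(lap1d(n), sp.eye(n))  # 2-D operator
    rhs = np.random.rand(n * n)               # stand-in for the divergence source
    Minv = sp.diags(1.0 / L.diagonal())       # Jacobi preconditioner
    p, info = spla.cg(L, rhs, M=Minv)         # info == 0 on convergence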
GPU-based Branchless Distance-Driven Projection and Backprojection
Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong
2017-01-01
Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to implement on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behaviors can be eliminated by factorizing the DD operation into three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone-beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved a 137-fold speedup for forward projection and a 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. The GPU-based branchless DD method was evaluated using iterative reconstruction algorithms with both simulated and real datasets. It obtained images visually identical to those of the CPU reference algorithm. PMID:29333480
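A minimal sketch (a 1-D illustration, not the CT code) of the branchless factorization: replace the boundary-overlap inner loop with integrate, interpolate, differentiate, three uniform, branch-free passes that vectorize well on GPUs.

    import numpy as np

    def branchless_resample(values, src_edges, dst_edges):
        # 1) integration: cumulative sum gives the integral signal at src edges
        cum = np.concatenate([[0.0], np.cumsum(values)])
        # 2) linear interpolation of the integral at destination cell edges
        cum_at_dst = np.interp(dst_edges, src_edges, cum)
        # 3) differentiation recovers per-destination-cell sums
        return np.diff(cum_at_dst)

    vox = np.random.rand(16)                  # one row of voxel values
    src = np.arange(17.0)                     # voxel boundary positions
    dst = np.linspace(0.0, 16.0, 9)           # detector cell boundaries
    det = branchless_resample(vox, src, dst)  # branch-free projection of one row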
cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing.
Takeuchi, Toshiki; Yamada, Atsuo; Aoki, Takashi; Nishimura, Kunihiro
2016-01-01
Next-generation sequencing can determine DNA bases, and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format or its compressed binary version (BAM). SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and executes quickly. However, SAMtools requires additional implementation effort to be used in parallel, for example with OpenMP (Open Multi-Processing) libraries. For the accumulating volume of next-generation sequencing data, a simple parallelization program that can support cloud and PC cluster environments is required. We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., Windows, Linux, Mac OS X) with Clojure. Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.
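A minimal sketch (not cljam's Clojure internals) of the kind of chunked parallelism such a library provides: split SAM records across worker processes and combine per-chunk statistics. SAM is tab-separated text; the second column is the bitwise FLAG, and bit 0x4 marks an unmapped read.

    from multiprocessing import Pool

    def count_mapped(lines):
        n = 0
        for line in lines:
            if line.startswith("@"):            # skip header lines
                continue
            flag = int(line.split("\t")[1])     # column 2 of a SAM record is FLAG
            if not flag & 0x4:                  # 0x4 set means "unmapped"
                n += 1
        return n

    def mapped_reads(sam_path, chunk=100_000, procs=4):
        with open(sam_path) as f:
            lines = f.readlines()
        chunks = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
        with Pool(procs) as pool:
            return sum(pool.map(count_mapped, chunks))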
Real-time FPGA architectures for computer vision
NASA Astrophysics Data System (ADS)
Arias-Estrada, Miguel; Torres-Huitzil, Cesar
2000-03-01
This paper presents an architecture for real-time generic convolution of a mask and an image. The architecture is intended for fast low-level image processing. The FPGA-based architecture takes advantage of the availability of registers in FPGAs to implement an efficient and compact module to process the convolutions. The architecture is designed to minimize the number of accesses to the image memory and is based on parallel modules with internal pipeline operation in order to improve its performance. The architecture is prototyped in an FPGA, but it can be implemented on a dedicated VLSI device to reach higher clock frequencies. Complexity issues, FPGA resource utilization, FPGA limitations, and real-time performance are discussed. Some results are presented and discussed.
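A minimal sketch (a software model, not the FPGA design) of the line-buffer idea behind such architectures: keep the last K image rows in registers/on-chip memory so each pixel is fetched from external image memory exactly once while a K x K window slides across the image.

    import numpy as np
    from collections import deque

    def convolve_streaming(image, kernel):
        K = kernel.shape[0]
        h, w = image.shape
        rows = deque(maxlen=K)                    # the "line buffers"
        out = np.zeros((h - K + 1, w - K + 1))
        for y in range(h):
            rows.append(image[y])                 # one new row fetched per step
            if len(rows) == K:
                window_rows = np.stack(rows)
                for x in range(w - K + 1):
                    # sliding-window multiply-accumulate over the K x K mask
                    out[y - K + 1, x] = (window_rows[:, x:x + K] * kernel).sum()
        return out

    img = np.random.rand(32, 32)
    res = convolve_streaming(img, np.ones((3, 3)) / 9.0)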
Scalable Static and Dynamic Community Detection Using Grappolo
DOE Office of Scientific and Technical Information (OSTI.GOV)
Halappanavar, Mahantesh; Lu, Hao; Kalyanaraman, Anantharaman
Graph clustering, popularly known as community detection, is a fundamental kernel for several applications of relevance to the Defense Advanced Research Projects Agency's (DARPA) Hierarchical Identify Verify Exploit (HIVE) Program. Clusters or communities represent natural divisions within a network that are densely connected within a cluster and sparsely connected to the rest of the network. The need to compute clusterings on large-scale data necessitates the development of efficient algorithms that can exploit modern architectures that are fundamentally parallel in nature. However, due to their irregular and inherently sequential nature, many of the current algorithms for community detection are challenging to parallelize. In response to the HIVE Graph Challenge, we present several parallelization heuristics for fast community detection using the Louvain method as the serial template. We implement all the heuristics in a software library called Grappolo. Using the inputs from the HIVE Challenge, we demonstrate superior performance and high-quality solutions based on four parallelization heuristics. We use Grappolo on static graphs as the first step towards community detection on streaming graphs.
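A minimal sketch (serial reference logic, not Grappolo's parallel heuristics) of the Louvain method's local-move phase that the library uses as its template: each vertex greedily joins the neighbouring community with the largest modularity gain, repeated until no vertex moves.

    def louvain_pass(adj, m):
        # adj: {u: {v: w}} undirected weighted adjacency; m: total edge weight
        comm = {u: u for u in adj}                   # each vertex starts alone
        deg = {u: sum(adj[u].values()) for u in adj}
        tot = dict(deg)                              # total degree per community
        moved = True
        while moved:
            moved = False
            for u in adj:
                cu = comm[u]
                tot[cu] -= deg[u]                    # take u out of its community
                w_to = {}                            # weight from u to each community
                for v, w in adj[u].items():
                    if v != u:
                        w_to[comm[v]] = w_to.get(comm[v], 0.0) + w
                # modularity gain of joining c: w_to[c] - deg[u]*tot[c]/(2m)
                best = cu
                best_gain = w_to.get(cu, 0.0) - deg[u] * tot[cu] / (2 * m)
                for c, w in w_to.items():
                    gain = w - deg[u] * tot[c] / (2 * m)
                    if gain > best_gain:
                        best, best_gain = c, gain
                tot[best] += deg[u]
                if best != cu:
                    comm[u] = best
                    moved = True
        return comm

    # a triangle plus a disconnected pair: two communities emerge
    adj = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0},
           2: {0: 1.0, 1: 1.0}, 3: {4: 1.0}, 4: {3: 1.0}}
    print(louvain_pass(adj, m=4.0))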
A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore Platform
Yang, Lin; Gong, Leiguang; Zhang, Hong; Nosher, John L.; Foran, David J.
2013-01-01
Point matching is crucial for many computer vision applications. Establishing the correspondence between a large number of data points is a computationally intensive process. Some point matching related applications, such as medical image registration, require real-time or near real-time performance when applied to critical clinical applications like image-assisted surgery. In this paper, we report a new multicore-platform-based parallel algorithm for fast point matching in the context of landmark-based medical image registration. We introduce a non-regular data partition algorithm which uses K-means clustering to group the landmarks based on the number of available processing cores, optimizing memory usage and data transfer. We have tested our method using the IBM Cell Broadband Engine (Cell/B.E.) platform. The results demonstrated a significant speedup over the sequential implementation. The proposed data partition and parallelization algorithm, though tested only on one multicore platform, is generic by design. Therefore the parallel algorithm can be extended to other computing platforms, as well as to other point matching related applications. PMID:24308014
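A minimal sketch (not the Cell/B.E. code) of the partitioning idea the paper describes: cluster the landmarks with K-means into one group per available core, then let each worker match points only within its own spatially compact group. Pairing the target points by the source grouping is a simplification assumed here for illustration.

    import numpy as np
    from multiprocessing import Pool
    from scipy.cluster.vq import kmeans2
    from scipy.spatial import cKDTree

    def match_group(args):
        src, dst = args                          # landmarks of one cluster
        d, idx = cKDTree(dst).query(src)         # nearest-neighbour correspondences
        return idx

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        src = rng.random((10_000, 3))                    # source landmarks
        dst = src + rng.normal(0, 0.01, src.shape)       # perturbed targets
        n_cores = 4
        _, grp = kmeans2(src, n_cores, minit="++", seed=0)   # one partition per core
        jobs = [(src[grp == g], dst[grp == g]) for g in range(n_cores)]
        with Pool(n_cores) as pool:
            matches = pool.map(match_group, jobs)        # groups matched in parallel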
On Parallelizing Single Dynamic Simulation Using HPC Techniques and APIs of Commercial Software
DOE Office of Scientific and Technical Information (OSTI.GOV)
Diao, Ruisheng; Jin, Shuangshuang; Howell, Frederic
Time-domain simulations are heavily used in today's planning and operation practices to assess power system transient stability and post-transient voltage/frequency profiles following severe contingencies to comply with industry standards. Because of the increased modeling complexity, it is several times slower than real time for state-of-the-art commercial packages to complete a dynamic simulation for a large-scale model. With the growing stochastic behavior introduced by emerging technologies, the power industry has seen a growing need for performing security assessment in real time. This paper presents a parallel implementation framework to speed up a single dynamic simulation by leveraging the existing stability model library in commercial tools through their application programming interfaces (APIs). Several high performance computing (HPC) techniques are explored, such as parallelizing the calculation of generator current injection, identifying fast linear solvers for the network solution, and parallelizing data outputs when interacting with APIs in the commercial package TSAT. The proposed method has been tested on a WECC planning base case with detailed synchronous generator models and exhibits outstanding scalable performance with sufficient accuracy.
Fast and Scalable Computation of the Forward and Inverse Discrete Periodic Radon Transform.
Carranza, Cesar; Llamocca, Daniel; Pattichis, Marios
2016-01-01
The discrete periodic Radon transform (DPRT) has been used extensively in applications that involve image reconstruction from projections. Beyond classic applications, the DPRT can also be used to compute fast convolutions while avoiding the floating-point arithmetic associated with the fast Fourier transform. Unfortunately, the use of the DPRT has been limited by the need to compute a large number of additions and by the large number of memory accesses required. This paper introduces a fast and scalable approach for computing the forward and inverse DPRT that is based on the use of: a parallel array of fixed-point adder trees; circular shift registers to remove the need for accessing external memory components when selecting the input data for the adder trees; an image block-based approach to DPRT computation that can fit the proposed architecture to the available resources; and fast transpositions that are computed in one or a few clock cycles and do not depend on the size of the input image. As a result, for an N x N image (N prime), the proposed approach can compute up to N^2 additions per clock cycle. Compared with previous approaches, the scalable approach provides the fastest known implementations for different amounts of computational resources. For example, for a 251 x 251 image, with approximately 25% fewer flip-flops than required for a systolic implementation, the scalable DPRT is computed 36 times faster. For the fastest case, we introduce optimized architectures that can compute the DPRT and its inverse in just 2N + ⌈log2 N⌉ + 1 and 2N + 3⌈log2 N⌉ + B + 2 clock cycles, respectively, where B is the number of bits used to represent each input pixel. On the other hand, the scalable DPRT approach requires more 1-bit additions than the systolic implementation, providing a tradeoff between speed and additional 1-bit additions. All of the proposed DPRT architectures were implemented in VHSIC Hardware Description Language (VHDL) and validated using a Field-Programmable Gate Array (FPGA) implementation.
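A minimal sketch (a software reference under one common convention, not the FPGA architecture) of the forward DPRT for a p x p image with p prime: p "slope" projections plus one row-sum projection, each computed with additions only.

    import numpy as np

    def dprt(f):
        p = f.shape[0]                          # p must be prime
        R = np.zeros((p + 1, p))
        j = np.arange(p)
        for m in range(p):                      # projections with slope m
            for d in range(p):
                R[m, d] = f[(d + m * j) % p, j].sum()
        R[p, :] = f.sum(axis=1)                 # the extra "horizontal" projection
        return R

    f = np.random.rand(7, 7)
    R = dprt(f)
    # sanity check: every projection sums to the total image mass
    assert np.allclose(R.sum(axis=1), f.sum())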
GPU Lossless Hyperspectral Data Compression System
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh I.; Keymeulen, Didier; Kiely, Aaron B.; Klimesh, Matthew A.
2014-01-01
Hyperspectral imaging systems onboard aircraft or spacecraft can acquire large amounts of data, putting a strain on limited downlink and storage resources. Onboard data compression can mitigate this problem but may require a system capable of a high throughput. In order to achieve a high throughput with a software compressor, a graphics processing unit (GPU) implementation of a compressor was developed targeting the current state-of-the-art GPUs from NVIDIA(R). The implementation is based on the fast lossless (FL) compression algorithm reported in "Fast Lossless Compression of Multispectral-Image Data" (NPO-42517), NASA Tech Briefs, Vol. 30, No. 8 (August 2006), page 26, which operates on hyperspectral data and achieves excellent compression performance while having low complexity. The FL compressor uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. The new Consultative Committee for Space Data Systems (CCSDS) Standard for Lossless Multispectral & Hyperspectral image compression (CCSDS 123) is based on the FL compressor. The software makes use of the highly parallel processing capability of GPUs to achieve a throughput at least six times higher than that of a software implementation running on a single-core CPU. This implementation provides a practical real-time solution for compression of data from airborne hyperspectral instruments.
Pteros 2.0: Evolution of the fast parallel molecular analysis library for C++ and python.
Yesylevskyy, Semen O
2015-07-15
Pteros is a high-performance open-source library for molecular modeling and analysis of molecular dynamics trajectories. Starting from version 2.0, Pteros is available for the C++ and Python programming languages with very similar interfaces. This makes it suitable for writing complex reusable programs in C++ and simple interactive scripts in Python alike. The new version improves the facilities for asynchronous trajectory reading and parallel execution of analysis tasks by introducing analysis plugins, which can be written in either C++ or Python in a completely uniform way. The high level of abstraction provided by analysis plugins greatly simplifies prototyping and implementation of complex analysis algorithms. Pteros is available for free under the Artistic License from http://sourceforge.net/projects/pteros/. © 2015 Wiley Periodicals, Inc.
Efficiently modeling neural networks on massively parallel computers
NASA Technical Reports Server (NTRS)
Farber, Robert M.
1993-01-01
Neural networks are a very useful tool for analyzing and modeling complex real-world systems. Applying neural network simulations to real-world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine-grain SIMD computers such as the CM-2 Connection Machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks on a SIMD (Single Instruction Multiple Data) parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000-processor CM-2 Connection Machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 Connection Machine. Our mapping has virtually no communications overhead, with the exception of the communications required for a global summation across the processors, which has sub-linear runtime growth on the order of O(log(number of processors)). We can efficiently model very large neural networks which have many neurons and interconnects, and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent-processor communications. This paper considers the simulation of only feed-forward neural networks, although the method is extendable to recurrent networks.
Fast Time and Space Parallel Algorithms for Solution of Parabolic Partial Differential Equations
NASA Technical Reports Server (NTRS)
Fijany, Amir
1993-01-01
In this paper, fast time- and space-parallel algorithms for the solution of linear parabolic PDEs are developed. It is shown that the seemingly strictly serial iterations of the time-stepping procedure for the solution of the problem can be completely decoupled.
A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG
NASA Astrophysics Data System (ADS)
Griffiths, M. K.; Fedun, V.; Erdélyi, R.
2015-03-01
Parallelization techniques have been exploited most successfully by the gaming/graphics industry with the adoption of graphical processing units (GPUs), possessing hundreds of processor cores. The opportunity has been recognized by the computational sciences and engineering communities, who have recently harnessed successfully the numerical performance of GPUs. For example, parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar, astrophysical and geophysical plasmas. Here, we describe the implementation of SMAUG, the Sheffield Magnetohydrodynamics Algorithm Using GPUs. SMAUG is a 1-3D MHD code capable of modelling magnetized and gravitationally stratified plasma. The objective of this paper is to present the numerical methods and techniques used for porting the code to this novel and highly parallel compute architecture. The methods employed are justified by the performance benchmarks and validation results demonstrating that the code successfully simulates the physics for a range of test scenarios including a full 3D realistic model of wave propagation in the solar atmosphere.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zuo, Wangda; McNeil, Andrew; Wetter, Michael
2013-05-23
Building designers are increasingly relying on complex fenestration systems to reduce energy consumed for lighting and HVAC in low-energy buildings. Radiance, a lighting simulation program, has been used to conduct daylighting simulations for complex fenestration systems. Depending on the configuration, the simulation can take hours or even days on a personal computer. This paper describes how to accelerate the matrix multiplication portion of a Radiance three-phase daylight simulation by conducting parallel computing on the heterogeneous hardware of a personal computer. The algorithm was optimized and the computational part was implemented in parallel using OpenCL. The speed of the new approach was evaluated using various daylighting simulation cases on a multicore central processing unit and a graphics processing unit. Based on the measurements and analysis of the time usage for the Radiance daylighting simulation, further speedups can be achieved by using fast I/O devices and storing the data in a binary format.
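A minimal sketch (assumed matrix sizes following common daylighting conventions, not the Radiance code) of the arithmetic core the paper parallelizes: in the three-phase method, sensor illuminance is a chain of matrix products, E = V T D s, repeated for every sky condition of the year.

    import numpy as np

    V = np.random.rand(200, 145)    # view matrix: sensors x window patches
    T = np.random.rand(145, 145)    # BSDF transmission matrix of the fenestration
    D = np.random.rand(145, 2306)   # daylight matrix: window patches x sky patches
    s = np.random.rand(2306, 8760)  # sky vector for each hour of the year
    E = V @ (T @ (D @ s))           # hourly illuminance at every sensor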
Synchronous parallel spatially resolved stochastic cluster dynamics
Dunn, Aaron; Dingreville, Rémi; Martínez, Enrique; ...
2016-04-23
In this work, a spatially resolved stochastic cluster dynamics (SRSCD) model for radiation damage accumulation in metals is implemented using a synchronous parallel kinetic Monte Carlo algorithm. The parallel algorithm is shown to significantly increase the size of representative volumes achievable in SRSCD simulations of radiation damage accumulation. Additionally, weak scaling performance of the method is tested in two cases: (1) an idealized case of Frenkel pair diffusion and annihilation, and (2) a characteristic example problem including defect cluster formation and growth in α-Fe. For the latter case, weak scaling is tested using both Frenkel pair and displacement cascade damage. To improve the scaling of simulations with cascade damage, an explicit cascade implantation scheme is developed for cases in which fast-moving defects are created in displacement cascades. For the first time, simulation of radiation damage accumulation in nanopolycrystals can be achieved with a three-dimensional rendition of the microstructure, allowing demonstration of the effect of grain size on defect accumulation in Frenkel-pair-irradiated α-Fe.
NASA Astrophysics Data System (ADS)
Zhang, Xiao-bo; Wang, Zhi-xue; Li, Jian-xin; Ma, Jian-hui; Li, Yang; Li, Yan-qiang
In order to realize the Bluetooth function and to allow information to be tracked effectively during production, vehicle Bluetooth hands-free devices need to download key parameters such as the Bluetooth address, CVC license and base plate number. The aim is therefore to find a simple and effective method to download parameters to each vehicle Bluetooth hands-free device, and to control and record the use of those parameters. In this paper, by means of a Bluetooth Serial Peripheral Interface programmer device, the parallel port is switched to SPI. The first step in downloading parameters is to simulate SPI with the parallel port: the parallel port is operated in accordance with the SPI timing to perform the SPI function. The next step is to implement the SPI data transmit/receive functions according to the selected programming parameters. Using the new method, downloading parameters is fast and accurate, and it fully meets the production requirements of vehicle Bluetooth hands-free devices. It has played a large role in the production line.
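A minimal sketch (hypothetical pin assignments and a stand-in port-write helper, not the production tool) of bit-banged SPI over a parallel port: drive the clock and data lines by writing the port's data register while respecting SPI mode-0 timing, in which data is sampled on the rising clock edge.

    CLK, MOSI = 0x01, 0x02                   # assumed bit positions on the data port

    def spi_write_byte(write_port, byte):
        # write_port(value) stands in for an outb() to the parallel-port register
        level = 0
        for i in range(7, -1, -1):           # shift out MSB first
            if (byte >> i) & 1:
                level |= MOSI
            else:
                level &= ~MOSI
            write_port(level & ~CLK)         # present the data bit with clock low
            write_port(level | CLK)          # rising edge: the device samples MOSI
        write_port(0)                        # return clock and data to idle

    waveform = []
    spi_write_byte(waveform.append, 0xA5)    # capture the pin states for inspection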
Parallel Navier-Stokes computations on shared and distributed memory architectures
NASA Technical Reports Server (NTRS)
Hayder, M. Ehtesham; Jayasimha, D. N.; Pillay, Sasi Kumar
1995-01-01
We study a high-order finite difference scheme to solve the time-accurate flow field of a jet using the compressible Navier-Stokes equations. As part of our ongoing efforts, we have implemented our numerical model on three parallel computing platforms to study the computational, communication, and scalability characteristics. The platforms chosen for this study are a cluster of workstations connected through fast networks (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and a distributed memory multiprocessor (the IBM SP1). Our focus in this study is on the LACE testbed. We present some results for the Cray YMP and the IBM SP1 mainly for comparison purposes. On the LACE testbed, we study: (1) the communication characteristics of Ethernet, FDDI, and the ALLNODE networks and (2) the overheads induced by the PVM message passing library used for parallelizing the application. We demonstrate that clustering of workstations is effective and has the potential to be computationally competitive with supercomputers at a fraction of the cost.
Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube
NASA Technical Reports Server (NTRS)
Joslin, Ronald D.; Zubair, Mohammad
1993-01-01
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can be parallelized effectively on a distributed-memory parallel machine. By increasing the number of processors, nearly ideal linear speedups are achieved with non-optimized routines; slower-than-linear speedups are achieved with optimized (machine-dependent library) routines. This slower-than-linear speedup results because the Fast Fourier Transform (FFT) routine dominates the computational cost and this routine achieves less than ideal speedups. However, with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise, wall-normal, and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single-processor time to complete a comparable simulation; however, it is estimated that a subgrid-scale model, which reduces the required number of grid points and becomes a large-eddy simulation (PSLES), would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.
Some fast elliptic solvers on parallel architectures and their complexities
NASA Technical Reports Server (NTRS)
Gallopoulos, E.; Saad, Y.
1989-01-01
The discretization of separable elliptic partial differential equations leads to linear systems with special block tridiagonal matrices. Several methods are known to solve these systems, the most general of which is the Block Cyclic Reduction (BCR) algorithm, which handles equations with nonconstant coefficients. A method was recently proposed to parallelize and vectorize BCR. In this paper, the mapping of BCR onto distributed memory architectures is discussed, and its complexity is compared with that of other approaches, including the Alternating-Direction method. A fast parallel solver is also described, based on an explicit formula for the solution, which has parallel computational complexity lower than that of parallel BCR.
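A minimal sketch (not from the paper) of the kind of explicit-formula fast elliptic solver this family includes: diagonalize the 5-point Poisson operator with fast sine transforms, divide by the known eigenvalues, and transform back, with every step data-parallel.

    import numpy as np
    from scipy.fft import dstn, idstn

    n = 127
    h = 1.0 / (n + 1)
    f = np.ones((n, n))                        # right-hand side, Dirichlet BCs

    k = np.arange(1, n + 1)
    lam = (2.0 - 2.0 * np.cos(k * np.pi / (n + 1))) / h**2   # 1-D eigenvalues
    F = dstn(f, type=1)                        # sine-transform both directions
    U = F / (lam[:, None] + lam[None, :])      # divide by the 2-D eigenvalues
    u = idstn(U, type=1)                       # inverse transform gives the solution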
Fast adaptive composite grid methods on distributed parallel architectures
NASA Technical Reports Server (NTRS)
Lemke, Max; Quinlan, Daniel
1992-01-01
The fast adaptive composite (FAC) grid method is compared with its asynchronous variant (AFAC) under a variety of conditions, including vectorization and parallelization. Results are given for distributed memory multiprocessor architectures (SUPRENUM, Intel iPSC/2 and iPSC/860). It is shown that the good performance of AFAC and its superiority over FAC in a parallel environment is a property of the algorithm and not dependent on peculiarities of any machine.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois -Henry
Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.
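A minimal sketch (not STRUMPACK itself) of the randomized sampling idea used for low-rank compression: multiply a block by a thin Gaussian matrix, orthonormalize the sample to capture the block's range, and keep A ≈ Q (Q^T A) as a low-rank factorization.

    import numpy as np

    def randomized_lowrank(A, rank, oversample=10):
        rng = np.random.default_rng(0)
        Y = A @ rng.standard_normal((A.shape[1], rank + oversample))  # sampling
        Q, _ = np.linalg.qr(Y)                 # orthonormal basis for range(A)
        B = Q.T @ A                            # small core matrix
        return Q, B

    # toy off-diagonal block with exact rank 8
    U = np.random.rand(300, 8)
    V = np.random.rand(8, 200)
    A = U @ V
    Q, B = randomized_lowrank(A, rank=8)
    print(np.linalg.norm(A - Q @ B) / np.linalg.norm(A))   # near machine precision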
Implementation of Adaptive Digital Controllers on Programmable Logic Devices
NASA Technical Reports Server (NTRS)
Gwaltney, David A.; King, Kenneth D.; Smith, Keary J.; Montenegro, Justino (Technical Monitor)
2002-01-01
Much has been made of the capabilities of FPGAs (Field Programmable Gate Arrays) in the hardware implementation of fast digital signal processing. Such capability also makes an FPGA a suitable platform for the digital implementation of closed-loop controllers. Other researchers have implemented a variety of closed-loop digital controllers on FPGAs. Some of these controllers include the widely used proportional-integral-derivative (PID) controller, state-space controllers, and neural network and fuzzy logic based controllers. There are myriad advantages to utilizing an FPGA for discrete-time control functions, which include the capability for reconfiguration when SRAM-based FPGAs are employed, fast parallel implementation of multiple control loops, and implementations that can meet space-level radiation tolerance requirements in a compact form factor. Generally, a software implementation on a DSP (Digital Signal Processor) or microcontroller is used to implement digital controllers. At Marshall Space Flight Center, the Control Electronics Group has been studying adaptive discrete-time control of motor-driven actuator systems using digital signal processor (DSP) devices. While small form factor, commercial DSP devices are now available with event capture, data conversion, pulse width modulated (PWM) outputs and communication peripherals, these devices are not currently available in designs and packages which meet space-level radiation requirements. In general, very few DSP devices are produced that are designed to meet any level of radiation tolerance or hardness. The goal of this effort is to create a fully digital, flight-ready controller design that utilizes an FPGA for implementation of signal conditioning for control feedback signals, generation of commands to the controlled system, and hardware insertion of adaptive control algorithm approaches. An alternative is required for compact implementation of such functionality to withstand the harsh environment encountered on spacecraft. Radiation-tolerant FPGAs are a feasible option for reaching this goal.
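A minimal sketch (a software model, not the MSFC FPGA design) of the second-order difference equation behind a digital PID controller, the structure that maps to a few multipliers, adders and registers in hardware: u[k] = u[k-1] + b0*e[k] + b1*e[k-1] + b2*e[k-2], with b0 = Kp + Ki*T + Kd/T, b1 = -Kp - 2*Kd/T, and b2 = Kd/T.

    def make_pid(Kp, Ki, Kd, T):
        b0 = Kp + Ki * T + Kd / T
        b1 = -Kp - 2.0 * Kd / T
        b2 = Kd / T
        e1 = e2 = u1 = 0.0                     # the delay "registers"
        def step(e):
            nonlocal e1, e2, u1
            u = u1 + b0 * e + b1 * e1 + b2 * e2    # one MAC pass per sample
            e2, e1, u1 = e1, e, u
            return u
        return step

    pid = make_pid(Kp=1.2, Ki=0.5, Kd=0.05, T=0.001)
    u = pid(0.3)                               # one control update per sample period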
A mixed finite difference/Galerkin method for three-dimensional Rayleigh-Benard convection
NASA Technical Reports Server (NTRS)
Buell, Jeffrey C.
1988-01-01
A fast and accurate numerical method, for nonlinear conservation equation systems whose solutions are periodic in two of the three spatial dimensions, is presently implemented for the case of Rayleigh-Benard convection between two rigid parallel plates in the parameter region where steady, three-dimensional convection is known to be stable. High-order streamfunctions enable the reduction of the system of five partial differential equations to a system of only three. Numerical experiments are presented which verify both the expected convergence rates and the absolute accuracy of the method.
A High-Performance Parallel Implementation of the Certified Reduced Basis Method
2010-12-15
The certified reduced basis (RB) method is considered from the point of view of model reduction under the "curse of dimensionality". Transient thermal conduction in a three-dimensional "Swiss cheese" geometry serves as the driving example; for this problem, the primal-dual RB method is shown to yield a very fast and accurate output approximation.
Sharp, G C; Kandasamy, N; Singh, H; Folkert, M
2007-10-07
This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup--up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
Dynamic grid refinement for partial differential equations on parallel computers
NASA Technical Reports Server (NTRS)
Mccormick, S.; Quinlan, D.
1989-01-01
The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems.
Evaluating the performance of the particle finite element method in parallel architectures
NASA Astrophysics Data System (ADS)
Gimenez, Juan M.; Nigro, Norberto M.; Idelsohn, Sergio R.
2014-05-01
This paper presents a high performance implementation of the particle-mesh based method called particle finite element method two (PFEM-2). It consists of a material-derivative-based formulation of the equations with a hybrid spatial discretization which uses an Eulerian mesh and Lagrangian particles. The main aim of PFEM-2 is to solve transport equations as fast as possible while keeping some level of accuracy. The method was found to be competitive with classical Eulerian alternatives for these targets, even in their range of optimal application. To evaluate the merits of the method on large simulations, it is imperative to use parallel environments. Parallel strategies for the Finite Element Method have been widely studied and many libraries can be used to solve the Eulerian stages of PFEM-2. However, Lagrangian stages, such as streamline integration, must be developed with the selected parallel strategy in mind. The main drawback of PFEM-2 is the large amount of memory needed, which limits the size of problems that can be solved on a single computer. Therefore, a distributed-memory implementation is urgently needed. Unlike a shared-memory approach, with domain decomposition the memory is automatically isolated, thus avoiding race conditions; however, new issues appear due to data distribution over the processes. Thus, a domain decomposition strategy for both particles and mesh is adopted, which minimizes the communication between processes. Finally, performance analyses running over multicore and multinode architectures are presented. The Courant-Friedrichs-Lewy number used influences the efficiency of the parallelization and, in some cases, a weighted partitioning can be used to improve the speed-up. However, the total CPU time for the cases presented is lower than that obtained when using classical Eulerian strategies.
Accelerated Adaptive MGS Phase Retrieval
NASA Technical Reports Server (NTRS)
Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang
2011-01-01
The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm that allows parallel processing of applications which apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited for this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time, by performing the matrix calculations on nVidia graphics cards. The graphics processing unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model called CUDA is used to transparently scale to the multicore parallelism of the hardware. This technology gives computationally intensive applications access to the processing power of the nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphics cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.
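To make the FFT-dominated workload concrete, here is a minimal NumPy sketch of the classic Gerchberg-Saxton iteration on which MGS builds; the single pupil/focal amplitude-constraint pair and all names are illustrative assumptions, not the actual MGS or AAMGS implementation.

```python
import numpy as np

def gs_phase_retrieval(pupil_amp, focal_amp, n_iter=100, seed=0):
    """Classic Gerchberg-Saxton loop: alternate known-amplitude constraints
    in the pupil and focal planes; two FFTs per iteration dominate the cost."""
    rng = np.random.default_rng(seed)
    field = pupil_amp * np.exp(1j * rng.uniform(0, 2 * np.pi, pupil_amp.shape))
    for _ in range(n_iter):
        focal = np.fft.fft2(field)
        focal = focal_amp * np.exp(1j * np.angle(focal))  # impose measured focal amplitude
        field = np.fft.ifft2(focal)
        field = pupil_amp * np.exp(1j * np.angle(field))  # impose known pupil amplitude
    return np.angle(field)  # estimated exit-pupil phase (wavefront error)
```

Since every step is an elementwise operation or a 2D FFT, the loop maps directly onto batched GPU kernels, which is what makes stream processing attractive here.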
A Low-Power High-Speed Smart Sensor Design for Space Exploration Missions
NASA Technical Reports Server (NTRS)
Fang, Wai-Chi
1997-01-01
A low-power high-speed smart sensor system based on a large format active pixel sensor (APS) integrated with a programmable neural processor for space exploration missions is presented. The concept of building an advanced smart sensing system is demonstrated by a system-level microchip design that is composed of an APS sensor, a programmable neural processor, and an embedded microprocessor in a SOI CMOS technology. This ultra-fast smart sensor system-on-a-chip design mimics what is inherent in biological vision systems. Moreover, it is programmable and capable of performing ultra-fast machine vision processing at all levels, such as image acquisition, image fusion, image analysis, scene interpretation, and control functions. The system provides about one tera-operation-per-second of computing power, a two-order-of-magnitude increase over that of state-of-the-art microcomputers. Its high performance is due to massively parallel computing structures, high data throughput rates, fast learning capabilities, and advanced VLSI system-on-a-chip implementation.
A data distributed parallel algorithm for ray-traced volume rendering
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu; Painter, James S.; Hansen, Charles D.; Krogh, Michael F.
1993-01-01
This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5, and networked workstations. This algorithm distributes both the data and the computations to individual processing units to achieve fast, high-quality rendering of high-resolution data. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolume concurrently. No communication between processing units is needed during this locally ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Test results on both the CM-5 and a group of networked workstations demonstrate the practicality of our rendering algorithm and compositing method.
NASA Technical Reports Server (NTRS)
Farhat, Charbel
1998-01-01
In this grant, we have proposed a three-year research effort focused on developing High Performance Computation and Communication (HPCC) methodologies for structural analysis on parallel processors and clusters of workstations, with emphasis on reducing the structural design cycle time. Besides consolidating and further improving the FETI solver technology to address plate and shell structures, we have proposed to tackle the following design related issues: (a) parallel coupling and assembly of independently designed and analyzed three-dimensional substructures with non-matching interfaces, (b) fast and smart parallel re-analysis of a given structure after it has undergone design modifications, (c) parallel evaluation of sensitivity operators (derivatives) for design optimization, and (d) fast parallel analysis of mildly nonlinear structures. While our proposal was accepted, support was provided only for one year.
An object-oriented approach for parallel self adaptive mesh refinement on block structured grids
NASA Technical Reports Server (NTRS)
Lemke, Max; Witsch, Kristian; Quinlan, Daniel
1993-01-01
Self-adaptive mesh refinement dynamically matches the computational demands of a solver for partial differential equations to the activity in the application's domain. In this paper we present two C++ class libraries, P++ and AMR++, which significantly simplify the development of sophisticated adaptive mesh refinement codes on (massively) parallel distributed memory architectures. The development is based on our previous research in this area. The C++ class libraries provide abstractions to separate the issues of developing parallel adaptive mesh refinement applications into those of parallelism, abstracted by P++, and adaptive mesh refinement, abstracted by AMR++. P++ is a parallel array class library permitting efficient development of architecture-independent codes for structured grid applications, and AMR++ provides support for self-adaptive mesh refinement on block-structured grids of rectangular non-overlapping blocks. Using these libraries, application programmers need primarily specify the serial single-grid application, obtaining the parallel and self-adaptive mesh refinement code with minimal effort. Initial results for simple singular perturbation problems, solved by self-adaptive multilevel techniques (FAC, AFAC) implemented on the basis of prototypes of the P++/AMR++ environment, are presented. Singular perturbation problems frequently arise in large applications, e.g. in the area of computational fluid dynamics. They usually have solutions with layers which require adaptive mesh refinement and fast basic solvers in order to be resolved efficiently.
NASA Astrophysics Data System (ADS)
Plaza, Antonio; Chang, Chein-I.; Plaza, Javier; Valencia, David
2006-05-01
The incorporation of hyperspectral sensors aboard airborne/satellite platforms is currently producing a nearly continual stream of multidimensional image data, and this high data volume has introduced new processing challenges. The price paid for the wealth of spatial and spectral information available from hyperspectral sensors is the enormous amount of data that they generate. In several applications, however, the desired information must be calculated quickly enough for practical use. High computing performance of algorithm analysis is particularly important in homeland defense and security applications, in which swift decisions often involve detection of (sub-pixel) military targets (including hostile weaponry, camouflage, concealment, and decoys) or chemical/biological agents. In order to speed up the computational performance of hyperspectral imaging algorithms, this paper develops several fast parallel data processing techniques spanning four classes of algorithms: (1) unsupervised classification, (2) spectral unmixing, (3) automatic target recognition, and (4) onboard data compression. A massively parallel Beowulf cluster (Thunderhead) at NASA's Goddard Space Flight Center in Maryland is used to measure the parallel performance of the proposed algorithms. In order to explore the viability of developing onboard, real-time hyperspectral data compression algorithms, a Xilinx Virtex-II field programmable gate array (FPGA) is also used in experiments. Our quantitative and comparative assessment of parallel techniques and strategies may help image analysts in the selection of parallel hyperspectral algorithms for specific applications.
Fast, Massively Parallel Data Processors
NASA Technical Reports Server (NTRS)
Heaton, Robert A.; Blevins, Donald W.; Davis, ED
1994-01-01
A proposed fast, massively parallel data processor contains an 8x16 array of processing elements with an efficient interconnection scheme and options for flexible local control. Processing elements communicate with each other on an "X" interconnection grid and with external memory via a high-capacity input/output bus. This approach to conditional operation nearly doubles the speed of various arithmetic operations.
Parallel peak pruning for scalable SMP contour tree computation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Carr, Hamish A.; Weber, Gunther H.; Sewell, Christopher M.
As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to aid human understanding and to reduce demands on disk storage via in situ analysis. Trends in the architecture of high performance computing systems necessitate analysis algorithms that make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. In this paper, we report the first shared-memory SMP algorithm for fully parallel contour tree computation, with formal guarantees of O(lg n lg t) parallel steps and O(n lg n) work, and implementations with up to 10x parallel speedup in OpenMP and up to 50x speedup in NVIDIA Thrust.
A Domain Decomposition Parallelization of the Fast Marching Method
NASA Technical Reports Server (NTRS)
Herrmann, M.
2003-01-01
In this paper, the first domain decomposition parallelization of the Fast Marching Method for level sets has been presented. Parallel speedup has been demonstrated in both the optimal and non-optimal domain decomposition case. The parallel performance of the proposed method is strongly dependent on load balancing separately the number of nodes on each side of the interface. A load imbalance of nodes on either side of the domain leads to an increase in communication and rollback operations. Furthermore, the amount of inter-domain communication can be reduced by aligning the inter-domain boundaries with the interface normal vectors. In the case of optimal load balancing and aligned inter-domain boundaries, the proposed parallel FMM algorithm is highly efficient, reaching efficiency factors of up to 0.98. Future work will focus on the extension of the proposed parallel algorithm to higher order accuracy. Also, to further enhance parallel performance, the coupling of the domain decomposition parallelization to the G(sub 0)-based parallelization will be investigated.
Fast neural net simulation with a DSP processor array.
Muller, U A; Gunzinger, A; Guggenbuhl, W
1995-01-01
This paper describes the implementation of a fast neural net simulator on a novel parallel distributed-memory computer. A 60-processor system, named MUSIC (multiprocessor system with intelligent communication), is operational and runs the backpropagation algorithm at a speed of 330 million connection updates per second (continuous weight update) using 32-bit floating-point precision. This is equal to 1.4 Gflops sustained performance. The complete system, with 3.8 Gflops peak performance, consumes less than 800 W of electrical power and fits into a 19-inch rack. While reaching the speed of modern supercomputers, MUSIC can still be used as a personal desktop computer at a researcher's own disposal. In neural net simulation, this gives a single user computing performance that was previously unthinkable. The system's real-time interfaces make it especially useful for embedded applications.
Real-time multi-DSP control of three-phase current-source unity power factor PWM rectifier
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xiao Wang; Boon-Teck Ooi
1993-07-01
The design of a real-time multi-DSP controller for a high-quality six-valve three-phase current-source unity power factor PWM rectifier is discussed in this paper. With the decoupler preprocessor and the dynamic trilogic PWM trigger scheme, each of the three input currents can be controlled independently. Based on the a-b-c frame system model and the fast parallel computer control, the pole-placement control method is implemented successfully to achieve fast response in the ac currents. The low-frequency resonance in the ac filter L-C networks has been damped effectively. The experimental results are obtained from a 1-kVA bipolar transistor current-source PWM rectifier with a real-time controller using three TMS320C25 DSPs.
Zhang, Xiaodong; Tong, Frank; Li, Chun-Xia; Yan, Yumei; Nair, Govind; Nagaoka, Tsukasa; Tanaka, Yoji; Zola, Stuart; Howell, Leonard
2014-04-01
Many MRI parameters have been explored and have demonstrated the capability or potential to evaluate acute stroke injury, providing anatomical, microstructural, functional, or neurochemical information for diagnostic purposes and therapeutic development. However, the application of the multiparameter MRI approach is hindered in the clinic by the very limited time window after a stroke insult. Parallel imaging techniques can accelerate MRI data acquisition dramatically, have been incorporated into modern clinical scanners, and are increasingly applied for various diagnostic purposes. In the present study, a fast multiparameter MRI approach, including structural T1-weighted imaging (T1W), T2-weighted imaging (T2W), diffusion tensor imaging (DTI), T2-mapping, proton magnetic resonance spectroscopy, cerebral blood flow (CBF), and magnetization transfer (MT) imaging, was implemented and optimized for assessing acute stroke injury on a 3T clinical scanner. A macaque model of transient ischemic stroke induced by a minimally interventional approach was utilized to evaluate the multiparameter MRI approach. The preliminary results indicate that the surgical procedure successfully induced ischemic occlusion in the cortex and/or subcortex in adult macaque monkeys (n=4). Application of the parallel imaging technique substantially reduced the scanning duration of most MRI data acquisitions, allowing for fast and repeated evaluation of acute stroke injury. Hence, the use of the multiparameter MRI approach with up to five quantitative measures can provide significant advantages in preclinical or clinical studies of stroke.
Comparison of the different approaches to generate holograms from data acquired with a Kinect sensor
NASA Astrophysics Data System (ADS)
Kang, Ji-Hoon; Leportier, Thibault; Ju, Byeong-Kwon; Song, Jin Dong; Lee, Kwang-Hoon; Park, Min-Chul
2017-05-01
Data of real scenes acquired in real time with a Kinect sensor can be processed with different approaches to generate a hologram. 3D models can be generated from a point cloud or a mesh representation. The advantage of the point cloud approach is that the computation process is well established, since it involves only diffraction and propagation of point sources between parallel planes. On the other hand, the mesh representation enables a reduction in the number of elements necessary to represent the object. Then, even though the computation time for the contribution of a single element increases compared to that of a simple point, the total computation time can be reduced significantly. However, the algorithm is more complex since propagation of elemental polygons between non-parallel planes must be implemented. Finally, since a depth map of the scene is acquired at the same time as the intensity image, a depth-layer approach can also be adopted. This technique is appropriate for fast computation since propagation of an optical wavefront from one plane to another can be handled efficiently with the fast Fourier transform. Fast computation with the depth-layer approach is convenient for real-time applications, but the point cloud method is more appropriate when high resolution is needed. In this study, since the Kinect can be used to obtain both a point cloud and a depth map, we examine the different approaches that can be adopted for hologram computation and compare their performance.
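In the depth-layer approach, each layer is carried to the hologram plane by a plane-to-plane FFT propagator. Below is a minimal NumPy sketch of angular-spectrum propagation and layer summation; the sampling, function names, and the evanescent-wave cut-off are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def angular_spectrum(field, wavelength, dx, z):
    """Propagate a complex field a distance z between parallel planes."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=dx)
    fy = np.fft.fftfreq(ny, d=dx)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    prop = arg > 0                                   # propagating components only
    kz = 2.0 * np.pi * np.sqrt(np.where(prop, arg, 0.0))
    H = np.where(prop, np.exp(1j * kz * z), 0.0)     # drop evanescent components
    return np.fft.ifft2(np.fft.fft2(field) * H)

def hologram_from_layers(layers, depths, wavelength, dx):
    """Sum the contributions of each depth layer at the hologram plane."""
    return sum(angular_spectrum(f, wavelength, dx, z)
               for f, z in zip(layers, depths))
```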
Ting, Samuel T; Ahmad, Rizwan; Jin, Ning; Craft, Jason; Serafim da Silveira, Juliana; Xue, Hui; Simonetti, Orlando P
2017-04-01
Sparsity-promoting regularizers can enable stable recovery of highly undersampled magnetic resonance imaging (MRI), promising to improve the clinical utility of challenging applications. However, lengthy computation time limits the clinical use of these methods, especially for dynamic MRI with its large corpus of spatiotemporal data. Here, we present a holistic framework that utilizes the balanced sparse model for compressive sensing and parallel computing to reduce the computation time of cardiac MRI recovery methods. We propose a fast, iterative soft-thresholding method to solve the resulting ℓ1-regularized least squares problem. In addition, our approach utilizes a parallel computing environment that is fully integrated with the MRI acquisition software. The methodology is applied to two formulations of the multichannel MRI problem: image-based recovery and k-space-based recovery. Using measured MRI data, we show that, for a 224 × 144 image series with 48 frames, the proposed k-space-based approach achieves a mean reconstruction time of 2.35 min, a 24-fold improvement compared to a reconstruction time of 55.5 min for the nonlinear conjugate gradient method, and the proposed image-based approach achieves a mean reconstruction time of 13.8 s. Our approach can be utilized to achieve fast reconstruction of large MRI datasets, thereby increasing the clinical utility of reconstruction techniques based on compressed sensing. Magn Reson Med 77:1505-1515, 2017. © 2016 International Society for Magnetic Resonance in Medicine.
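A minimal, real-valued NumPy sketch of iterative soft-thresholding for the ℓ1-regularized least-squares problem may clarify the core recursion; in the actual MRI setting A would be an undersampled multichannel Fourier operator and the thresholding would act on sparse-transform coefficients, so all names here are assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, n_iter=200):
    """Minimize 0.5*||A x - b||^2 + lam*||x||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # step <= 1/L, L = Lipschitz const. of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)            # gradient of the data-fidelity term
        x = soft_threshold(x - step * grad, step * lam)
    return x
```

Each iteration costs one forward and one adjoint application of A plus an elementwise shrink, which is why both operations parallelize so naturally.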
A flexible algorithm for calculating pair interactions on SIMD architectures
NASA Astrophysics Data System (ADS)
Páll, Szilárd; Hess, Berk
2013-12-01
Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces.
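The cluster-pair idea can be illustrated with a NumPy-vectorized sketch: all M×M interactions between two M-particle clusters are evaluated in one shot, which is precisely the data layout an M-wide SIMD unit consumes. The Lennard-Jones form and all names are illustrative assumptions, not the authors' kernel.

```python
import numpy as np

def cluster_pair_lj(xi, xj, cutoff, eps, sig):
    """Forces of cluster j on cluster i (both of shape (M, 3));
    mirrors an M-wide SIMD kernel evaluating all M*M pairs at once."""
    d = xi[:, None, :] - xj[None, :, :]            # all M*M pair vectors
    r2 = np.einsum('mnk,mnk->mn', d, d)
    mask = (r2 < cutoff**2) & (r2 > 0.0)           # cut-off, and skip self-pairs
    r2s = np.where(mask, r2, 1.0)                  # safe denominator
    s2 = sig**2 / r2s
    s6 = s2 * s2 * s2
    fmag = np.where(mask, 24.0 * eps * (2.0 * s6 * s6 - s6) / r2s, 0.0)
    return (fmag[:, :, None] * d).sum(axis=1)      # force on each particle of cluster i
```

Growing or shrinking M is the knob that adapts the same kernel to SIMD units of different widths, which is the flexibility the abstract emphasizes.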
A FPGA Implementation of the CAR-FAC Cochlear Model.
Xu, Ying; Thakur, Chetan S; Singh, Ram K; Hamilton, Tara Julia; Wang, Runchun M; van Schaik, André
2018-01-01
This paper presents a digital implementation of the Cascade of Asymmetric Resonators with Fast-Acting Compression (CAR-FAC) cochlear model. The CAR part simulates the basilar membrane's (BM) response to sound. The FAC part models the outer hair cell (OHC), the inner hair cell (IHC), and the medial olivocochlear efferent system functions. The FAC feeds back to the CAR by moving the poles and zeros of the CAR resonators automatically. We have implemented a 70-section, 44.1 kHz sampling rate CAR-FAC system on an Altera Cyclone V Field Programmable Gate Array (FPGA) with 18% ALM utilization by using time-multiplexing and pipeline parallelizing techniques and present measurement results here. The fully digital reconfigurable CAR-FAC system is stable, scalable, easy to use, and provides an excellent input stage to more complex machine hearing tasks such as sound localization, sound segregation, speech recognition, and so on.
A practical implementation of multi-frequency widefield frequency-domain FLIM
Chen, Hongtao
2013-01-01
Widefield frequency-domain fluorescence lifetime imaging microscopy (FD-FLIM) is a fast and accurate method to measure fluorescence lifetimes, especially in kinetic studies in biomedical research. However, the small range of modulation frequencies available in commercial instruments limits the applications of this technique. Here we describe a practical implementation of multi-frequency widefield FD-FLIM using a pulsed supercontinuum laser and a direct digital synthesizer. In this instrument we use a pulse to modulate the image intensifier rather than the more conventional sine wave modulation. This allows parallel multi-frequency FLIM measurement using the Fast Fourier Transform and the cross-correlation technique, which permits precise and simultaneous isolation of individual frequencies. In addition, the pulse modulation at the cathode of the image intensifier restores the optical resolution lost to the defocusing effect that occurs when the voltage at the cathode is sinusoidally modulated. Furthermore, in our implementation of this technique, data can be graphically analyzed by the phasor method while they are acquired, which allows easy fit-free lifetime analysis of FLIM images. Our measurements of standard fluorescent samples and a Förster resonance energy transfer pair demonstrate that the widefield multi-frequency FLIM system is a valuable and simple tool in fluorescence imaging studies. PMID:23296945
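Per pixel, the parallel multi-frequency analysis amounts to reading harmonics off the FFT of the recorded intensity sequence. A minimal NumPy sketch follows; the sign convention, names, and the absence of a reference-lifetime calibration step are simplifying assumptions.

```python
import numpy as np

def fd_flim_harmonics(stack, n_harmonics=3):
    """stack: (T, Y, X) intensities over one fundamental modulation period.
    Returns per-pixel (phase, modulation) at each harmonic."""
    F = np.fft.rfft(stack, axis=0)
    dc = F[0].real + 1e-12          # avoid division by zero in dark pixels
    results = []
    for h in range(1, n_harmonics + 1):
        phase = np.angle(F[h])
        mod = 2.0 * np.abs(F[h]) / dc
        results.append((phase, mod))
    return results

# single-exponential lifetime estimate at angular frequency w:
# tau_phase = tan(phase) / w
```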
NASA Astrophysics Data System (ADS)
Iwasawa, Masaki; Tanikawa, Ataru; Hosono, Natsuki; Nitadori, Keigo; Muranushi, Takayuki; Makino, Junichiro
2016-08-01
We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application-development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each computing node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes-Hut tree algorithm or the Fast Multipole Method should be used in the case of long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are necessary for efficient parallel execution of particle-based simulations as "templates," which are independent of the actual data structure of particles and the functional form of the particle-particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N^2) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum calculation time per timestep is in the range of 30 ms (N = 10^7) to 300 ms (N = 10^9). These are currently limited by the time for the calculation of the domain decomposition and communication necessary for the interaction calculation. We discuss how we can overcome these bottlenecks.
libSRES: a C library for stochastic ranking evolution strategy for parameter estimation.
Ji, Xinglai; Xu, Ying
2006-01-01
Estimation of kinetic parameters in a biochemical pathway or network represents a common problem in systems studies of biological processes. We have implemented a C library, named libSRES, to facilitate fast implementation of computer software for the study of non-linear biochemical pathways. This library implements a (mu, lambda)-ES evolutionary optimization algorithm that uses stochastic ranking as the constraint handling technique. Considering the amount of computing time that solving a parameter-estimation problem might require, an MPI version of libSRES is provided for parallel implementation, along with a simple user interface. libSRES is freely available and can be used directly in any C program as a library function. We have extensively tested the performance of libSRES on various pathway parameter-estimation problems and found its performance to be satisfactory. The source code (in C) is free for academic users at http://csbl.bmb.uga.edu/~jix/science/libSRES/
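The stochastic-ranking step at the heart of SRES is a probabilistic bubble sort that trades off objective value against constraint violation. A minimal Python sketch of that ranking (libSRES itself is C; the names here are illustrative):

```python
import random

def stochastic_rank(f, phi, pf=0.45):
    """Stochastic ranking (Runarsson & Yao): bubble-sort-like ranking that
    compares by objective f with probability pf (or whenever both individuals
    are feasible), and by constraint violation phi otherwise.
    Returns indices sorted from best to worst."""
    n = len(f)
    idx = list(range(n))
    for _ in range(n):
        swapped = False
        for i in range(n - 1):
            a, b = idx[i], idx[i + 1]
            use_f = (phi[a] == 0 and phi[b] == 0) or (random.random() < pf)
            if (f[a] > f[b]) if use_f else (phi[a] > phi[b]):
                idx[i], idx[i + 1] = b, a
                swapped = True
        if not swapped:
            break
    return idx
```

Because each candidate's fitness evaluation is independent, the (mu, lambda) population evaluates trivially in parallel, which is what the MPI version exploits.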
NASA Astrophysics Data System (ADS)
Wang, Chenxi; Yang, Ping; Nasiri, Shaima L.; Platnick, Steven; Baum, Bryan A.; Heidinger, Andrew K.; Liu, Xu
2013-02-01
A computationally efficient radiative transfer model (RTM) for calculating visible (VIS) through shortwave infrared (SWIR) reflectances is developed for use in satellite and airborne cloud property retrievals. The full radiative transfer equation (RTE) for combinations of cloud, aerosol, and molecular layers is solved approximately by using six independent RTEs that assume the plane-parallel approximation along with a single-scattering approximation for Rayleigh scattering. Each of the six RTEs can be solved analytically if the bidirectional reflectance/transmittance distribution functions (BRDF/BTDF) of the cloud/aerosol layers are known. The adding/doubling (AD) algorithm is employed to account for overlapped cloud/aerosol layers and non-Lambertian surfaces. Two approaches are used to mitigate the significant computational burden of the AD algorithm. First, the BRDF and BTDF of single cloud/aerosol layers are pre-computed using the discrete ordinates radiative transfer program (DISORT) implemented with 128 streams, and second, the required integral in the AD algorithm is numerically implemented on a twisted icosahedral mesh. A concise surface BRDF simulator associated with the MODIS land surface product (MCD43) is merged into a fast RTM to accurately account for non-isotropic surface reflectance. The resulting fast RTM is evaluated with respect to its computational accuracy and efficiency. The simulation bias between DISORT and the fast RTM is large (e.g., relative error >5%) only when both the solar zenith angle (SZA) and the viewing zenith angle (VZA) are large (i.e., SZA>45° and VZA>70°). For general situations, i.e., cloud/aerosol layers above a non-Lambertian surface, the fast RTM calculation rate is faster than that of the 128-stream DISORT by approximately two orders of magnitude.
Real-time field programmable gate array architecture for computer vision
NASA Astrophysics Data System (ADS)
Arias-Estrada, Miguel; Torres-Huitzil, Cesar
2001-01-01
This paper presents an architecture for real-time generic convolution of a mask and an image. The architecture is intended for fast low-level image processing. The field programmable gate array (FPGA)-based architecture takes advantage of the availability of registers in FPGAs to implement an efficient and compact module to process the convolutions. The architecture is designed to minimize the number of accesses to the image memory and is based on parallel modules with internal pipeline operation in order to improve its performance. The architecture is prototyped in an FPGA, but it can be implemented on dedicated very-large-scale integration (VLSI) devices to reach higher clock frequencies. Complexity issues, FPGA resource utilization, FPGA limitations, and real-time performance are discussed. Some results are presented and discussed.
NASA Astrophysics Data System (ADS)
Ouyang, Bo; Shang, Weiwei
2016-03-01
The tension distribution of cable-driven parallel manipulators (CDPMs) with redundant cables has infinitely many solutions. A rapid optimization method for determining the optimal tension distribution is presented. The new optimization method is primarily based on the geometric properties of a polyhedron and convex analysis. The computational efficiency of the optimization method is improved by the designed projection algorithm, and a fast algorithm is proposed to determine which two of the lines intersect at the optimal point. Moreover, a method for avoiding operating points on the lower tension limit is developed. Simulation experiments are implemented on a six-degree-of-freedom (6-DOF) CDPM with eight cables, and the results indicate that the new method is one order of magnitude faster than the standard simplex method. The optimal tension distribution is thus established rapidly, in real time, by the proposed method.
Raytracing and Direct-Drive Targets
NASA Astrophysics Data System (ADS)
Schmitt, Andrew J.; Bates, Jason; Fyfe, David; Eimerl, David
2013-10-01
Accurate simulation of the effects of laser imprinting and drive asymmetries in directly driven targets requires the ability to distinguish between raytrace noise and the intensity structure produced by the spatial and temporal incoherence of optical smoothing. We have developed and implemented a smoother raytrace algorithm for our MPI-parallel radiation hydrodynamics code, FAST3D. The underlying approach is to connect the rays into either sheets (in 2D) or volume-enclosing chunks (in 3D) so that the absorbed energy distribution continuously covers the propagation area illuminated by the laser. We describe the status of this work and show the different scalings encountered in 2D and 3D problems as the computational size, parallelization strategy, and number of rays are varied. Finally, we show results using the method in current NIKE experimental target simulations and in proposed symmetric and polar direct-drive target designs. Supported by US DoE/NNSA.
Solving Coupled Gross-Pitaevskii Equations on a Cluster of PlayStation 3 Computers
NASA Astrophysics Data System (ADS)
Edwards, Mark; Heward, Jeffrey; Clark, C. W.
2009-05-01
At Georgia Southern University we have constructed an 8+1-node cluster of Sony PlayStation 3 (PS3) computers with the intention of using this computing resource to solve problems related to the behavior of ultra-cold atoms in general, with a particular emphasis on studying Bose-Bose and Bose-Fermi mixtures confined in optical lattices. As a first project that uses this computing resource, we have implemented a parallel solver of the coupled time-dependent, one-dimensional Gross-Pitaevskii (TDGP) equations. These equations govern the behavior of dual-species bosonic mixtures. We chose the split-operator/FFT method to solve the coupled 1D TDGP equations. The fast Fourier transform component of this solver can be readily parallelized on the PS3 CPU, known as the Cell Broadband Engine (CellBE). Each CellBE chip contains a single 64-bit PowerPC Processor Element (PPE) and eight "Synergistic Processor Elements" (SPEs). We report on this algorithm and compare its performance to a non-parallel solver as applied to modeling evaporative cooling in dual-species bosonic mixtures.
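A minimal NumPy sketch of one second-order split-operator/FFT step for the coupled 1D TDGP equations; the potential, coupling constants, and units (hbar, m) are illustrative assumptions. On the CellBE, the FFT calls are the part that would be farmed out to the SPEs.

```python
import numpy as np

def make_stepper(n, dx, dt, V, g11, g12, g22, hbar=1.0, m=1.0):
    """Second-order split-operator/FFT stepper for coupled 1D GP equations."""
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)
    half_kin = np.exp(-0.25j * hbar * k**2 * dt / m)   # exp(-i T dt / 2 hbar)

    def step(psi1, psi2):
        # half kinetic step in Fourier space
        psi1 = np.fft.ifft(half_kin * np.fft.fft(psi1))
        psi2 = np.fft.ifft(half_kin * np.fft.fft(psi2))
        # full nonlinear/potential step in real space (couples the species)
        n1, n2 = np.abs(psi1)**2, np.abs(psi2)**2
        psi1 = psi1 * np.exp(-1j * dt * (V + g11 * n1 + g12 * n2) / hbar)
        psi2 = psi2 * np.exp(-1j * dt * (V + g22 * n2 + g12 * n1) / hbar)
        # half kinetic step again
        psi1 = np.fft.ifft(half_kin * np.fft.fft(psi1))
        psi2 = np.fft.ifft(half_kin * np.fft.fft(psi2))
        return psi1, psi2

    return step
```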
GRAPE-6A: A Single-Card GRAPE-6 for Parallel PC-GRAPE Cluster Systems
NASA Astrophysics Data System (ADS)
Fukushige, Toshiyuki; Makino, Junichiro; Kawai, Atsushi
2005-12-01
In this paper, we describe the design and performance of GRAPE-6A, a special-purpose computer for gravitational many-body simulations. It was designed to be used with a PC cluster, in which each node has one GRAPE-6A. Such a configuration is particularly cost-effective in running parallel tree algorithms. Though the use of parallel tree algorithms was possible with the original GRAPE-6 hardware, it was not very cost-effective since a single GRAPE-6 board was still too fast and too expensive. Therefore, we designed GRAPE-6A as a single PCI card to minimize the reproduction cost and to optimize the computing speed. The peak performance is 130 Gflops for one GRAPE-6A board and 3.1 Tflops for our 24 node cluster. We describe the implementation of the tree, TreePM and individual timestep algorithms on both a single GRAPE-6A system and GRAPE-6A cluster. Using the tree algorithm on our 16-node GRAPE-6A system, we can complete a collisionless simulation with 100 million particles (8000 steps) within 10 days.
Coherent diffraction imaging by moving a lens.
Shen, Cheng; Tan, Jiubin; Wei, Ce; Liu, Zhengjun
2016-07-25
A movable lens is used for determining the amplitude and phase on the object plane. The extended fractional Fourier transform is introduced to address single-lens imaging. We put forward a fast algorithm for computing the transform by convolution. Combined with a parallel iterative phase retrieval algorithm, it is applied to reconstruct the complex amplitude of the object. Compared with inline holography, our method is simple and easy to implement, and without the oversampling operation the computational load is lower. The proposed method is also more accurate than direct focusing measurement for the imaging of small objects.
An Approach in Radiation Therapy Treatment Planning: A Fast, GPU-Based Monte Carlo Method.
Karbalaee, Mojtaba; Shahbazi-Gahrouei, Daryoush; Tavakoli, Mohammad B
2017-01-01
An accurate and fast radiation dose calculation is essential for successful radiotherapy. The aim of this study was to implement a new graphics processing unit (GPU) based radiation therapy treatment planning system for accurate and fast dose calculation in radiotherapy centers. A program was written for parallel execution on a GPU. The code was validated against EGSnrc/DOSXYZnrc. Moreover, a semi-automatic, rotary, asymmetric phantom was designed and produced using bone, lung, and soft-tissue equivalent materials. All measurements were performed using a Mapcheck dosimeter. The accuracy of the code was validated using the experimental data obtained from the anthropomorphic phantom as the gold standard. The findings showed that, compared with DOSXYZnrc in the virtual phantom, most of the voxels (>95%) met a <3% dose-difference or 3 mm distance-to-agreement (DTA) criterion. Moreover, in the anthropomorphic phantom, compared to the Mapcheck dose measurements, a <5% dose-difference or 5 mm DTA was observed. The fast calculation speed and high accuracy of the GPU-based Monte Carlo method for dose calculation may be useful in routine radiation therapy centers as the core component of a treatment planning verification system.
Known-plaintext attack on the double phase encoding and its implementation with parallel hardware
NASA Astrophysics Data System (ADS)
Wei, Hengzheng; Peng, Xiang; Liu, Haitao; Feng, Songlin; Gao, Bruce Z.
2008-03-01
A known-plaintext attack on the double random phase encoding (DRPE) scheme, implemented with parallel hardware, is presented. DRPE is one of the most representative optical cryptosystems, developed in the mid-1990s, and has given rise to quite a few variants since then. Although the DRPE encryption system strongly resists brute-force attack, the inherent architecture of DRPE leaves a hidden weakness due to its linear nature. Recently the real security strength of this opto-cryptosystem has been questioned and analyzed from the cryptanalysis point of view. In this presentation, we demonstrate that optical cryptosystems based on the DRPE architecture are vulnerable to known-plaintext attack. With this attack, the two encryption keys in the DRPE can be recovered with the help of the phase retrieval technique. In our approach, we adopt the hybrid input-output (HIO) algorithm to recover the random phase key in the object domain and then infer the key in the frequency domain. A single plaintext-ciphertext pair is sufficient to create the vulnerability; moreover, the attack does not need a specially selected plaintext. The phase retrieval technique based on HIO is an iterative process built on Fourier transforms, so it fits very well into a hardware implementation on a digital signal processor (DSP). We make use of a high-performance DSP to carry out the known-plaintext attack. Compared with a software implementation, the hardware implementation is much faster. The performance of this DSP-based cryptanalysis system is also evaluated.
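The object-domain key recovery rests on the standard HIO update, sketched below in NumPy under the usual single-support-constraint formulation; the names are illustrative and the DRPE-specific key-inference step is omitted.

```python
import numpy as np

def hio_step(g, measured_amp, support, beta=0.9):
    """One hybrid input-output (HIO) iteration for phase retrieval.
    measured_amp: known Fourier magnitudes; support: boolean object-domain mask."""
    G = np.fft.fft2(g)
    G = measured_amp * np.exp(1j * np.angle(G))        # impose Fourier magnitudes
    g_new = np.fft.ifft2(G)
    return np.where(support, g_new, g - beta * g_new)  # HIO feedback outside support
```

The loop body is two FFTs plus elementwise operations, exactly the workload that maps well onto a DSP's fast Fourier-transform pipeline.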
A highly efficient multi-core algorithm for clustering extremely large datasets
2010-01-01
Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. The computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
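The paper's implementation is Java with transactional-memory-inspired synchronization; the data-parallel structure of the k-means assignment step can nevertheless be sketched in a few lines of Python. All names are illustrative, and on some platforms the pool must be created under an `if __name__ == "__main__":` guard.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _assign(args):
    """Nearest-centroid labels for one data chunk (runs on one core)."""
    X, C = args
    d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return d.argmin(axis=1)

def parallel_kmeans(X, k, n_iter=50, workers=4, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    chunks = np.array_split(X, workers)            # data parallelism over cores
    with ProcessPoolExecutor(max_workers=workers) as ex:
        for _ in range(n_iter):
            labels = np.concatenate(
                list(ex.map(_assign, [(c, C) for c in chunks])))
            for j in range(k):                     # keep old centroid if cluster empties
                members = X[labels == j]
                if len(members):
                    C[j] = members.mean(axis=0)
    return labels, C
```

The repeated-runs use case in the abstract (stability and sensitivity analysis) parallelizes one level higher as well, by running independent restarts on separate cores.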
NASA Astrophysics Data System (ADS)
Goossens, Bart; Aelterman, Jan; Luong, Hiêp; Pižurica, Aleksandra; Philips, Wilfried
2011-09-01
The shearlet transform is a recent sibling in the family of geometric image representations that provides a traditional multiresolution analysis combined with a multidirectional analysis. In this paper, we present a fast DFT-based analysis and synthesis scheme for the 2D discrete shearlet transform. Our scheme conforms to the continuous shearlet theory to a high extent, provides perfect numerical reconstruction (up to floating point rounding errors) in a non-iterative scheme and is highly suitable for parallel implementation (e.g. FPGA, GPU). We show that our discrete shearlet representation is also a tight frame and the redundancy factor of the transform is around 2.6, independent of the number of analysis directions. Experimental denoising results indicate that the transform performs the same or even better than several related multiresolution transforms, while having a significantly lower redundancy factor.
Introducing GAMER: A Fast and Accurate Method for Ray-tracing Galaxies Using Procedural Noise
NASA Astrophysics Data System (ADS)
Groeneboom, N. E.; Dahle, H.
2014-03-01
We developed a novel approach for fast and accurate ray-tracing of galaxies using procedural noise fields. Our method allows for efficient and realistic rendering of synthetic galaxy morphologies, where individual components such as the bulge, disk, stars, and dust can be synthesized in different wavelengths. These components follow empirically motivated overall intensity profiles but contain an additional procedural noise component that gives rise to complex natural patterns that mimic interstellar dust and star-forming regions. These patterns produce more realistic-looking galaxy images than using analytical expressions alone. The method is fully parallelized and creates accurate high- and low-resolution images that can be used, for example, in codes simulating strong and weak gravitational lensing. In addition to having a user-friendly graphical user interface, the C++ software package GAMER is easy to implement into an existing code.
Unbiased free energy estimates in fast nonequilibrium transformations using Gaussian mixtures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Procacci, Piero
2015-04-21
In this paper, we present an improved method for obtaining unbiased estimates of the free energy difference between two thermodynamic states using the work distribution measured in nonequilibrium driven experiments connecting these states. The method is based on the assumption that any observed work distribution is given by a mixture of Gaussian distributions, whose normal components are identical in either direction of the nonequilibrium process, with weights regulated by the Crooks theorem. Using the prototypical example of the driven unfolding/folding of deca-alanine, we show that the predicted behavior of the forward and reverse work distributions, assuming a combination of only two Gaussian components with Crooks-derived weights, explains surprisingly well the striking asymmetry in the observed distributions at fast pulling speeds. The proposed methodology opens the way for a perfectly parallel implementation of Jarzynski-based free energy calculations in complex systems.
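For contrast, a minimal sketch of the plain Jarzynski exponential-average estimator that such mixture-based corrections aim to de-bias at fast pulling speeds (work values in units of kT are an assumption):

```python
import numpy as np

def jarzynski_df(work, kT=1.0):
    """Exponential-average (Jarzynski) free-energy estimate from N work values;
    notoriously biased for fast pulling, which mixture-based estimators address."""
    w = np.asarray(work, dtype=float)
    # log-sum-exp form of -kT * ln( (1/N) * sum exp(-W_i/kT) ), for stability
    return -kT * (np.logaddexp.reduce(-w / kT) - np.log(w.size))
```

Each work value comes from an independent nonequilibrium trajectory, which is why the calculation is described as perfectly parallel.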
The Parallel Implementation of Algorithms for Finding the Reflection Symmetry of the Binary Images
NASA Astrophysics Data System (ADS)
Fedotova, S.; Seredin, O.; Kushnir, O.
2017-05-01
In this paper, we investigate an exact method for finding the reflection-symmetry axis of a binary image, based on a brute-force search among all potential symmetry axes. As a measure of symmetry, we use the set-theoretic Jaccard similarity applied to the two subsets of image pixels separated by a candidate axis. The brute-force search reliably finds the axis of approximate symmetry, which can be considered ground truth, but it requires quite a lot of time to process each image. As the first step of our contribution we develop a parallel version of the brute-force algorithm. It allows us to process large image databases and obtain the desired axis of approximate symmetry for each shape in a database. Experimental studies on the "Butterflies" and "Flavia" datasets have shown that the proposed algorithm takes several minutes per image to find a symmetry axis. However, in real-world applications we need computational efficiency that allows the symmetry-axis search to run in real or quasi-real time. So, for the task of fast shape-symmetry calculation on a common multicore PC we elaborated another parallel program, based on the procedure suggested in (Fedotova, 2016). That method takes as an initial axis the one obtained by a superfast comparison of two skeleton primitive sub-chains. This process takes about 0.5 s on a common PC, considerably faster than any of the optimized brute-force methods, including those implemented on a supercomputer. In our experiments, for 70 percent of cases the found axis coincides exactly with the ground-truth one, and for the rest it is very close to it.
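A minimal NumPy sketch of a Jaccard symmetry score for one candidate axis, evaluated on the integer pixel grid; the axis parameterization and names are illustrative assumptions. The brute-force search simply scans a grid of candidate axes, which parallelizes trivially (e.g. one angle range per process).

```python
import numpy as np

def axis_jaccard(img, theta, rho):
    """Jaccard similarity between a binary shape and its reflection about the
    axis x*cos(theta) + y*sin(theta) = rho, on the integer pixel grid."""
    ys, xs = np.nonzero(img)
    pts = np.stack([xs, ys], axis=1).astype(float)
    n = np.array([np.cos(theta), np.sin(theta)])
    d = pts @ n - rho                                 # signed distance to the axis
    refl = np.rint(pts - 2.0 * d[:, None] * n).astype(int)
    a = set(map(tuple, pts.astype(int)))
    b = set(map(tuple, refl))
    return len(a & b) / len(a | b)

# brute force: maximize axis_jaccard over a (theta, rho) grid of candidate axes
```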
High performance Python for direct numerical simulations of turbulent flows
NASA Astrophysics Data System (ADS)
Mortensen, Mikael; Langtangen, Hans Petter
2016-06-01
Direct Numerical Simulation (DNS) of the Navier-Stokes equations is an invaluable research tool in fluid dynamics. Still, there are few publicly available research codes and, due to the heavy number crunching implied, available codes are usually written in low-level languages such as C/C++ or Fortran. In this paper we describe a pure scientific Python pseudo-spectral DNS code that nearly matches the performance of C++ for thousands of processors and billions of unknowns. We also describe a version optimized through Cython that is found to match the speed of C++. The solvers are written from scratch in Python, including the mesh, the MPI domain decomposition, and the temporal integrators. The solvers have been verified and benchmarked on the Shaheen supercomputer at the KAUST supercomputing laboratory, and we are able to show very good scaling up to several thousand cores. A very important part of the implementation is the mesh decomposition (we implement both slab and pencil decompositions) and the 3D parallel Fast Fourier Transform (FFT). The mesh decomposition and FFT routines have been implemented in Python using serial FFT routines (either NumPy, pyFFTW or any other serial FFT module), NumPy array manipulations, and with MPI communications handled by MPI for Python (mpi4py). We show how we are able to execute a 3D parallel FFT in Python for a slab mesh decomposition using 4 lines of compact Python code, for which the parallel performance on Shaheen is found to be slightly better than similar routines provided through the FFTW library. For a pencil mesh decomposition, 7 lines of code are required to execute a transform.
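A compact mpi4py sketch closely modeled on the slab-decomposed forward 3D FFT the paper describes (2D local FFT, global transpose via Alltoall, final 1D FFT); the mesh size N, its divisibility by the process count, and the MPI datatype are assumptions.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()        # number of MPI processes
N = 64                     # global mesh size, assumed divisible by P
Np, Nf = N // P, N // 2 + 1

def fftn_slab(u, u_hat):
    """Forward 3D FFT of a real field stored as slabs of local shape (Np, N, N)."""
    Uc = np.fft.rfft2(u, axes=(1, 2))                  # 2D FFT over the local axes
    # reorder so each process's outgoing block is contiguous, then transpose globally
    u_hat[:] = np.rollaxis(Uc.reshape(Np, P, Np, Nf), 1).reshape(u_hat.shape)
    comm.Alltoall(MPI.IN_PLACE, [u_hat, MPI.DOUBLE_COMPLEX])
    u_hat[:] = np.fft.fft(u_hat, axis=0)               # FFT over the remaining axis
    return u_hat

u = np.random.rand(Np, N, N)
u_hat = np.empty((N, Np, Nf), dtype=np.complex128)
fftn_slab(u, u_hat)
```

Run under mpirun with a process count dividing N; the inverse transform applies the same steps in reverse order.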
UNFOLD-SENSE: a parallel MRI method with self-calibration and artifact suppression.
Madore, Bruno
2004-08-01
This work aims at improving the performance of parallel imaging by using it with our "unaliasing by Fourier-encoding the overlaps in the temporal dimension" (UNFOLD) temporal strategy. A self-calibration method called "self, hybrid referencing with UNFOLD and GRAPPA" (SHRUG) is presented. SHRUG combines the UNFOLD-based sensitivity mapping strategy introduced in the TSENSE method by Kellman et al. (5), with the strategy introduced in the GRAPPA method by Griswold et al. (10). SHRUG merges the two approaches to alleviate their respective limitations, and provides fast self-calibration at any given acceleration factor. UNFOLD-SENSE further includes an UNFOLD artifact suppression scheme to significantly suppress artifacts and amplified noise produced by parallel imaging. This suppression scheme, which was published previously (4), is related to another method that was presented independently as part of TSENSE. While the two are equivalent at accelerations ≤ 2.0, the present approach is shown here to be significantly superior at accelerations > 2.0, with up to double the artifact suppression at high accelerations. Furthermore, a slight modification of Cartesian SENSE is introduced, which allows departures from purely Cartesian sampling grids. This technique, termed variable-density SENSE (vdSENSE), allows the variable-density data required by SHRUG to be reconstructed with the simplicity and fast processing of Cartesian SENSE. UNFOLD-SENSE is given by the combination of SHRUG for sensitivity mapping, vdSENSE for reconstruction, and UNFOLD for artifact/amplified noise suppression. The method was implemented, with online reconstruction, on both an SSFP and a myocardium-perfusion sequence. The results from six patients scanned with UNFOLD-SENSE are presented.
NASA Astrophysics Data System (ADS)
Frickenhaus, Stephan; Hiller, Wolfgang; Best, Meike
The portable software FoSSI is introduced that—in combination with additional free solver software packages—allows for an efficient and scalable parallel solution of large sparse linear equation systems arising in finite element model codes. FoSSI is intended to support rapid model code development, completely hiding the complexity of the underlying solver packages. In particular, the model developer need not be an expert in parallelization and is yet free to switch between different solver packages by simple modifications of the interface call. FoSSI offers an efficient and easy, yet flexible interface to several parallel solvers, most of them available on the web, such as PETSC, AZTEC, MUMPS, PILUT and HYPRE. FoSSI makes use of the concept of handles for vectors, matrices, preconditioners and solvers, which is frequently used in solver libraries. Hence, FoSSI allows for a flexible treatment of several linear equation systems and associated preconditioners at the same time, even in parallel on separate MPI communicators. The second special feature in FoSSI is the task specifier, a combination of keywords, each configuring a certain phase of the solver setup. This enables the user to control a solver through one unique subroutine. Furthermore, FoSSI has rather similar features for all solvers, making a fast solver intercomparison or exchange an easy task. FoSSI is community software, proven in an adaptive 2D atmosphere model and a 3D primitive-equation ocean model, both formulated in finite elements. The present paper discusses perspectives of an OpenMP implementation of parallel iterative solvers based on domain decomposition methods. This approach to OpenMP solvers is rather attractive, as the code for domain-local operations of factorization, preconditioning and matrix-vector products can be readily taken from a sequential implementation that is also suitable for use in an MPI variant. Code development in this direction is at an advanced state under the name ScOPES: the Scalable Open Parallel sparse linear Equations Solver.
The novel high-performance 3-D MT inverse solver
NASA Astrophysics Data System (ADS)
Kruglyakov, Mikhail; Geraskin, Alexey; Kuvshinov, Alexey
2016-04-01
We present a novel, robust, scalable, and fast 3-D magnetotelluric (MT) inverse solver. The solver is written in a multi-language paradigm to make it as efficient, readable and maintainable as possible. The concepts of separation of concerns and single responsibility run through the implementation of the solver. As a forward modelling engine, the modern scalable solver extrEMe, based on a contracting integral equation approach, is used. An iterative gradient-type (quasi-Newton) optimization scheme is invoked to search for the (regularized) inverse problem solution, and an adjoint source approach is used to efficiently calculate the gradient of the misfit. The inverse solver is able to deal with highly detailed and contrasting models, allows for working (separately or jointly) with any type of MT responses, and supports massive parallelization. Moreover, different parallelization strategies implemented in the code allow optimal usage of available computational resources for a given problem statement. To parameterize an inverse domain, the so-called mask parameterization is implemented, which means that one can merge any subset of forward modelling cells in order to account for the (usually) irregular distribution of observation sites. We report results of 3-D numerical experiments aimed at analysing the robustness, performance and scalability of the code. In particular, our computational experiments, carried out on different platforms ranging from modern laptops to the HPC system Piz Daint (the 6th-ranked supercomputer in the world), demonstrate practically linear scalability of the code up to thousands of nodes.
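The inversion loop described above (a quasi-Newton search driven by an adjoint-source gradient) can be sketched in a few lines; here a toy linear operator stands in for the extrEMe forward engine, and the regularization is a plain Tikhonov term.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
G = rng.standard_normal((40, 10))         # toy linear forward operator
def forward(m): return G @ m              # stand-in for the forward engine
def adjoint_gradient(r): return G.T @ r   # stand-in for the adjoint solve

def misfit_and_grad(m, d_obs, lam):
    r = forward(m) - d_obs
    phi = 0.5 * r @ r + 0.5 * lam * m @ m       # regularized misfit
    return phi, adjoint_gradient(r) + lam * m   # gradient via adjoint source

d_obs = forward(rng.standard_normal(10))
res = minimize(misfit_and_grad, np.zeros(10), args=(d_obs, 1e-2),
               jac=True, method="L-BFGS-B")     # quasi-Newton iterations
```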
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.
Sharma, Anuj; Manolakos, Elias S
2015-01-01
Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
MUTILS - a set of efficient modeling tools for multi-core CPUs implemented in MEX
NASA Astrophysics Data System (ADS)
Krotkiewski, Marcin; Dabrowski, Marcin
2013-04-01
The need for computational performance is common in scientific applications, and in particular in numerical simulations, where high resolution models require efficient processing of large amounts of data. Especially in the context of geological problems, the need to increase the model resolution to resolve physical and geometrical complexities seems to have no limits. Alas, the performance of new generations of CPUs no longer improves simply by increasing clock speeds. Current industrial trends are to increase the number of computational cores. As a result, parallel implementations are required in order to fully utilize the potential of new processors, and to study more complex models. We target simulations on small to medium scale shared memory computers: laptops and desktop PCs with ~8 CPU cores and up to tens of GB of memory to high-end servers with ~50 CPU cores and hundreds of GB of memory. In this setting MATLAB is often the environment of choice for scientists that want to implement their own models with little effort. It is a useful general purpose mathematical software package, but due to its versatility some of its functionality is not as efficient as it could be. In particular, the challenges of modern multi-core architectures are not fully addressed. We have developed MILAMIN 2 - an efficient FEM modeling environment written in native MATLAB. Amongst others, MILAMIN provides functions to define model geometry, generate and convert structured and unstructured meshes (also through interfaces to external mesh generators), compute element and system matrices, apply boundary conditions, solve the system of linear equations, address non-linear and transient problems, and perform post-processing. MILAMIN strives to combine the ease of code development and the computational efficiency. Where possible, the code is optimized and/or parallelized within the MATLAB framework. Native MATLAB is augmented with the MUTILS library - a set of MEX functions that implement the computationally intensive, performance critical parts of the code, which we have identified to be bottlenecks. Here, we discuss the functionality and performance of the MUTILS library. Currently, it includes: 1. time and memory efficient assembly of sparse matrices for FEM simulations 2. parallel sparse matrix - vector product with optimizations specific to symmetric matrices and multiple degrees of freedom per node 3. parallel point in triangle location and point in tetrahedron location for unstructured, adaptive 2D and 3D meshes (useful for 'marker in cell' type of methods) 4. parallel FEM interpolation for 2D and 3D meshes of elements of different types and orders, and for different number of degrees of freedom per node 5. a stand-alone, MEX implementation of the Conjugate Gradients iterative solver 6. interface to METIS graph partitioning and a fast implementation of RCM reordering
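Item 5 in the list above is a stand-alone Conjugate Gradients solver; for reference, the matrix-free form of CG that such a MEX function implements looks like this (a generic textbook version, not the MUTILS source):

```python
import numpy as np

def conjugate_gradients(A_mul, b, tol=1e-8, maxit=1000):
    # A_mul(x) computes A @ x, so an optimized sparse matrix-vector
    # product (item 2 above) can be plugged in directly.
    x = np.zeros_like(b)
    r = b - A_mul(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Ap = A_mul(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```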
An efficient numerical method for solving the Boltzmann equation in multidimensions
NASA Astrophysics Data System (ADS)
Dimarco, Giacomo; Loubère, Raphaël; Narski, Jacek; Rey, Thomas
2018-01-01
In this paper we deal with the extension of the Fast Kinetic Scheme (FKS) (Dimarco and Loubère, 2013 [26]), originally constructed for solving the BGK equation, to the more challenging case of the Boltzmann equation. The scheme combines a robust and fast method for treating the transport part, based on an innovative Lagrangian technique, with conservative fast spectral schemes that treat the collision operator by means of an operator splitting approach. This approach, along with several implementation features related to the parallelization of the algorithm, permits the construction of an efficient simulation tool, which is numerically tested against exact and reference solutions on classical problems arising in rarefied gas dynamics. We present results up to the 3D × 3D case for unsteady flows for the Variable Hard Sphere model, which may serve as a benchmark for future comparisons between different numerical methods for solving the multidimensional Boltzmann equation. For this reason, we also provide, for each problem studied, details on the computational cost and memory consumption as well as comparisons with the BGK model or the limit model of the compressible Euler equations.
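The splitting structure described above (a transport step followed by a collision step) is easiest to see on a toy 1D-1V problem; the sketch below uses a periodic semi-Lagrangian shift and a BGK-type relaxation, not the paper's Lagrangian FKS transport or fast spectral Boltzmann operator.

```python
import numpy as np

def bgk_split_step(f, x, v, dt, tau):
    # f[i, j] ~ distribution at position x[i], velocity v[j]; x in [0, L).
    L = x[-1] + (x[1] - x[0])
    # Transport step: f(x, v) <- f(x - v*dt, v), periodic in x.
    for j, vj in enumerate(v):
        f[:, j] = np.interp((x - vj * dt) % L, x, f[:, j])
    # Collision step: relax toward a Maxwellian with the local density
    # (unit temperature and zero mean velocity, for simplicity).
    rho = f.sum(axis=1) * (v[1] - v[0])
    M = rho[:, None] * np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)
    return f + (dt / tau) * (M - f)
```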
NASA Technical Reports Server (NTRS)
Farhat, Nabil H.
1987-01-01
Self-organization and learning are distinctive features of neural nets and processors that set them apart from conventional approaches to signal processing. They lead to self-programmability, which alleviates the problem of programming complexity in artificial neural nets. In this paper, architectures for partitioning an optoelectronic analog of a neural net into distinct layers with a prescribed interconnectivity pattern, to enable stochastic learning by simulated annealing in the context of a Boltzmann machine, are presented. Stochastic learning is of interest because of its relevance to the role of noise in biological neural nets. Practical considerations and methodologies for appreciably accelerating stochastic learning in such a multilayered net are described. These include the use of parallel optical computing of the global energy of the net, the use of fast nonvolatile programmable spatial light modulators to realize fast plasticity, optical generation of random number arrays, and an adaptive noisy thresholding scheme that also makes stochastic learning more biologically plausible. The findings reported predict optoelectronic chips that can be used in the realization of optical learning machines.
High-Dose Neutron Detector Development Using 10B Coated Cells
DOE Office of Scientific and Technical Information (OSTI.GOV)
Menlove, Howard Olsen; Henzlova, Daniela
2016-11-08
During FY16 the boron-lined parallel-plate technology was optimized to fully benefit from its fast timing characteristics in order to enhance its high count rate capability. To facilitate high count rate capability, a novel fast amplifier with timing and operating properties matched to the detector characteristics was developed and implemented in the 8” boron plate detector that was purchased from PDT. Each of the 6 sealed cells was connected to a fast amplifier with corresponding List mode readout from each amplifier. The FY16 work focused on improvements in the boron-10 coating materials and procedures at PDT to significantly improve the neutron detection efficiency. An improvement in the efficiency of a factor of 1.5 was achieved without increasing the metal backing area for the boron coating. This improvement has allowed us to operate the detector in gamma-ray backgrounds that are four orders of magnitude higher than was previously possible while maintaining a relatively high counting efficiency for neutrons. This improvement in the gamma-ray rejection is a key factor in the development of the high dose neutron detector.
A GPU-Accelerated Approach for Feature Tracking in Time-Varying Imagery Datasets.
Peng, Chao; Sahani, Sandip; Rushing, John
2017-10-01
We propose a novel parallel connected component labeling (CCL) algorithm along with efficient out-of-core data management to detect and track feature regions of large time-varying imagery datasets. Our approach contributes to the big data field with parallel algorithms tailored for GPU architectures. We remove the data dependency between frames and achieve pixel-level parallelism. Due to the large size, the entire dataset cannot fit into cached memory. Frames have to be streamed through the memory hierarchy (disk to CPU main memory and then to GPU memory), partitioned, and processed as batches, where each batch is small enough to fit into the GPU. To reconnect the feature regions that are separated due to data partitioning, we present a novel batch merging algorithm to extract the region connection information across multiple batches in a parallel fashion. The information is organized in a memory-efficient structure and supports fast indexing on the GPU. Our experiment uses a commodity workstation equipped with a single GPU. The results show that our approach can efficiently process a weather dataset composed of terabytes of time-varying radar images. The advantages of our approach are demonstrated by comparing to the performance of an efficient CPU cluster implementation which is being used by the weather scientists.
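The cross-batch reconnection step described above amounts to building label equivalences between the border pixels of consecutive batches; a serial union-find rendition of that idea (without the GPU data layout) might look like:

```python
def find(parent, a):
    # Path-halving find.
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def merge_batches(border_prev, border_next, parent):
    # border_prev/border_next: labels along the shared boundary row of two
    # consecutive batches; matching foreground pixels imply equivalence.
    for la, lb in zip(border_prev, border_next):
        if la and lb:                  # 0 encodes background
            union(parent, la, lb)
```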
NASA Astrophysics Data System (ADS)
Akil, Mohamed
2017-05-01
Real-time processing is becoming more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks in image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool and a very data-intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper focuses on a survey of the approaches for parallel implementation of sequential watershed algorithms on multicore general-purpose CPUs: homogeneous multicore processors with shared memory. To achieve an efficient parallel implementation, it is necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we give a comparison of various parallelizations of sequential watershed algorithms on shared-memory multicore architectures. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare OpenMP (an application programming interface for multiprocessing) with Pthreads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
Zhu, Lei; Yin, Qiuyuan; Irwin, David M; Zhang, Shuyi
2015-01-01
Bats are an ideal mammalian group for exploring adaptations to fasting due to their large variety of diets and because fasting is a regular part of their life cycle. Mammals fed on a carbohydrate-rich diet experience a rapid decrease in blood glucose levels during a fast, thus, the development of mechanisms to resist the consequences of regular fasts, experienced on a daily basis, must have been crucial in the evolution of frugivorous bats. Phosphoenolpyruvate carboxykinase 1 (PEPCK1, encoded by the Pck1 gene) is the rate-limiting enzyme in gluconeogenesis and is largely responsible for the maintenance of glucose homeostasis during fasting in fruit-eating bats. To test whether Pck1 has experienced adaptive evolution in frugivorous bats, we obtained Pck1 coding sequence from 20 species of bats, including five Old World fruit bats (OWFBs) (Pteropodidae) and two New World fruit bats (NWFBs) (Phyllostomidae). Our molecular evolutionary analyses of these sequences revealed that Pck1 was under purifying selection in both Old World and New World fruit bats with no evidence of positive selection detected in either ancestral branch leading to fruit bats. Interestingly, however, six specific amino acid substitutions were detected on the ancestral lineage of OWFBs. In addition, we found considerable evidence for parallel evolution, at the amino acid level, between the PEPCK1 sequences of Old World fruit bats and New World fruit bats. Test for parallel evolution showed that four parallel substitutions (Q276R, R503H, I558V and Q593R) were driven by natural selection. Our study provides evidence that Pck1 underwent parallel evolution between Old World and New World fruit bats, two lineages of mammals that feed on a carbohydrate-rich diet and experience regular periods of fasting as part of their life cycle.
Qi, Xin; Xing, Fuyong; Foran, David J.; Yang, Lin
2013-01-01
Automated image analysis of histopathology specimens could potentially provide support for early detection and improved characterization of breast cancer. Automated segmentation of the cells comprising imaged tissue microarrays (TMAs) is a prerequisite for any subsequent quantitative analysis. Unfortunately, crowding and overlapping of cells present significant challenges for most traditional segmentation algorithms. In this paper, we propose a novel algorithm which can reliably separate touching cells in hematoxylin-stained breast TMA specimens which have been acquired using a standard RGB camera. The algorithm is composed of two steps. It begins with a fast, reliable object center localization approach which utilizes single-path voting followed by mean-shift clustering. Next, the contour of each cell is obtained using a level set algorithm based on an interactive model. We compared the experimental results with those reported in the most current literature. Finally, performance was evaluated by comparing the pixel-wise accuracy provided by human experts with that produced by the new automated segmentation algorithm. The method was systematically tested on 234 image patches exhibiting dense overlap and containing more than 2200 cells. It was also tested on whole slide images including blood smears and tissue microarrays containing thousands of cells. Since the voting step of the seed detection algorithm is well suited for parallelization, a parallel version of the algorithm was implemented using graphics processing units (GPUs), which resulted in significant speed-up over the C/C++ implementation. PMID:22167559
Bin-Hash Indexing: A Parallel Method for Fast Query Processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bethel, Edward W; Gosink, Luke J.; Wu, Kesheng
2008-06-27
This paper presents a new parallel indexing data structure for answering queries. The index, called Bin-Hash, offers extremely high levels of concurrency, and is therefore well-suited for the emerging commodity of parallel processors, such as multi-cores, cell processors, and general purpose graphics processing units (GPU). The Bin-Hash approach first bins the base data, and then partitions and separately stores the values in each bin as a perfect spatial hash table. To answer a query, we first determine whether or not a record satisfies the query conditions based on the bin boundaries. For the bins with records that cannot be resolved, we examine the spatial hash tables. The procedures for examining the bin numbers and the spatial hash tables offer the maximum possible level of concurrency; all records are able to be evaluated by our procedure independently in parallel. Additionally, our Bin-Hash procedures access much smaller amounts of data than similar parallel methods, such as the projection index. This smaller data footprint is critical for certain parallel processors, like GPUs, where memory resources are limited. To demonstrate the effectiveness of Bin-Hash, we implement it on a GPU using the data-parallel programming language CUDA. The concurrency offered by the Bin-Hash index allows us to fully utilize the GPU's massive parallelism in our work; over 12,000 records can be simultaneously evaluated at any one time. We show that our new query processing method is an order of magnitude faster than current state-of-the-art CPU-based indexing technologies. Additionally, we compare our performance to existing GPU-based projection index strategies.
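A CPU-side caricature of the two-level query evaluation described above: bins whose boundaries fall strictly inside the query range are accepted wholesale, and only records in the two boundary bins are checked exactly (plain arrays stand in for the perfect spatial hash tables).

```python
import numpy as np

def build(data, nbins):
    edges = np.linspace(data.min(), data.max(), nbins + 1)
    bin_id = np.clip(np.digitize(data, edges) - 1, 0, nbins - 1)
    return edges, bin_id

def range_query(data, edges, bin_id, lo, hi):
    lo_bin, hi_bin = np.searchsorted(edges, [lo, hi]) - 1
    hits = (bin_id > lo_bin) & (bin_id < hi_bin)     # bins fully inside
    edge_bins = (bin_id == lo_bin) | (bin_id == hi_bin)
    hits[edge_bins] = (data[edge_bins] >= lo) & (data[edge_bins] <= hi)
    return hits
```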
A fast non-Fourier method for Landau-fluid operators
NASA Astrophysics Data System (ADS)
Dimits, A. M.; Joseph, I.; Umansky, M. V.
2014-05-01
An efficient and versatile non-Fourier method for the computation of Landau-fluid (LF) closure operators [Hammett and Perkins, Phys. Rev. Lett. 64, 3019 (1990)] is presented, based on an approximation by a sum of modified-Helmholtz-equation solves (SMHS) in configuration space. This method can yield fast-Fourier-like scaling of the computational time requirements and also provides a very compact data representation of these operators, even for plasmas with large spatial nonuniformity. As a result, the method can give significant savings compared with direct application of "delocalization kernels" [e.g., Schurtz et al., Phys. Plasmas 7, 4238 (2000)], both in terms of computational cost and memory requirements. The method is of interest for the implementation of Landau-fluid models in situations where the spatial nonuniformity, particular geometry, or boundary conditions render a Fourier implementation difficult or impossible. Systematic procedures have been developed to optimize the resulting operators for accuracy and computational cost. The four-moment Landau-fluid model of Hammett and Perkins has been implemented in the BOUT++ code using the SMHS method for LF closure. Excellent agreement has been obtained for the one-dimensional plasma density response function between driven initial-value calculations using this BOUT++ implementation and matrix eigenvalue calculations using both Fourier and SMHS non-Fourier implementations of the LF closures. The SMHS method also forms the basis for the implementation, which has been carried out in the BOUT++ code, of the parallel and toroidal drift-resonance LF closures. The method is a key enabling tool for the extension of gyro-Landau-fluid models [e.g., Beer and Hammett, Phys. Plasmas 3, 4046 (1996)] to codes that treat regions with strong profile variation, such as the tokamak edge and scrapeoff-layer.
FUX-Sim: Implementation of a fast universal simulation/reconstruction framework for X-ray systems.
Abella, Monica; Serrano, Estefania; Garcia-Blas, Javier; García, Ines; de Molina, Claudia; Carretero, Jesus; Desco, Manuel
2017-01-01
The availability of digital X-ray detectors, together with advances in reconstruction algorithms, creates an opportunity for bringing 3D capabilities to conventional radiology systems. The downside is that reconstruction algorithms for non-standard acquisition protocols are generally based on iterative approaches that involve a high computational burden. The development of new flexible X-ray systems could benefit from computer simulations, which may enable performance to be checked before expensive real systems are implemented. The development of simulation/reconstruction algorithms in this context poses three main difficulties. First, the algorithms deal with large data volumes and are computationally expensive, thus leading to the need for hardware and software optimizations. Second, these optimizations are limited by the high flexibility required to explore new scanning geometries, including fully configurable positioning of source and detector elements. And third, the evolution of the various hardware setups increases the effort required for maintaining and adapting the implementations to current and future programming models. Previous works lack support for completely flexible geometries and/or compatibility with multiple programming models and platforms. In this paper, we present FUX-Sim, a novel X-ray simulation/reconstruction framework that was designed to be flexible and fast. Optimized implementation for different families of GPUs (CUDA and OpenCL) and multi-core CPUs was achieved thanks to a modularized approach based on a layered architecture and parallel implementation of the algorithms for both architectures. A detailed performance evaluation demonstrates that for different system configurations and hardware platforms, FUX-Sim maximizes performance with the CUDA programming model (5 times faster than other state-of-the-art implementations). Furthermore, the CPU and OpenCL programming models allow FUX-Sim to be executed over a wide range of hardware platforms.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Malony, Allen D; Shende, Sameer
This is the final progress report for the FastOS (Phase 2) (FastOS-2) project with Argonne National Laboratory and the University of Oregon (UO). The project started at UO on July 1, 2008 and ran until April 30, 2010, at which time a six-month no-cost extension began. The FastOS-2 work at UO delivered excellent results in all research work areas: * scalable parallel monitoring * kernel-level performance measurement * parallel I/O system measurement * large-scale and hybrid application performance measurement * online scalable performance data reduction and analysis * binary instrumentation
Parallelizing alternating direction implicit solver on GPUs
USDA-ARS?s Scientific Manuscript database
We present a parallel Alternating Direction Implicit (ADI) solver on GPUs. Our implementation significantly improves existing implementations in two aspects. First, we address the scalability issue of existing Parallel Cyclic Reduction (PCR) implementations by eliminating their hardware resource con...
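To make the structure concrete: each ADI half-step is implicit along one axis and explicit along the other, so it reduces to many independent tridiagonal solves, which is exactly what Parallel Cyclic Reduction parallelizes on the GPU. A serial numpy sketch of one half-step for the 2D heat equation (periodic in the explicit direction, r = alpha*dt/dx^2) is shown below; it is a reference formulation, not the paper's GPU code.

```python
import numpy as np
from scipy.linalg import solve_banded

def adi_half_step(u, r):
    # Implicit in axis 0, explicit in axis 1; roles swap each half-step.
    n = u.shape[0]
    ab = np.zeros((3, n))
    ab[0, 1:] = -r            # superdiagonal
    ab[1, :] = 1 + 2 * r      # main diagonal
    ab[2, :-1] = -r           # subdiagonal
    rhs = u + r * (np.roll(u, 1, axis=1) - 2 * u + np.roll(u, -1, axis=1))
    # One independent tridiagonal system per grid column.
    cols = [solve_banded((1, 1), ab, rhs[:, j]) for j in range(u.shape[1])]
    return np.stack(cols, axis=1)
```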
The geospatial data quality REST API for primary biodiversity data
Otegui, Javier; Guralnick, Robert P.
2016-01-01
Summary: We present a REST web service to assess the geospatial quality of primary biodiversity data. It enables access to basic and advanced functions to detect completeness and consistency issues as well as general errors in the provided record or set of records. The API uses JSON for data interchange and efficient parallelization techniques for fast assessments of large datasets. Availability and implementation: The Geospatial Data Quality API is part of the VertNet set of APIs. It can be accessed at http://api-geospatial.vertnet-portal.appspot.com/geospatial and is already implemented in the VertNet data portal for quality reporting. Source code is freely available under GPL license from http://www.github.com/vertnet/api-geospatial. Contact: javier.otegui@gmail.com or rguralnick@flmnh.ufl.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26833340
NASA Astrophysics Data System (ADS)
Drabik, Timothy J.; Lee, Sing H.
1986-11-01
The intrinsic parallelism characteristics of easily realizable optical SIMD arrays prompt their present consideration in the implementation of highly structured algorithms for the numerical solution of multidimensional partial differential equations and the computation of fast numerical transforms. Attention is given to a system, comprising several spatial light modulators (SLMs), an optical read/write memory, and a functional block, which performs simple, space-invariant shifts on images with sufficient flexibility to implement the fastest known methods for partial differential equations as well as a wide variety of numerical transforms in two or more dimensions. Either fixed or floating-point arithmetic may be used. A performance projection of more than 1 billion floating point operations/sec using SLMs with 1000 x 1000-resolution and operating at 1-MHz frame rates is made.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chrisochoides, N.; Sukup, F.
In this paper we present a parallel implementation of the Bowyer-Watson (BW) algorithm using the task-parallel programming model. The BW algorithm constitutes an ideal mesh refinement strategy for implementing a large class of unstructured mesh generation techniques on both sequential and parallel computers, by preventing the need for global mesh refinement. Its implementation on distributed memory multicomputers using the traditional data-parallel model has been proven very inefficient due to excessive synchronization needed among processors. In this paper we demonstrate that with the task-parallel model we can tolerate synchronization costs inherent to data-parallel methods by exploring concurrency at the processor level. Our preliminary performance data indicate that the task-parallel approach: (i) is almost four times faster than the existing data-parallel methods, (ii) scales linearly, and (iii) introduces minimum overheads compared to the "best" sequential implementation of the BW algorithm.
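For reference, one Bowyer-Watson point insertion (the kernel being parallelized above) consists of collecting the "cavity" of triangles whose circumcircles contain the new point and re-triangulating the cavity boundary; a compact serial Python sketch, omitting the super-triangle setup and orientation bookkeeping:

```python
import numpy as np

def in_circumcircle(tri, p, pts):
    # Positive determinant <=> p inside circumcircle (CCW triangles assumed).
    a, b, c = pts[list(tri)]
    rows = [[q[0] - p[0], q[1] - p[1], (q[0] - p[0])**2 + (q[1] - p[1])**2]
            for q in (a, b, c)]
    return np.linalg.det(np.array(rows)) > 0

def insert_point(triangles, pts, p):
    pts.append(p)
    pi = len(pts) - 1
    arr = np.asarray(pts)
    bad = [t for t in triangles if in_circumcircle(t, p, arr)]
    edges = {}
    for t in bad:                      # cavity boundary = edges used once
        for e in ((t[0], t[1]), (t[1], t[2]), (t[2], t[0])):
            key = tuple(sorted(e))
            edges[key] = edges.get(key, 0) + 1
    triangles = [t for t in triangles if t not in bad]
    triangles += [(e[0], e[1], pi) for e, n in edges.items() if n == 1]
    return triangles
```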
streamgap-pepper: Effects of peppering streams with many small impacts
NASA Astrophysics Data System (ADS)
Bovy, Jo; Erkal, Denis; Sanders, Jason
2017-02-01
streamgap-pepper computes the effect of subhalo fly-bys on cold tidal streams based on the action-angle representation of streams. A line-of-parallel-angle approach is used to calculate the perturbed distribution function of a given stream segment by undoing the effect of all impacts. This approach allows one to compute the perturbed stream density and track in any coordinate system in minutes for realizations of the subhalo distribution down to 10^5 Msun, accounting for the stream's internal dispersion and overlapping impacts. This code uses galpy (ascl:1411.008) and the streampepperdf.py galpy extension, which implements the fast calculation of the perturbed stream structure.
Optical frequency comb Faraday rotation spectroscopy
NASA Astrophysics Data System (ADS)
Johansson, Alexandra C.; Westberg, Jonas; Wysocki, Gerard; Foltynowicz, Aleksandra
2018-05-01
We demonstrate optical frequency comb Faraday rotation spectroscopy (OFC-FRS) for broadband interference-free detection of paramagnetic species. The system is based on a femtosecond doubly resonant optical parametric oscillator and a fast-scanning Fourier transform spectrometer (FTS). The sample is placed in a DC magnetic field parallel to the light propagation. Efficient background suppression is implemented via switching the direction of the field on consecutive FTS scans and subtracting the consecutive spectra, which enables long-term averaging. In this first demonstration, we measure the entire Q- and R-branches of the fundamental band of nitric oxide in the 5.2-5.4 µm range and achieve good agreement with a theoretical model.
Maximum-Likelihood Estimation With a Contracting-Grid Search Algorithm
Hesterman, Jacob Y.; Caucci, Luca; Kupinski, Matthew A.; Barrett, Harrison H.; Furenlid, Lars R.
2010-01-01
A fast search algorithm capable of operating in multi-dimensional spaces is introduced. As a sample application, we demonstrate its utility in the 2D and 3D maximum-likelihood position-estimation problem that arises in the processing of PMT signals to derive interaction locations in compact gamma cameras. We demonstrate that the algorithm can be parallelized in pipelines, and thereby efficiently implemented in specialized hardware, such as field-programmable gate arrays (FPGAs). A 2D implementation of the algorithm is achieved in Cell/BE processors, resulting in processing speeds above one million events per second, which is a 20× increase in speed over a conventional desktop machine. Graphics processing units (GPUs) are used for a 3D application of the algorithm, resulting in processing speeds of nearly 250,000 events per second which is a 250× increase in speed over a conventional desktop machine. These implementations indicate the viability of the algorithm for use in real-time imaging applications. PMID:20824155
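The contracting-grid idea above is simple enough to state in a dozen lines: evaluate the likelihood on a coarse grid, re-center on the best node, halve the span, and repeat; all grid evaluations within an iteration are independent, which is what makes the pipelined FPGA/GPU mapping natural. A 2D sketch (the objective loglike is supplied by the caller):

```python
import numpy as np

def contracting_grid(loglike, center, span, k=5, iters=8):
    center = np.asarray(center, dtype=float)
    for _ in range(iters):
        xs = center[0] + np.linspace(-span / 2, span / 2, k)
        ys = center[1] + np.linspace(-span / 2, span / 2, k)
        vals = np.array([[loglike(x, y) for y in ys] for x in xs])
        i, j = np.unravel_index(np.argmax(vals), vals.shape)
        center = np.array([xs[i], ys[j]])   # re-center on the best node
        span /= 2                           # contract the grid
    return center
```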
IGA-ADS: Isogeometric analysis FEM using ADS solver
NASA Astrophysics Data System (ADS)
Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhammad Amber; Pingali, Keshav
2017-08-01
In this paper we present a fast explicit solver for the solution of non-stationary problems using L2 projections with the isogeometric finite element method. The solver has been implemented within the GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, the implementation of three exemplary PDEs, and the execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near-perfect scalability on the Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).
Rodríguez, Manuel; Magdaleno, Eduardo; Pérez, Fernando; García, Cristhian
2017-03-28
Non-equispaced Fast Fourier transform (NFFT) is a very important algorithm in several technological and scientific areas such as synthetic aperture radar, computational photography, medical imaging, telecommunications, seismic analysis and so on. However, its computational complexity is high. In this paper, we describe an efficient NFFT implementation with a hardware coprocessor using an All-Programmable System-on-Chip (APSoC). This is a hybrid device that employs an Advanced RISC Machine (ARM) as the Processing System, with Programmable Logic for high-performance digital signal processing through parallelism and pipeline techniques. The algorithm has been coded in C language with pragma directives to optimize the architecture of the system. We have used the novel Software Defined System-on-Chip (SDSoC) development tool, which simplifies the interface and partitioning between hardware and software. This provides shorter development cycles and iterative improvements by exploring several architectures of the global system. The computational results show that hardware acceleration significantly outperformed the software-based implementation.
Fast High Resolution Volume Carving for 3D Plant Shoot Reconstruction
Scharr, Hanno; Briese, Christoph; Embgenbroich, Patrick; Fischbach, Andreas; Fiorani, Fabio; Müller-Linow, Mark
2017-01-01
Volume carving is a well established method for visual hull reconstruction and has been successfully applied in plant phenotyping, especially for 3d reconstruction of small plants and seeds. When imaging larger plants at still relatively high spatial resolution (≤1 mm), well known implementations become slow or have prohibitively large memory needs. Here we present and evaluate a computationally efficient algorithm for volume carving, allowing e.g., 3D reconstruction of plant shoots. It combines a well-known multi-grid representation called “Octree” with an efficient image region integration scheme called “Integral image.” Speedup with respect to less efficient octree implementations is about 2 orders of magnitude, due to the introduced refinement strategy “Mark and refine.” Speedup is about a factor 1.6 compared to a highly optimized GPU implementation using equidistant voxel grids, even without using any parallelization. We demonstrate the application of this method for trait derivation of banana and maize plants. PMID:29033961
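The octree-plus-integral-image combination above hinges on one O(1) query: how many silhouette pixels fall inside the projected bounding rectangle of an octree cell. A sketch of that test follows; project(), which maps a cube to a pixel rectangle for a given camera, is an assumed helper.

```python
import numpy as np

def integral_image(mask):
    return mask.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    # Number of foreground pixels in mask[r0:r1, c0:c1].
    s = ii[r1 - 1, c1 - 1]
    if r0: s -= ii[r0 - 1, c1 - 1]
    if c0: s -= ii[r1 - 1, c0 - 1]
    if r0 and c0: s += ii[r0 - 1, c0 - 1]
    return s

def classify_cube(cube, views):
    # views: (integral_image, project) pairs, one per calibrated camera.
    status = "inside"
    for ii, project in views:
        r0, c0, r1, c1 = project(cube)
        n = rect_sum(ii, r0, c0, r1, c1)
        if n == 0:
            return "outside"       # empty in one view: carve the cell away
        if n < (r1 - r0) * (c1 - c0):
            status = "refine"      # mixed: mark for octree refinement
    return status
```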
A fast new algorithm for a robot neurocontroller using inverse QR decomposition
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morris, A.S.; Khemaissia, S.
2000-01-01
A new adaptive neural network controller for robots is presented. The controller is based on direct adaptive techniques. Unlike many neural network controllers in the literature, inverse dynamical model evaluation is not required. A numerically robust, computationally efficient processing scheme for neural network weight estimation is described, namely, the inverse QR decomposition (INVQR). The inverse QR decomposition and a weighted recursive least-squares (WRLS) method for neural network weight estimation are derived using Cholesky factorization of the data matrix. The algorithm that performs the efficient INVQR of the underlying space-time data matrix may be implemented in parallel on a triangular array. Furthermore, its systolic architecture is well suited for VLSI implementation. Another important benefit of the INVQR decomposition is that it solves directly for the time-recursive least-squares filter vector, while avoiding the sequential back-substitution step required by the QR decomposition approaches.
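For context, the quantity the INVQR scheme computes is the standard exponentially-weighted recursive least-squares update; the plain covariance-form recursion is shown below, whereas the paper's contribution is computing the same filter vector via the inverse QR factorization, which is numerically more robust and systolic-array friendly.

```python
import numpy as np

def rls_update(w, P, x, d, lam=0.99):
    # w: weight vector, P: inverse correlation matrix,
    # x: input vector, d: desired output, lam: forgetting factor.
    Px = P @ x
    k = Px / (lam + x @ Px)        # gain vector
    e = d - w @ x                  # a priori error
    w = w + k * e
    P = (P - np.outer(k, Px)) / lam
    return w, P
```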
Accelerating DNA analysis applications on GPU clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tumeo, Antonino; Villa, Oreste
DNA analysis is an emerging application of high-performance bioinformatics. Modern sequencing machines are able to provide, in a few hours, large input streams of data which need to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and quickly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. High performance systems are a promising platform to accelerate this algorithm, which is computationally intensive but also inherently parallel. Nowadays, high performance systems also include heterogeneous processing elements, such as Graphics Processing Units (GPUs), to further accelerate parallel algorithms. Unfortunately, the Aho-Corasick algorithm exhibits large performance variabilities, depending on the size of the input streams, on the number of patterns to search and on the number of matches, and poses significant challenges for current high performance software and hardware implementations. An adequate mapping of the algorithm on the target architecture, coping with the limits of the underlying hardware, is required to reach the desired high throughputs. Load balancing also plays a crucial role when considering the limited bandwidth among the nodes of these systems. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high performance clusters accelerated with GPUs. We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU and then present an MPI based implementation for a heterogeneous high performance cluster. We compare this implementation to MPI and MPI with pthreads based implementations for a homogeneous cluster of x86 processors, discussing the stability vs. the performance and the scaling of the solutions, taking into consideration aspects such as the bandwidth among the different nodes.
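As a serial reference point for the algorithm named above, a compact Aho-Corasick automaton (goto/fail/output construction plus streaming search) fits in about thirty lines; the GPU work consists of re-laying-out and partitioning exactly this structure.

```python
from collections import deque

def build(patterns):
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())
    while q:                        # BFS fills failure links level by level
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out

def search(text, goto, fail, out):
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits += [(i, p) for p in out[s]]   # (end index, matched pattern)
    return hits
```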
GPU-based simulation of optical propagation through turbulence for active and passive imaging
NASA Astrophysics Data System (ADS)
Monnier, Goulven; Duval, François-Régis; Amram, Solène
2014-10-01
IMOTEP is a GPU-based (Graphical Processing Units) software relying on a fast parallel implementation of Fresnel diffraction through successive phase screens. Its applications include active imaging, laser telemetry and passive imaging through turbulence with anisoplanatic spatial and temporal fluctuations. Thanks to parallel implementation on GPU, speedups ranging from 40X to 70X are achieved. The present paper gives a brief overview of IMOTEP models, algorithms, implementation and user interface. It then focuses on major improvements recently brought to the anisoplanatic imaging simulation method. Previously, we took advantage of the computational power offered by the GPU to develop a simulation method based on large series of deterministic realisations of the PSF distorted by turbulence. The phase screen propagation algorithm, by reproducing higher moments of the incident wavefront distortion, provides realistic PSFs. However, we first used a coarse Gaussian model to fit the numerical PSFs and characterise their spatial statistics through only 3 parameters (two-dimensional displacements of centroid and width). This approach was unable to reproduce the effects related to the details of the PSF structure, especially the "speckles" leading to prominent high-frequency content in short-exposure images. To overcome this limitation, we recently implemented a new empirical model of the PSF, based on Principal Components Analysis (PCA), intended to capture most of the PSF complexity. The GPU implementation allows estimating and handling efficiently the numerous (up to several hundreds) principal components typically required under the strong turbulence regime. A first demanding computational step involves PCA, phase screen propagation and covariance estimates. In a second step, realistic instantaneous images, fully accounting for anisoplanatic effects, are quickly generated. Preliminary results are presented.
The OpenMP Implementation of NAS Parallel Benchmarks and its Performance
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry
1999-01-01
As the new ccNUMA architecture became popular in recent years, parallel programming with compiler directives on these machines has evolved to accommodate new needs. In this study, we examine the effectiveness of OpenMP directives for parallelizing the NAS Parallel Benchmarks. Implementation details will be discussed and performance will be compared with the MPI implementation. We have demonstrated that OpenMP can achieve very good results for parallelization on a shared memory system, but effective use of memory and cache is very important.
Research on the Application of Fast-steering Mirror in Stellar Interferometer
NASA Astrophysics Data System (ADS)
Mei, R.; Hu, Z. W.; Xu, T.; Sun, C. S.
2017-07-01
For a stellar interferometer, the fast-steering mirror (FSM) is widely utilized to correct wavefront tilt caused by atmospheric turbulence and internal instrumental vibration, due to its high resolution and fast response frequency. In this study, the non-coplanar error between the FSM and the actuator deflection axis introduced by manufacture, assembly, and adjustment is analyzed. Via a numerical method, the additional optical path difference (OPD) caused by the above factors is studied, and its effects on the tracking accuracy of the stellar interferometer are also discussed. On the other hand, the starlight parallelism between the beams of the two arms is one of the main factors in the loss of fringe visibility. By analyzing the influence of wavefront tilt caused by atmospheric turbulence on fringe visibility, a simple and efficient real-time correction scheme for starlight parallelism is proposed, based on a single array detector. The feasibility of this scheme is demonstrated by laboratory experiment. The results show that, after correction by the fast-steering mirror, the starlight parallelism preliminarily meets the wavefront-tilt requirement of the stellar interferometer.
Implementation of a model of emergency care in an Australian hospital.
Millichamp, Tracey; Bakon, Shannon; Christensen, Martin; Stock, Kate; Howarth, Sarah
2017-11-10
Emergency departments are characterised by a fast-paced, quick turnover and high acuity workload, therefore appropriate staffing is vital to ensure positive patient outcomes. Models of care are frameworks in which safe and effective patient-to-nurse ratios can be ensured. The aim of this study was to implement a supportive and transparent model of emergency nursing care that provides structure - regardless of nursing staff profile, business or other demands; improvement to nursing workloads; and promotes individual responsibility and accountability for patient care. A convergent parallel mixed-method approach was used. Quantitative data were analysed using descriptive statistics and the qualitative data used a thematic analysis to identify recurrent themes. Data post-implementation of the model of emergency nursing care indicate improved staff satisfaction in relation to workload, patient care and support structures. The development and implementation of a model of care in an emergency department improved staff workload and staff's perception of their ability to provide care.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xu, Zuwei; Zhao, Haibo, E-mail: klinsmannzhb@163.com; Zheng, Chuguang
2015-01-15
This paper proposes a comprehensive framework for accelerating population balance-Monte Carlo (PBMC) simulation of particle coagulation dynamics. By combining a Markov jump model, a weighted majorant kernel and GPU (graphics processing unit) parallel computing, a significant gain in computational efficiency is achieved. The Markov jump model constructs a coagulation-rule matrix of differentially-weighted simulation particles, so as to capture the time evolution of the particle size distribution with low statistical noise over the full size range and, as far as possible, to reduce the number of time loopings. Here three coagulation rules are highlighted, and it is found that constructing an appropriate coagulation rule provides a route to attain a compromise between the accuracy and cost of PBMC methods. Further, in order to avoid double looping over all simulation particles when considering two-particle events (typically, particle coagulation), the weighted majorant kernel is introduced to estimate the maximum coagulation rates used for acceptance-rejection processes by single-looping over all particles, and meanwhile the mean time-step of the coagulation event is estimated by summing the coagulation kernels of rejected and accepted particle pairs. The computational load of these fast differentially-weighted PBMC simulations (based on the Markov jump model) is reduced greatly, to be proportional to the number of simulation particles in a zero-dimensional system (single cell). Finally, for a spatially inhomogeneous multi-dimensional (multi-cell) simulation, the proposed fast PBMC is performed in each cell, and multiple cells are processed in parallel by the multiple cores of a GPU, which can execute massively threaded data-parallel tasks to obtain a remarkable speedup ratio (compared with CPU computation, the speedup ratio of GPU parallel computing is as high as 200 in a case of 100 cells with 10 000 simulation particles per cell). These accelerating approaches to PBMC are demonstrated in a physically realistic Brownian coagulation case. The computational accuracy is validated against a benchmark solution of the discrete-sectional method. The simulation results show that the comprehensive approach can attain a very favorable improvement in cost without sacrificing computational accuracy.
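The majorant-kernel trick above can be illustrated with a toy acceptance-rejection pair selection: propose candidate pairs from a cheap bound K_hat(i, j) = c*(v_i + v_j) >= K(v_i, v_j), sampled with single loops over particles, then thin to the true kernel. The linear-sum majorant and the helper names are illustrative assumptions, not the paper's weighted scheme.

```python
import numpy as np

def pick_pair(v, K, c, rng):
    # Propose (i, j) with probability proportional to v_i + v_j:
    # with prob 1/2 draw i ~ v and j uniformly, else the reverse.
    p = v / v.sum()
    n = len(v)
    while True:
        if rng.random() < 0.5:
            i, j = rng.choice(n, p=p), rng.integers(n)
        else:
            i, j = rng.integers(n), rng.choice(n, p=p)
        if i == j:
            continue
        # Thin the proposal back to the true kernel K <= c*(v_i + v_j).
        if rng.random() < K(v[i], v[j]) / (c * (v[i] + v[j])):
            return int(i), int(j)

# Example call, using a toy constant kernel bounded by c*(v_i + v_j):
# rng = np.random.default_rng(0)
# i, j = pick_pair(np.ones(1000), lambda a, b: 1.0, c=1.0, rng=rng)
```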
Fast I/O for Massively Parallel Applications
NASA Technical Reports Server (NTRS)
OKeefe, Matthew T.
1996-01-01
The two primary goals of this report were the design, construction and modeling of parallel disk arrays for scientific visualization and animation, and a study of the I/O requirements of highly parallel applications. In addition, further work in parallel display systems was required to project and animate the very high-resolution frames resulting from our supercomputing simulations in ocean circulation and compressible gas dynamics.
Parallel optimization algorithms and their implementation in VLSI design
NASA Technical Reports Server (NTRS)
Lee, G.; Feeley, J. J.
1991-01-01
Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.
NASA Astrophysics Data System (ADS)
Bogdanov, Valery L.; Boyce-Jacino, Michael
1999-05-01
Confined arrays of biochemical probes deposited on a solid support surface (an analytical microarray or 'chip') provide an opportunity to analyze multiple reactions simultaneously. Microarrays are increasingly used in genetics, medicine and environmental scanning as research and analytical instruments. The power of microarray technology comes from its parallelism, which grows with array miniaturization, minimization of reagent volume per reaction site and reaction multiplexing. An optical detector of microarray signals should combine high sensitivity with spatial and spectral resolution. Additionally, low cost and a high processing rate are needed to transfer microarray technology into biomedical practice. We designed an imager that provides confocal and complete-spectrum detection of an entire fluorescently-labeled microarray in parallel. The imager uses a microlens array, a non-slit spectral decomposer, and a highly sensitive detector (cooled CCD). Two imaging channels provide simultaneous detection of localization, integrated and spectral intensities for each reaction site in the microarray. A dimensional matching between the microarray and the imager's optics eliminates all moving parts in the instrumentation, enabling highly informative, fast and low-cost microarray detection. We report the theory of confocal hyperspectral imaging with a microlens array and experimental data for the implementation of the developed imager to detect a fluorescently labeled microarray with a density of approximately 10^3 sites per cm^2.
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS
Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.
2014-01-01
Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
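The active-list structure described above is what distinguishes these solvers from strictly ordered fast-marching: all nodes currently on the list can be updated independently, which maps directly to SIMD hardware. A minimal 2D Cartesian-grid rendition of that update loop (the paper works on unstructured tetrahedral meshes) looks like:

```python
import numpy as np

def neighbors(p, shape):
    i, j = p
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in cand if 0 <= a < shape[0] and 0 <= b < shape[1]]

def local_update(T, f, i, j, h):
    # Upwind solve of |grad T| = 1/f at node (i, j).
    a = min(T[i - 1, j] if i > 0 else np.inf,
            T[i + 1, j] if i < T.shape[0] - 1 else np.inf)
    b = min(T[i, j - 1] if j > 0 else np.inf,
            T[i, j + 1] if j < T.shape[1] - 1 else np.inf)
    if abs(a - b) >= h / f[i, j]:
        return min(a, b) + h / f[i, j]
    return 0.5 * (a + b + np.sqrt(2 * (h / f[i, j])**2 - (a - b)**2))

def eikonal_fim(f, seeds, h=1.0, tol=1e-8):
    T = np.full(f.shape, np.inf)
    for s in seeds:
        T[s] = 0.0
    active = {n for s in seeds for n in neighbors(s, f.shape)}
    while active:
        nxt = set()
        for (i, j) in active:          # updates in a sweep are independent
            t_new = local_update(T, f, i, j, h)
            if T[i, j] - t_new > tol:
                T[i, j] = t_new
                nxt.update(neighbors((i, j), f.shape))
        active = nxt
    return T
```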
Development of radiation tolerant monolithic active pixel sensors with fast column parallel read-out
NASA Astrophysics Data System (ADS)
Koziel, M.; Dorokhov, A.; Fontaine, J.-C.; De Masi, R.; Winter, M.
2010-12-01
Monolithic active pixel sensors (MAPS) [1] (Turchetta et al., 2001) are being developed at IPHC-Strasbourg to equip the EUDET telescope [2] (Haas, 2006) and vertex detectors for future high energy physics experiments, including the STAR upgrade at RHIC [3] (T.S. Collaboration, 2005) and the CBM experiment at FAIR/GSI [4] (Heuser, 2006). High granularity, low material budget and high read-out speed are systematically required for most applications, complemented, for some of them, with high radiation tolerance. A specific column-parallel architecture, implemented in the MIMOSA-22 sensor, was developed to achieve fast read-out in MAPS. Previous studies of the front-end architecture integrated in this sensor, which includes in-pixel amplification, have shown that the fixed-pattern noise increase that follows ionizing radiation can be controlled by means of a negative feedback [5] (Hu-Guo et al., 2008). However, an unexpected rise of the temporal noise was observed. A second version of this chip (MIMOSA-22bis) was produced in order to search for possible improvements of the radiation tolerance with respect to this type of noise. In this prototype, the feedback transistor was tuned in order to mitigate the sensitivity of the pixel to ionizing radiation. The performance of the pixels after irradiation was investigated for two types of feedback transistors: an enclosed layout transistor (ELT) [6] (Snoeys et al., 2000) and a "standard" transistor with either large or small transconductance. The noise performance of all test structures was studied under various conditions (expected in future experiments) of temperature, integration time and ionizing radiation dose. Test results are presented in this paper. Based on these observations, ideas for further improvement of the radiation tolerance of column-parallel MAPS are derived.
Parallel digital forensics infrastructure.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liebrock, Lorie M.; Duggan, David Patrick
2009-10-01
This report documents the architecture and implementation of a Parallel Digital Forensics (PDF) infrastructure. This infrastructure is necessary for supporting the design, implementation, and testing of new classes of parallel digital forensics tools. Digital forensics has become extremely difficult with data sets of one terabyte and larger. The only way to overcome the processing time of these large sets is to identify and develop new parallel algorithms for performing the analysis. To support algorithm research, a flexible base infrastructure is required. A candidate architecture for this base infrastructure was designed, instantiated, and tested by this project, in collaboration with New Mexico Tech. Previous infrastructures were not designed and built specifically for the development and testing of parallel algorithms. With the size of forensics data sets expected to increase significantly, this type of infrastructure support is necessary for continued research in parallel digital forensics.
Introducing parallelism to histogramming functions for GEM systems
NASA Astrophysics Data System (ADS)
Krawczyk, Rafał D.; Czarski, Tomasz; Kolasinski, Piotr; Pozniak, Krzysztof T.; Linczuk, Maciej; Byszuk, Adrian; Chernyshova, Maryna; Juszczyk, Bartlomiej; Kasprowicz, Grzegorz; Wojenski, Andrzej; Zabolotny, Wojciech
2015-09-01
This article assesses the potential parallelization of histogramming algorithms in a GEM detector system. Histogramming and preprocessing algorithms written in MATLAB were analyzed with regard to adding parallelism. A preliminary implementation of parallel strip histogramming resulted in a speedup. An analysis of the algorithms' parallelizability is presented, and an overview of potential hardware and software support for implementing the parallel algorithms is discussed.
On-line range images registration with GPGPU
NASA Astrophysics Data System (ADS)
Będkowski, J.; Naruniec, J.
2013-03-01
This paper concerns the implementation of algorithms for two important aspects of modern 3D data processing: data registration and segmentation. The solution proposed for the first task is based on 3D space decomposition, while the latter relies on image processing and local neighbourhood search. Data processing is implemented using NVIDIA compute unified device architecture (NVIDIA CUDA) parallel computation. The result of the segmentation is a coloured map where different colours correspond to different objects, such as walls, floor and stairs. The research is related to the problem of collecting 3D data with an RGB-D camera mounted on a rotating head, to be used in mobile robot applications. The data registration algorithm is aimed at on-line processing. The iterative closest point (ICP) approach is chosen as the registration method. Computations are based on a parallel fast nearest-neighbour search: this procedure decomposes 3D space into cubic buckets, and therefore the matching time is deterministic. The first segmentation technique uses accelerometers integrated with the RGB-D sensor for rotation compensation and an image processing method for defining prerequisites of the known categories. The second technique uses the adapted nearest-neighbour search procedure to obtain normal vectors for each range point.
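A minimal sketch of the cubic-bucket nearest-neighbour idea follows, assuming a query never needs to look beyond the 27 buckets surrounding its own (i.e., the bucket side is chosen no smaller than the matching radius). The NumPy/CPU setting and names are illustrative; the paper's version runs as a CUDA kernel with one thread per query point.

```python
import numpy as np
from collections import defaultdict

def build_buckets(points, cell):
    """Hash 3D points into cubic buckets of side `cell`."""
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[tuple(np.floor(p / cell).astype(int))].append(idx)
    return buckets

def nearest_neighbor(q, points, buckets, cell):
    """Deterministic-time query: scan only the 27 buckets around q."""
    ci, cj, ck = np.floor(q / cell).astype(int)
    best, best_d2 = -1, np.inf
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for dk in (-1, 0, 1):
                for idx in buckets.get((ci + di, cj + dj, ck + dk), ()):
                    d2 = float(np.sum((points[idx] - q) ** 2))
                    if d2 < best_d2:
                        best, best_d2 = idx, d2
    return best  # -1 if nothing lies within the surrounding buckets
```

Because each query touches a fixed number of buckets, every ICP matching pass has a predictable cost, which is what makes the matching time deterministic.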
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.
2011-01-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
SDS: A Framework for Scientific Data Services
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dong, Bin; Byna, Surendra; Wu, Kesheng
2013-10-31
Large-scale scientific applications typically write their data to parallel file systems with organizations designed to achieve fast write speeds. Analysis tasks frequently read the data in a pattern that is different from the write pattern, and therefore experience poor I/O performance. In this paper, we introduce a prototype framework for bridging the performance gap between write and read stages of data access from parallel file systems. We call this framework Scientific Data Services, or SDS for short. This initial implementation of SDS focuses on reorganizing previously written files into data layouts that benefit read patterns, and transparently directs read callsmore » to the reorganized data. SDS follows a client-server architecture. The SDS Server manages partial or full replicas of reorganized datasets and serves SDS Clients' requests for data. The current version of the SDS client library supports HDF5 programming interface for reading data. The client library intercepts HDF5 calls and transparently redirects them to the reorganized data. The SDS client library also provides a querying interface for reading part of the data based on user-specified selective criteria. We describe the design and implementation of the SDS client-server architecture, and evaluate the response time of the SDS Server and the performance benefits of SDS.« less
Parallel Implementation of the Discontinuous Galerkin Method
NASA Technical Reports Server (NTRS)
Baggag, Abdalkader; Atkins, Harold; Keyes, David
1999-01-01
This paper describes a parallel implementation of the discontinuous Galerkin method. Discontinuous Galerkin is a spatially compact method that retains its accuracy and robustness on non-smooth unstructured grids and is well suited for time dependent simulations. Several parallelization approaches are studied and evaluated. The most natural and symmetric of the approaches has been implemented in an object-oriented code used to simulate aeroacoustic scattering. The parallel implementation is MPI-based and has been tested on various parallel platforms such as the SGI Origin, IBM SP2, and clusters of SGI and Sun workstations. The scalability results presented for the SGI Origin show slightly superlinear speedup on a fixed-size problem due to cache effects.
Quartic scaling MP2 for solids: A highly parallelized algorithm in the plane wave basis
NASA Astrophysics Data System (ADS)
Schäfer, Tobias; Ramberger, Benjamin; Kresse, Georg
2017-03-01
We present a low-complexity algorithm to calculate the correlation energy of periodic systems in second-order Møller-Plesset (MP2) perturbation theory. In contrast to previous approximation-free MP2 codes, our implementation possesses a quartic scaling, O(N⁴), with respect to the system size N and offers an almost ideal parallelization efficiency. The general issue that the correlation energy converges slowly with the number of basis functions is eased by an internal basis set extrapolation. The key concept to reduce the scaling is to eliminate all summations over virtual orbitals which can be elegantly achieved in the Laplace transformed MP2 formulation using plane wave basis sets and fast Fourier transforms. Analogously, this approach could allow us to calculate second order screened exchange as well as particle-hole ladder diagrams with a similar low complexity. Hence, the presented method can be considered as a step towards systematically improved correlation energies.
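The decoupling rests on a standard Laplace-transform identity for the orbital-energy denominator. The sketch below (direct-like term only, conventional canonical-orbital notation) shows why the summations over virtual orbitals factorize; the quadrature weights w_q are part of the illustration, not the paper's actual grid.

```latex
% MP2 energy (direct-like term), i,j occupied and a,b virtual orbitals:
E^{(2)} = -\sum_{ijab}
  \frac{|\langle ij\,|\,ab\rangle|^{2}}
       {\varepsilon_a+\varepsilon_b-\varepsilon_i-\varepsilon_j},
\qquad
\frac{1}{\Delta}=\int_{0}^{\infty} e^{-\Delta\tau}\,d\tau \quad (\Delta>0).

% Inserting the identity turns the shared denominator into a product of
% one-index exponential weights, so occupied and virtual sums decouple
% and can be evaluated at a few quadrature points \tau_q with weights w_q:
E^{(2)} \approx -\sum_{q} w_q \sum_{ijab}
  |\langle ij\,|\,ab\rangle|^{2}\,
  e^{\varepsilon_i\tau_q}\, e^{\varepsilon_j\tau_q}\,
  e^{-\varepsilon_a\tau_q}\, e^{-\varepsilon_b\tau_q}.
```

Once the denominator is factorized, each tau point reduces to convolution-like sums over orbital products, which is what the plane-wave basis and FFTs evaluate efficiently.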
Multiplexed Oversampling Digitizer in 65 nm CMOS for Column-Parallel CCD Readout
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grace, Carl; Walder, Jean-Pierre; von der Lippe, Henrik
2012-04-10
A digitizer designed to read out column-parallel charge-coupled devices (CCDs) used for high-speed X-ray imaging is presented. The digitizer is included as part of the High-Speed Image Preprocessor with Oversampling (HIPPO) integrated circuit. The digitizer module comprises a multiplexed, oversampling, 12-bit, 80 MS/s pipelined Analog-to-Digital Converter (ADC) and a bank of four fast-settling sample-and-hold amplifiers to instrument four analog channels. The ADC multiplexes and oversamples to reduce its area to allow integration that is pitch-matched to the columns of the CCD. Novel design techniques are used to enable oversampling and multiplexing with a reduced power penalty. The ADC exhibits 188 µV rms noise, which is less than 1 LSB at the 12-bit level. The prototype is implemented in a commercially available 65 nm CMOS process. The digitizer will lead to a proof-of-principle 2D 10 Gigapixel/s X-ray detector.
Xia, Yidong; Luo, Hong; Frisbey, Megan; ...
2014-07-01
A set of implicit methods is proposed for a third-order hierarchical WENO reconstructed discontinuous Galerkin method for compressible flows on 3D hybrid grids. An attractive feature of these methods is the application of the Jacobian matrix based on the P1 element approximation, resulting in a huge reduction of memory requirements compared with DG(P2). Three approaches, namely analytical derivation, divided differencing, and automatic differentiation (AD), are presented to construct the Jacobian matrix, of which the AD approach shows the best robustness. A variety of compressible flow problems are computed to demonstrate the fast convergence property of the implemented flow solver. Furthermore, an SPMD (single program, multiple data) programming paradigm based on MPI is proposed to achieve parallelism. The numerical results on complex geometries indicate that this low-storage implicit method can provide a viable and attractive DG solution for complicated flows of practical importance.
An Automated Parallel Image Registration Technique Based on the Correlation of Wavelet Features
NASA Technical Reports Server (NTRS)
LeMoigne, Jacqueline; Campbell, William J.; Cromp, Robert F.; Zukor, Dorothy (Technical Monitor)
2001-01-01
With the increasing importance of multiple platform/multiple remote sensing missions, fast and automatic integration of digital data from disparate sources has become critical to the success of these endeavors. Our work utilizes maxima of wavelet coefficients to form the basic features of a correlation-based automatic registration algorithm. Our wavelet-based registration algorithm is tested successfully with data from the National Oceanic and Atmospheric Administration (NOAA) Advanced Very High Resolution Radiometer (AVHRR) and the Landsat/Thematic Mapper (TM), which differ by translation and/or rotation. By the choice of high-frequency wavelet features, this method is similar to an edge-based correlation method, but by exploiting the multi-resolution nature of a wavelet decomposition, our method achieves higher computational speeds for comparable accuracies. This algorithm has been implemented on a Single Instruction Multiple Data (SIMD) massively parallel computer, the MasPar MP-2, as well as on the Cray T3D, the Cray T3E and a Beowulf cluster of Pentium workstations.
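The core idea, correlating only strong high-frequency features to recover a translation, can be sketched in a few lines of Python. A crude high-pass filter stands in for the wavelet-coefficient maxima of the paper, and the shift is read off the FFT cross-correlation peak; the filter size and the 5% feature fraction are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def estimate_shift(ref, tgt, keep=0.05):
    def features(im):
        im = np.asarray(im, float)
        hp = im - ndimage.uniform_filter(im, size=5)  # high-pass residual
        thr = np.quantile(np.abs(hp), 1.0 - keep)
        return np.where(np.abs(hp) >= thr, hp, 0.0)   # strongest features only

    # FFT-based cross-correlation of the sparse feature images
    F = np.fft.fft2(features(ref))
    G = np.fft.fft2(features(tgt))
    corr = np.fft.ifft2(F * np.conj(G)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    ny, nx = corr.shape                                # unwrap circular shifts
    return (dy if dy <= ny // 2 else dy - ny,
            dx if dx <= nx // 2 else dx - nx)
```

In a multi-resolution version, the same correlation is run on coarse wavelet levels first, and the coarse estimate narrows the search window at finer levels, which is where the speedup over plain edge correlation comes from.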
Random-subset fitting of digital holograms for fast three-dimensional particle tracking [invited].
Dimiduk, Thomas G; Perry, Rebecca W; Fung, Jerome; Manoharan, Vinothan N
2014-09-20
Fitting scattering solutions to time series of digital holograms is a precise way to measure three-dimensional dynamics of microscale objects such as colloidal particles. However, this inverse-problem approach is computationally expensive. We show that the computational time can be reduced by an order of magnitude or more by fitting to a random subset of the pixels in a hologram. We demonstrate our algorithm on experimentally measured holograms of micrometer-scale colloidal particles, and we show that 20-fold increases in speed, relative to fitting full frames, can be attained while introducing errors in the particle positions of 10 nm or less. The method is straightforward to implement and works for any scattering model. It also enables a parallelization strategy wherein random-subset fitting is used to quickly determine initial guesses that are subsequently used to fit full frames in parallel. This approach may prove particularly useful for studying rare events, such as nucleation, that can only be captured with high frame rates over long times.
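A minimal sketch of the random-subset idea, assuming a generic pixel-wise forward model (the paper fits full scattering solutions; `model`, `p0`, and the 5% subset fraction here are placeholders):

```python
import numpy as np
from scipy.optimize import least_squares

def fit_hologram_subset(hologram, model, p0, frac=0.05, seed=0):
    """Fit `model(params, ys, xs)` to a random subset of hologram pixels."""
    rng = np.random.default_rng(seed)
    ny, nx = hologram.shape
    npix = int(frac * ny * nx)
    ys = rng.integers(0, ny, npix)          # random pixel coordinates
    xs = rng.integers(0, nx, npix)
    data = hologram[ys, xs]

    def residuals(p):
        return model(p, ys, xs) - data      # evaluated on the subset only

    return least_squares(residuals, p0).x
```

The parallelization strategy mentioned in the abstract then uses such cheap subset fits as initial guesses for full-frame fits performed in parallel.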
A solution to neural field equations by a recurrent neural network method
NASA Astrophysics Data System (ADS)
Alharbi, Abir
2012-09-01
Neural field equations (NFE) are used to model the activity of neurons in the brain; they are derived starting from a single-neuron 'integrate-and-fire' model. The neural continuum is spatially discretized for numerical studies, and the governing equations are modeled as a system of ordinary differential equations. In this article a recurrent neural network approach is used to solve this system of ODEs: a technique developed by combining the standard numerical method of finite differences with the Hopfield neural network. The architecture of the net, the energy function, the updating equations, and the algorithms are developed for the NFE model. A Hopfield neural network is then designed to minimize the energy function modeling the NFE. Results obtained from the Hopfield-finite-differences net show excellent performance in terms of accuracy and speed. The parallel nature of the Hopfield approach may make it easier to implement on fast parallel computers, giving it a speed advantage over traditional methods.
NASA Astrophysics Data System (ADS)
Laracuente, Nicholas; Grossman, Carl
2013-03-01
We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have enabled general-purpose computing with inexpensive GPU hardware. These devices are better suited to emulating hardware autocorrelators than traditional CPU-based software applications because they emphasize parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with a configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory access. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics PCI card in a computer with a 3.2 GHz Intel i5 CPU running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College.
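For reference, a plain-CPU sketch of the multi-tau binning scheme follows; the stage count, points per stage, and the simple normalization are illustrative assumptions, and the described system implements the same structure in GPU memory.

```python
import numpy as np

def multi_tau_autocorr(counts, m=16, stages=8):
    """Quasi-logarithmic autocorrelation of photon counts (sketch).

    Each stage evaluates m lags (m/2 new ones after the first), then the
    signal is rebinned by a factor of 2, doubling the lag spacing.
    """
    x = counts.astype(float)
    dt = 1
    taus, g2 = [], []
    for s in range(stages):
        if len(x) <= m:                # signal too short to continue
            break
        mean = x.mean()
        lags = range(1, m + 1) if s == 0 else range(m // 2 + 1, m + 1)
        for k in lags:
            taus.append(k * dt)
            g2.append(np.mean(x[:-k] * x[k:]) / mean ** 2)  # normalized g2
        x = x[: len(x) // 2 * 2].reshape(-1, 2).mean(axis=1)  # rebin by 2
        dt *= 2
    return np.array(taus), np.array(g2)
```

The GPU version's advantage is that all bins of all stages update concurrently as photons arrive, emulating a hardware correlator rather than a sequential software loop.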
A parallel computer implementation of fast low-rank QR approximation of the Biot-Savart law
DOE Office of Scientific and Technical Information (OSTI.GOV)
White, D A; Fasenfest, B J; Stowell, M L
2005-11-07
In this paper we present a low-rank QR method for evaluating the discrete Biot-Savart law on parallel computers. It is assumed that the known current density and the unknown magnetic field are both expressed in a finite element expansion, and we wish to compute the degrees-of-freedom (DOF) in the basis function expansion of the magnetic field. The matrix that maps the current DOF to the field DOF is full, but if the spatial domain is properly partitioned the matrix can be written as a block matrix, with blocks representing distant interactions being low rank and having a compressed QR representation. The matrix partitioning is determined by the number of processors; the rank of each block (i.e., the compression) is determined by the specific geometry and is computed dynamically. In this paper we provide the algorithmic details and present computational results for large-scale computations.
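A small NumPy/SciPy sketch of the block compression step follows, using column-pivoted QR with a rank chosen dynamically from the diagonal of R. The tolerance and the dense starting block are illustrative assumptions; the paper builds such compressed representations for the far-field blocks of the Biot-Savart matrix.

```python
import numpy as np
from scipy.linalg import qr

def compress_block(A, tol=1e-6):
    """Low-rank QR factorization of a far-interaction block: A ~= Qk @ Rk."""
    Q, R, piv = qr(A, mode='economic', pivoting=True)   # A[:, piv] = Q @ R
    d = np.abs(np.diag(R))
    k = int(np.sum(d > tol * d[0]))        # dynamic rank from R's diagonal
    inv = np.empty_like(piv)
    inv[piv] = np.arange(len(piv))         # undo the column pivoting
    return Q[:, :k], R[:k, :][:, inv]      # storage drops from m*n to (m+n)*k

def apply_block(Qk, Rk, x):
    """Field-DOF contribution of this block, in O((m+n)k) operations."""
    return Qk @ (Rk @ x)
```

Nearby (full-rank) blocks are applied densely; only the distant blocks, whose numerical rank is small, pay off under this compression.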
Implementation and performance of parallel Prolog interpreter
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wei, S.; Kale, L.V.; Balkrishna, R.
1988-01-01
In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE-OR process model, which exploits both AND and OR parallelism in logic programs. It is machine independent, as it runs on top of the chare kernel, a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark programs on parallel machines including shared-memory systems (an Alliant FX/8, a Sequent, and a MultiMax) and a non-shared-memory system (the Intel iPSC/32 hypercube), in addition to its performance on a multiprocessor simulation system.
SciSpark: In-Memory Map-Reduce for Earth Science Algorithms
NASA Astrophysics Data System (ADS)
Ramirez, P.; Wilson, B. D.; Whitehall, K. D.; Palamuttam, R. S.; Mattmann, C. A.; Shah, S.; Goodman, A.; Burke, W.
2016-12-01
We are developing a lightning-fast Big Data technology called SciSpark based on Apache™ Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based Apache Hadoop by 100x in memory and by 10x on disk. SciSpark extends Spark to support Earth science use in three ways: efficient ingest of N-dimensional geo-located arrays (physical variables) from netCDF3/4, HDF4/5, and/or OPeNDAP URLs; array operations for dense arrays in Scala and Java using the ND4S/ND4J or Breeze libraries; and operations to "split" datasets across a Spark cluster by time, space, or both. For example, a decade-long time series of geo-variables can be split across time to enable parallel "speedups" of analysis by day, month, or season. Similarly, very high-resolution climate grids can be partitioned into spatial tiles for parallel operations across rows, columns, or blocks. In addition, using Spark's gateway into Python, PySpark, one can utilize the entire ecosystem of numpy, scipy, etc. Finally, SciSpark Notebooks provide a modern eNotebook technology in which Scala, Python, or Spark SQL code is entered into cells in the Notebook and executed on the cluster, with results, plots, or graph visualizations displayed in "live widgets". We have exercised SciSpark by implementing three complex use cases: discovery and evolution of Mesoscale Convective Complexes (MCCs) in storms, yielding a graph of connected components; PDF clustering of atmospheric state using parallel K-Means; and statistical "rollups" of geo-variables or model-to-obs differences (i.e., mean, stddev, skewness, and kurtosis) by day, month, season, year, and multi-year. Geo-variables are ingested and split across the cluster using methods on the sciSparkContext object, including netCDFVariables() for spatial decomposition and wholeNetCDFVariables() for time series. The presentation will cover the architecture of SciSpark, the design of the scientific RDD (sRDD) data structures for N-dimensional arrays, results from the three science use cases, example Notebooks, lessons learned from the algorithm implementations, and parallel performance metrics.
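As a flavor of the "split by time" pattern, here is a hedged plain-PySpark sketch of a monthly-mean rollup. The synthetic `grids` input, key layout, and statistic are illustrative stand-ins; SciSpark's own sRDD readers (e.g. the netCDFVariables() method named above) are not reproduced.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="rollup-sketch")

# ((year, month), daily 2D grid) pairs -- synthetic stand-in for real ingest
grids = [((2016, m), np.random.rand(90, 120)) for m in (1, 2) for _ in range(30)]
daily = sc.parallelize(grids)

monthly_mean = (daily
    .mapValues(lambda g: (g, 1))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # sum grids, count days
    .mapValues(lambda s: s[0] / s[1]))                     # mean grid per month

for ym, grid in monthly_mean.collect():
    print(ym, float(np.nanmean(grid)))
```

The same shape of pipeline, keyed by tile indices instead of (year, month), gives the spatial split described for high-resolution climate grids.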
The PMS project: Poor man's supercomputer
NASA Astrophysics Data System (ADS)
Csikor, F.; Fodor, Z.; Hegedüs, P.; Horváth, V. K.; Katz, S. D.; Piróth, A.
2001-02-01
We briefly describe the Poor Man's Supercomputer (PMS) project carried out at Eötvös University, Budapest. The goal was to construct a cost-effective, scalable, fast parallel computer to perform numerical calculations for physical problems that can be implemented on a lattice with nearest-neighbour interactions. To this end we developed the PMS architecture using PC components and designed special, low-cost communication hardware and the driver software for Linux OS. Our first implementation of PMS includes 32 nodes (PMS1). The performance of PMS1 was tested by Lattice Gauge Theory simulations. Using pure SU(3) gauge theory or the bosonic part of the minimal supersymmetric extension of the standard model (MSSM) on PMS1, we obtained price-to-sustained-performance ratios of $3/Mflops and $0.60/Mflops for double- and single-precision operations, respectively. The design of the special hardware and the communication driver are freely available upon request for non-profit organizations.
The Level 0 Pixel Trigger system for the ALICE experiment
NASA Astrophysics Data System (ADS)
Aglieri Rinella, G.; Kluge, A.; Krivda, M.; ALICE Silicon Pixel Detector project
2007-01-01
The ALICE Silicon Pixel Detector contains 1200 readout chips. Fast-OR signals indicate the presence of at least one hit in the 8192 pixel matrix of each chip. The 1200 bits are transmitted every 100 ns on 120 data readout optical links using the G-Link protocol. The Pixel Trigger System extracts and processes them to deliver an input signal to the Level 0 trigger processor targeting a latency of 800 ns. The system is compact, modular and based on FPGA devices. The architecture allows the user to define and implement various trigger algorithms. The system uses advanced 12-channel parallel optical fiber modules operating at 1310 nm as optical receivers and 12 deserializer chips closely packed in small area receiver boards. Alternative solutions with multi-channel G-Link deserializers implemented directly in programmable hardware devices were investigated. The design of the system and the progress of the ALICE Pixel Trigger project are described in this paper.
NASA Astrophysics Data System (ADS)
Papior, Nick; Lorente, Nicolás; Frederiksen, Thomas; García, Alberto; Brandbyge, Mads
2017-03-01
We present novel methods implemented within the non-equilibrium Green function (NEGF) code TRANSIESTA, based on density functional theory (DFT). Our flexible, next-generation DFT-NEGF code handles devices with one or multiple electrodes (Ne ≥ 1) with individual chemical potentials and electronic temperatures. We describe its novel methods for electrostatic gating, contour optimizations, and assertion of charge conservation, as well as the newly implemented algorithms for optimized and scalable matrix inversion, performance-critical pivoting, and hybrid parallelization. Additionally, a generic NEGF "post-processing" code (TBTRANS/PHTRANS) for electron and phonon transport is presented with several novelties such as Hamiltonian interpolations, Ne ≥ 1 electrode capability, bond currents, a generalized interface for user-defined tight-binding transport, transmission projection using eigenstates of a projected Hamiltonian, and fast inversion algorithms for large-scale simulations easily exceeding 10⁶ atoms on workstation computers. The new features of both codes are demonstrated and benchmarked for relevant test systems.
Increased Energy Delivery for Parallel Battery Packs with No Regulated Bus
NASA Astrophysics Data System (ADS)
Hsu, Chung-Ti
In this dissertation, a new approach to paralleling different battery types is presented. A method for controlling the charging/discharging of different battery packs by using low-cost bi-directional switches instead of DC-DC converters is proposed. The proposed system architecture, algorithms, and control techniques allow batteries with different chemistry, voltage, and SOC to be properly charged and discharged in parallel without causing safety problems. The physical design and cost of the energy management system are substantially reduced. Additionally, specific types of failures in the maximum power point tracking (MPPT) in a photovoltaic (PV) system when tracking only the load current of a DC-DC converter are analyzed. A periodic nonlinear load current will make MPPT realized by the conventional perturb and observe (P&O) algorithm problematic. A modified MPPT algorithm is proposed that still requires only typically measured signals, yet is suitable for both linear and periodic nonlinear loads. Moreover, for a modular DC-DC converter using several converters in parallel, the input power from PV panels is processed and distributed at the module level. Methods for properly implementing distributed MPPT are studied. A new approach to efficient MPPT under partial shading conditions is presented. The power stage architecture achieves a fast input-current change rate by combining a current-adjustable converter with a few converters operating at a constant current.
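For context, the conventional P&O step that the dissertation analyzes and modifies can be sketched in a few lines; the step size, state layout, and names are illustrative assumptions.

```python
def perturb_and_observe(v_meas, i_meas, state, dv=0.5):
    """One P&O hill-climbing step toward the maximum power point (sketch).

    Keeps perturbing the voltage reference in the direction that last
    increased measured power. Under a periodic nonlinear load the p = v*i
    sample is corrupted, which is the failure mode analyzed above.
    """
    p = v_meas * i_meas
    if p < state['p_prev']:
        state['dir'] = -state['dir']      # power fell: reverse direction
    state['p_prev'] = p
    return v_meas + state['dir'] * dv     # next voltage reference

# usage: state = {'p_prev': 0.0, 'dir': +1}
# v_ref = perturb_and_observe(v, i, state)
```

Because the decision hinges on comparing two consecutive power samples, any periodic ripple in the load current can flip the direction spuriously, which is why the modified algorithm filters or re-times its measurements.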
Soto-Quiros, Pablo
2015-01-01
This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool to analyze the spectra of vector-valued discrete-time signals. This parallel implementation is developed in terms of a mathematical framework based on a set of block matrix operations. These block matrix operations contribute to the analysis, design, and implementation of parallel algorithms on multicore processors. In this work, an implementation and experimental investigation of the mathematical framework are performed using MATLAB with the Parallel Computing Toolbox. We found that there is an advantage to using multicore processors and a parallel computing environment to reduce the high execution time. Additionally, the speedup increases as the number of logical processors and the length of the signal increase.
Parallel MR imaging: a user's guide.
Glockner, James F; Hu, Houchun H; Stanley, David W; Angelos, Lisa; King, Kevin
2005-01-01
Parallel imaging is a recently developed family of techniques that take advantage of the spatial information inherent in phased-array radiofrequency coils to reduce acquisition times in magnetic resonance imaging. In parallel imaging, the number of sampled k-space lines is reduced, often by a factor of two or greater, thereby significantly shortening the acquisition time. Parallel imaging techniques have only recently become commercially available, and the wide range of clinical applications is just beginning to be explored. The potential clinical applications primarily involve reduction in acquisition time, improved spatial resolution, or a combination of the two. Improvements in image quality can be achieved by reducing the echo train lengths of fast spin-echo and single-shot fast spin-echo sequences. Parallel imaging is particularly attractive for cardiac and vascular applications and will likely prove valuable as 3-T body and cardiovascular imaging becomes part of standard clinical practice. Limitations of parallel imaging include reduced signal-to-noise ratio and reconstruction artifacts. It is important to consider these limitations when deciding when to use these techniques. (c) RSNA, 2005.
A Robust and Scalable Software Library for Parallel Adaptive Refinement on Unstructured Meshes
NASA Technical Reports Server (NTRS)
Lou, John Z.; Norton, Charles D.; Cwik, Thomas A.
1999-01-01
The design and implementation of Pyramid, a software library for performing parallel adaptive mesh refinement (PAMR) on unstructured meshes, is described. This software library can be easily used in a variety of unstructured parallel computational applications, including parallel finite element, parallel finite volume, and parallel visualization applications using triangular or tetrahedral meshes. The library contains a suite of well-designed and efficiently implemented modules that perform operations in a typical PAMR process. Among these are mesh quality control during successive parallel adaptive refinement (typically guided by a local-error estimator), parallel load-balancing, and parallel mesh partitioning using the ParMeTiS partitioner. The Pyramid library is implemented in Fortran 90 with an interface to the Message-Passing Interface (MPI) library, supporting code efficiency, modularity, and portability. An EM waveguide filter application, adaptively refined using the Pyramid library, is illustrated.
FastID: Extremely Fast Forensic DNA Comparisons
2017-05-19
Ricke, Darrell O.; Bioengineering Systems & Technologies, Massachusetts Institute of Technology Lincoln Laboratory, Lexington, MA, USA (Darrell.Ricke@ll.mit.edu). Abstract: Rapid analysis of DNA forensic samples can have a critical impact on time-sensitive investigations. Analysis of forensic DNA samples by massively parallel sequencing is creating the next gold standard for DNA
Mujika, Iñigo; Chaouachi, Anis; Chamari, Karim
2010-06-01
A marked reduction in the training load in the lead-up to major competitions allows athletes to reduce the fatigue induced by intense training and improve competition performance. This tapered training phase is based on the reduction in training volume while maintaining pretaper training intensity and frequency. In parallel to training load reductions, nutritional strategies characterised by lowered energy intakes need to be implemented to match lowered energy expenditure. The Ramadan intermittent fast imposes constrained nutritional practices on Muslim athletes, inducing a shift to a greater reliance on fat oxidation to meet energy needs and a possible increase in protein breakdown. The training load is often reduced during Ramadan to match the absence of energy and fluid intake during daylight, which implies a risk of losing training induced adaptations. Should coaches and athletes decide to reduce the training load during Ramadan, the key role of training intensity in retaining training induced adaptations should be kept in mind. However, experienced elite Muslim athletes are able to maintain their usual training load during this month of intermittent fasting without decrements in measures of fitness and with only minor adverse effects.
Performance Improvements of the CYCOFOS Flow Model
NASA Astrophysics Data System (ADS)
Radhakrishnan, Hari; Moulitsas, Irene; Syrakos, Alexandros; Zodiatis, George; Nikolaides, Andreas; Hayes, Daniel; Georgiou, Georgios C.
2013-04-01
The CYCOFOS-Cyprus Coastal Ocean Forecasting and Observing System has been operational since early 2002, providing daily sea current, temperature, salinity, and sea level forecasts for the next 4 and 10 days to end-users in the Levantine Basin; these are necessary for operational applications in marine safety, particularly oil spill and floating object predictions. The CYCOFOS flow model, like most of the coastal and sub-regional operational hydrodynamic forecasting systems of MONGOOS (the Mediterranean Oceanographic Network for the Global Ocean Observing System), is based on POM, the Princeton Ocean Model. CYCOFOS is nested within the MyOcean Mediterranean regional forecasting data and uses SKIRON and ECMWF for surface forcing. The increasing demand for ever-higher-resolution data to serve coastal and offshore downstream applications motivated the parallelization of the CYCOFOS POM model, carried out within the framework of the IPcycofos project, funded by the Cyprus Research Promotion Foundation. Parallel processing provides a viable way to satisfy these demands without sacrificing accuracy or omitting any physical phenomena. Prior to the IPcycofos project there had been several attempts to parallelize POM, for example MP-POM, but those existing parallel codes rely on specific, outdated hardware architectures and associated software. The objective of the IPcycofos project is to produce an operational parallel version of the CYCOFOS POM code that replicates the results of the serial version used in CYCOFOS. The parallelization uses the Message Passing Interface (MPI), implemented on commodity computing clusters running open-source software and not depending on any specialized vendor hardware. The parallel CYCOFOS POM code is constructed in a modular fashion, allowing fast, relocatable, downscaled implementations. The implementation takes advantage of the Cartesian nature of the POM mesh and uses built-in MPI routines to split the mesh along longitude and latitude among the processors, using a weighting scheme; each processor then works on its own subdomain, following standard domain decomposition techniques. The new parallel CYCOFOS POM code has been benchmarked against the serial POM version of CYCOFOS for speed, accuracy, and resolution, and the results are more than satisfactory. With a higher-resolution CYCOFOS Levantine model domain, the parallel forecasts need much less time than the coarser serial CYCOFOS POM version, with identical accuracy.
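The longitude/latitude split described above maps naturally onto MPI's Cartesian topology routines. The following mpi4py sketch (standing in for POM's Fortran MPI calls; the global mesh size, one-cell halo, and even split are illustrative assumptions) shows the decomposition and one half of a halo exchange:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 2)          # ranks along lon/lat
cart = comm.Create_cart(dims, periods=[False, False], reorder=True)

NX, NY = 480, 360                                    # global mesh (assumed)
nx, ny = NX // dims[0], NY // dims[1]                # local subdomain (even split)
field = np.zeros((nx + 2, ny + 2))                   # +2: one-cell halo

west, east = cart.Shift(0, 1)                        # neighbor ranks, or
south, north = cart.Shift(1, 1)                      # MPI.PROC_NULL at edges

# send first interior row westward, receive the east neighbor's row into
# our east halo; the opposite direction and the latitude axis are analogous,
# and the full exchange repeats every model time step
cart.Sendrecv(np.ascontiguousarray(field[1, :]), dest=west,
              recvbuf=field[-1, :], source=east)
```

A weighting scheme like the one mentioned in the abstract would simply make nx and ny nonuniform across ranks so that land-heavy subdomains carry less work.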
New NAS Parallel Benchmarks Results
NASA Technical Reports Server (NTRS)
Yarrow, Maurice; Saphir, William; VanderWijngaart, Rob; Woo, Alex; Kutler, Paul (Technical Monitor)
1997-01-01
NPB2 (NAS (NASA Advanced Supercomputing) Parallel Benchmarks 2) is an implementation, based on Fortran and the MPI (message passing interface) message passing standard, of the original NAS Parallel Benchmark specifications. NPB2 programs are run with little or no tuning, in contrast to NPB vendor implementations, which are highly optimized for specific architectures. NPB2 results complement, rather than replace, NPB results. Because they have not been optimized by vendors, NPB2 implementations approximate the performance a typical user can expect for a portable parallel program on distributed memory parallel computers. Together these results provide an insightful comparison of the real-world performance of high-performance computers. New NPB2 features: New implementation (CG), new workstation class problem sizes, new serial sample versions, more performance statistics.
Accelerated Profile HMM Searches
Eddy, Sean R.
2011-01-01
Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches. PMID:22039361
Integrated Service Provisioning in an Ipv6 over ATM Research Network
DOE Office of Scientific and Technical Information (OSTI.GOV)
Eli Dart; Helen Chen; Jerry Friesen
1999-02-01
During the past few years, the worldwide Internet has grown at a phenomenal rate, which has spurred the proposal of innovative network technologies to support the fast, efficient and low-latency transport of a wide spectrum of multimedia traffic types. Existing network infrastructures have been plagued by their inability to provide for real-time application traffic as well as by their general lack of resources and resilience to congestion. This work proposes to address these issues by implementing a prototype high-speed network infrastructure consisting of Internet Protocol Version 6 (IPv6) on top of an Asynchronous Transfer Mode (ATM) transport medium. Since ATM is connection-oriented whereas IP uses a connection-less paradigm, the efficient integration of IPv6 over ATM is especially challenging and has generated much interest in the research community. We propose, in collaboration with an industry partner, to implement IPv6 over ATM using a unique approach that integrates IP over fast ATM hardware while still preserving IP's connection-less paradigm. This is achieved by replacing ATM's control software with IP's routing code and by caching IP's forwarding decisions in ATM's VPI/VCI translation tables. Prototype "VR" and distributed parallel-computing applications will also be developed to exercise the real-time capability of our IPv6 over ATM network.
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy.
da Silva, Joakim; Ansorge, Richard; Jena, Rajesh
2015-06-21
Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.
NASA Astrophysics Data System (ADS)
Lhamon, Michael Earl
A pattern recognition system that uses banks of complex correlation filters requires proportionally more computational effort than one using single real-valued filters. This increases the computational burden but also introduces a higher level of parallelism, which common computing platforms fail to exploit. As a result, we consider algorithm mapping to both optical and digital processors. For digital implementation, we develop computationally efficient pattern recognition algorithms, referred to as vector inner product operators, that require less computational effort than traditional fast Fourier methods. These algorithms do not need correlation, and they map readily onto parallel digital architectures, which suggests new architectures for optical processors. These filters exploit the circulant-symmetric matrix structure of the training set data representing a variety of distortions. By using the same mathematical basis as the vector inner product operations, we are able to extend the capabilities of more traditional correlation filtering to what we refer to as "Super Images". These "Super Images" are used to morphologically transform a complicated input scene into a predetermined dot pattern, whose orientation is related to the rotational distortion of the object of interest. The optical implementation of "Super Images" yields the feature reduction necessary for using other techniques, such as artificial neural networks. We propose a parallel digital signal processor architecture based on specific pattern recognition algorithms but general enough to be applicable to other similar problems. Such an architecture is classified as a data flow architecture. Instead of mapping an algorithm to an architecture, we propose mapping the DSP architecture to a class of pattern recognition algorithms. Today's optical processing systems have difficulty implementing full complex filter structures. Typically, optical systems (like 4f correlators) are limited to phase-only implementations with lower detection performance than full complex electronic systems. Our study includes pseudo-random pixel encoding techniques for approximating full complex filtering. Optical filter bank implementation is possible, and such systems have the advantage of time-averaging the entire filter bank at real-time rates. Time-averaged optical filtering is computationally comparable to billions of digital operations per second. For this reason, we believe future trends in high-speed pattern recognition will involve hybrid architectures of both optical and DSP elements.
NASA Astrophysics Data System (ADS)
Palmesi, P.; Abert, C.; Bruckner, F.; Suess, D.
2018-05-01
Fast stray field calculation is commonly considered of great importance for micromagnetic simulations, since it is the most time-consuming part of the simulation. The Fast Multipole Method (FMM) has displayed linear O(N) parallelization behavior on many cores. This article investigates the error of a recent FMM approach that approximates sources using linear, instead of constant, finite elements in the singular integral for calculating the stray field and the corresponding potential. Having measured performance in an earlier manuscript, we here investigate the convergence of the relative L2 error for several FMM simulation parameters. Various scenarios, calculating the stray field either directly or via the potential, are discussed.
Fast and accurate mock catalogue generation for low-mass galaxies
NASA Astrophysics Data System (ADS)
Koda, Jun; Blake, Chris; Beutler, Florian; Kazin, Eyal; Marin, Felipe
2016-06-01
We present an accurate and fast framework for generating mock catalogues including low-mass haloes, based on an implementation of the COmoving Lagrangian Acceleration (COLA) technique. Multiple realisations of mock catalogues are crucial for analyses of large-scale structure, but conventional N-body simulations are too computationally expensive for the production of thousands of realisations. We show that COLA simulations can produce accurate mock catalogues with moderate computational resources for low- to intermediate-mass galaxies in 10¹² M⊙ haloes, both in real and redshift space. COLA simulations have accurate peculiar velocities, without systematic errors in the velocity power spectra for k ≤ 0.15 h Mpc⁻¹, and with only 3 per cent error for k ≤ 0.2 h Mpc⁻¹. We use COLA with 10 time steps and a Halo Occupation Distribution to produce 600 mock galaxy catalogues of the WiggleZ Dark Energy Survey. Our parallelized code for efficient generation of accurate halo catalogues is publicly available at github.com/junkoda/cola_halo.
Fast Segmentation From Blurred Data in 3D Fluorescence Microscopy.
Storath, Martin; Rickert, Dennis; Unser, Michael; Weinmann, Andreas
2017-10-01
We develop a fast algorithm for segmenting 3D images from linear measurements based on the Potts model (or piecewise constant Mumford-Shah model). To that end, we first derive suitable space discretizations of the 3D Potts model, which are capable of dealing with 3D images defined on non-cubic grids. Our discretization allows us to utilize a specific splitting approach, which results in decoupled subproblems of moderate size. The crucial point in the 3D setup is that the number of independent subproblems is so large that we can reasonably exploit the parallel processing capabilities of the graphics processing units (GPUs). Our GPU implementation is up to 18 times faster than the sequential CPU version. This allows to process even large volumes in acceptable runtimes. As a further contribution, we extend the algorithm in order to deal with non-negativity constraints. We demonstrate the efficiency of our method for combined image deconvolution and segmentation on simulated data and on real 3D wide field fluorescence microscopy data.
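The splitting reduces the 3D problem to many independent univariate Potts subproblems, which is exactly what supplies the GPU with its parallel workload. For orientation, a standard O(n²) dynamic program for one such 1D subproblem is sketched below; this is illustrative only, and the paper's discretization for non-cubic grids and its non-negativity extension are not reproduced.

```python
import numpy as np

def potts_1d(y, gamma):
    """min over piecewise-constant x of gamma*(#jumps) + sum (x_i - y_i)^2."""
    n = len(y)
    B = np.empty(n + 1); B[0] = -gamma     # optimal prefix costs
    left = np.zeros(n + 1, dtype=int)      # start of the last segment
    for r in range(1, n + 1):
        best, arg = np.inf, r
        mean = sse = 0.0
        cnt = 0
        for l in range(r, 0, -1):          # grow segment y[l-1:r] leftwards
            cnt += 1
            d = y[l - 1] - mean
            mean += d / cnt
            sse += d * (y[l - 1] - mean)   # running sum of squared errors
            cand = B[l - 1] + gamma + sse
            if cand < best:
                best, arg = cand, l
        B[r], left[r] = best, arg
    x, r = np.empty(n), n                  # backtrack the segmentation
    while r > 0:
        l = left[r]
        x[l - 1:r] = y[l - 1:r].mean()
        r = l - 1
    return x
```

On the GPU, thousands of such lines of the volume are solved concurrently within each splitting iteration, which is the source of the reported speedup over the sequential CPU version.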
A Fast Radio Burst Search Method for VLBI Observation
NASA Astrophysics Data System (ADS)
Liu, Lei; Tong, Fengxian; Zheng, Weimin; Zhang, Juan; Tong, Li
2018-02-01
We introduce a cross-spectrum-based fast radio burst (FRB) search method for Very Long Baseline Interferometry (VLBI) observation. This method optimizes the fringe-fitting scheme used in geodetic VLBI data post-processing: it fully utilizes the cross-spectrum fringe phase information and therefore maximizes the power of single-pulse signals. Working with the cross-spectrum greatly reduces the effect of radio frequency interference compared with using the auto-power spectrum. Single-pulse detection confidence increases by cross-identifying detections from multiple baselines, and by combining the power of multiple baselines we may improve the detection sensitivity. Our method is similar to coherent beam forming, but without the computational expense of forming a great number of beams to cover the whole field of view of our telescopes. The data processing pipeline designed for this method is easy to implement and parallelize, and can be deployed in various kinds of VLBI observations. In particular, we point out that VGOS observations are very well suited to FRB searches.
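A schematic of a per-baseline detection statistic in this spirit, in hedged NumPy form; the segment length, normalization, and threshold rule are illustrative assumptions, and the real pipeline operates on fringe-fitted geodetic VLBI data rather than raw streams.

```python
import numpy as np

def baseline_pulse_power(x1, x2, nfft=1024):
    """Coherently summed cross-spectrum power per time segment (sketch).

    x1, x2: delay-compensated voltage streams from the two stations of one
    baseline. RFI present at only one station has random cross-spectrum
    phase and largely averages out, unlike in the auto-power spectrum.
    """
    nseg = min(len(x1), len(x2)) // nfft
    power = np.empty(nseg)
    for s in range(nseg):
        X1 = np.fft.rfft(x1[s * nfft:(s + 1) * nfft])
        X2 = np.fft.rfft(x2[s * nfft:(s + 1) * nfft])
        cross = X1 * np.conj(X2)            # fringe phase lives here
        power[s] = np.abs(cross.sum())      # coherent sum over frequency
    return power

# multi-baseline combination (illustrative): add per-baseline powers and
# keep segments exceeding a robust threshold, then cross-identify them:
# total = sum(powers); hits = np.nonzero(total > np.median(total) * 7)[0]
```

Each baseline's loop is independent, so the per-segment, per-baseline structure parallelizes trivially, as the abstract notes.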
Large-scale trench-normal mantle flow beneath central South America
NASA Astrophysics Data System (ADS)
Reiss, M. C.; Rümpker, G.; Wölbern, I.
2018-01-01
We investigate the anisotropic properties of the fore-arc region of the central Andean margin between 17-25°S by analyzing shear-wave splitting from teleseismic and local earthquakes from the Nazca slab. With partly over ten years of recording time, the data set is uniquely suited to address the long-standing debate about the mantle flow field at the South American margin, and in particular whether the flow field beneath the slab is parallel or perpendicular to the trench. Our measurements suggest two anisotropic layers, located within the crust and mantle beneath the stations, respectively. The teleseismic measurements show a moderate change of fast polarizations from north to south along the trench, ranging from parallel to subparallel to the absolute plate motion, and are oriented mostly perpendicular to the trench. Shear-wave splitting measurements from local earthquakes show fast polarizations roughly aligned trench-parallel but exhibit short-scale variations which are indicative of a relatively shallow origin. Comparisons between fast polarization directions from local earthquakes and the strike of the local fault systems yield a good agreement. To infer the parameters of the lower anisotropic layer we employ an inversion of the teleseismic waveforms based on two-layer models, where the anisotropy of the upper (crustal) layer is constrained by the results from the local splitting. The waveform inversion yields a mantle layer that is best characterized by a fast axis parallel to the absolute plate motion, which is more or less perpendicular to the trench. This orientation is likely caused by a combination of the fossil crystallographic preferred orientation of olivine within the slab and entrained mantle flow beneath the slab. The anisotropy within the crust of the overriding continental plate is explained by the shape-preferred orientation of micro-cracks in relation to local fault zones which are oriented parallel to the overall strike of the Andean range. Our results do not provide any evidence for a significant contribution of trench-parallel mantle flow beneath the subducting slab.
Fast Multipole Methods for Three-Dimensional N-body Problems
NASA Technical Reports Server (NTRS)
Koumoutsakos, P.
1995-01-01
We are developing computational tools for the simulations of three-dimensional flows past bodies undergoing arbitrary motions. High resolution viscous vortex methods have been developed that allow for extended simulations of two-dimensional configurations such as vortex generators. Our objective is to extend this methodology to three dimensions and develop a robust computational scheme for the simulation of such flows. A fundamental issue in the use of vortex methods is the ability of employing efficiently large numbers of computational elements to resolve the large range of scales that exist in complex flows. The traditional cost of the method scales as O(N²) as the N computational elements/particles induce velocities at each other, making the method unacceptable for simulations involving more than a few tens of thousands of particles. In the last decade fast methods have been developed that have operation counts of O(N log N) or O(N) (referred to as BH and GR respectively) depending on the details of the algorithm. These methods are based on the observation that the effect of a cluster of particles at a certain distance may be approximated by a finite series expansion. In order to exploit this observation we need to decompose the element population spatially into clusters of particles and build a hierarchy of clusters (a tree data structure) - smaller neighboring clusters combine to form a cluster of the next size up in the hierarchy and so on. This hierarchy of clusters allows one to determine efficiently when the approximation is valid. This algorithm is an N-body solver that appears in many fields of engineering and science. Some examples of its diverse use are in astrophysics, molecular dynamics, micro-magnetics, boundary element simulations of electromagnetic problems, and computer animation. More recently these N-body solvers have been implemented and applied in simulations involving vortex methods. Koumoutsakos and Leonard (1995) implemented the GR scheme in two dimensions for vector computer architectures allowing for simulations of bluff body flows using millions of particles. Winckelmans presented three-dimensional, viscous simulations of interacting vortex rings, using vortons and an implementation of a BH scheme for parallel computer architectures. Bhatt presented a vortex filament method to perform inviscid vortex ring interactions, with an alternative implementation of a BH scheme for a Connection Machine parallel computer architecture.
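The acceptance test at the heart of such tree codes is compact enough to sketch. Below is a hedged, recursive BH-style evaluation with a monopole (far-field) approximation, where `node` is a hypothetical octree cell carrying its size, total strength, and centroid; a gravitational-like kernel stands in for the vortex-method kernel.

```python
import numpy as np

def far_field(node, x, theta=0.5, eps=1e-12):
    """Barnes-Hut-style evaluation at point x (sketch).

    If the cell subtends less than `theta` as seen from x, its particles
    are replaced by a single monopole at the centroid; otherwise recurse
    into the children (higher-order expansions refine the same idea).
    """
    d = node.centroid - x
    r = np.linalg.norm(d) + eps
    if node.is_leaf or node.size / r < theta:   # acceptance criterion
        return node.total_strength * d / r**3   # monopole approximation
    return sum(far_field(c, x, theta, eps) for c in node.children)
```

Smaller theta recurses deeper, trading work for accuracy; the O(N log N) count arises because each of the N evaluation points touches O(log N) cells of the hierarchy.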
The multigrid preconditioned conjugate gradient method
NASA Technical Reports Server (NTRS)
Tatebe, Osamu
1993-01-01
A multigrid preconditioned conjugate gradient method (MGCG method), which uses the multigrid method as a preconditioner for the PCG method, is proposed. The multigrid method has inherent high parallelism and improves the convergence of long-wavelength components, which is important in iterative methods. By using it as a preconditioner for the PCG method, an efficient method with high parallelism and fast convergence is obtained. First, a necessary condition for the multigrid method to satisfy the requirements of a PCG preconditioner is considered. Numerical experiments then show the behavior of the MGCG method and demonstrate that it is superior to both the ICCG method and the multigrid method in terms of fast convergence and high parallelism. This fast convergence is understood in terms of an eigenvalue analysis of the preconditioned matrix. From this observation of the multigrid preconditioner, it is seen that the MGCG method converges in very few iterations and that the multigrid preconditioner is a desirable preconditioner for the conjugate gradient method.
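The MGCG iteration itself is just preconditioned CG with the preconditioner solve replaced by one multigrid cycle. A minimal sketch follows, assuming `vcycle(r)` applies a fixed, symmetric positive-definite multigrid V-cycle to the residual (the specific cycle is not taken from the paper).

```python
import numpy as np

def mgcg(A, b, vcycle, tol=1e-8, maxit=100):
    """Preconditioned CG where z = M^{-1} r is one multigrid V-cycle."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = vcycle(r)                      # preconditioner solve
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = vcycle(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p      # direction update in the M-inner product
        rz = rz_new
    return x
```

For CG theory to apply, the V-cycle must act as a fixed symmetric positive-definite operator (e.g. with symmetric pre- and post-smoothing), which is presumably the kind of requirement the necessary condition in the abstract addresses.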
NASA Astrophysics Data System (ADS)
Javidi, Bahram
The present conference discusses topics in the fields of neural networks, acoustooptic signal processing, pattern recognition, phase-only processing, nonlinear signal processing, image processing, optical computing, and optical information processing. Attention is given to the optical implementation of an inner-product neural associative memory, optoelectronic associative recall via motionless-head/parallel-readout optical disk, a compact real-time acoustooptic image correlator, a multidimensional synthetic estimation filter, and a light-efficient joint transform optical correlator. Also discussed are a high-resolution spatial light modulator, compact real-time interferometric Fourier-transform processors, a fast decorrelation algorithm for permutation arrays, the optical interconnection of optical modules, and carry-free optical binary adders.
Zhao, Ming; Li, Yu; Peng, Leilei
2014-05-05
We present a novel excitation-emission multiplexed fluorescence lifetime microscopy (FLIM) method that surpasses current FLIM techniques in multiplexing capability. The method employs Fourier multiplexing to simultaneously acquire confocal fluorescence lifetime images of multiple excitation wavelength and emission color combinations at 44,000 pixels/sec. The system is built with low-cost CW laser sources and standard PMTs with versatile spectral configuration, and can be implemented as an add-on to commercial confocal microscopes. The Fourier lifetime confocal method allows fast multiplexed FLIM imaging, which makes it possible to monitor multiple biological processes in live cells. The low cost and compatibility with commercial systems could also make multiplexed FLIM more accessible to the biological research community.
NASA Technical Reports Server (NTRS)
Rash, James L. (Editor); Dent, Carolyn P. (Editor)
1989-01-01
Theoretical and implementation aspects of AI systems for space applications are discussed in reviews and reports. Sections are devoted to planning and scheduling, fault isolation and diagnosis, data management, modeling and simulation, and development tools and methods. Particular attention is given to a situated reasoning architecture for space repair and replace tasks, parallel plan execution with self-processing networks, the electrical diagnostics expert system for Spacelab life-sciences experiments, diagnostic tolerance for missing sensor data, the integration of perception and reasoning in fast neural modules, a connectionist model for dynamic control, and applications of fuzzy sets to the development of rule-based expert systems.
NASA Astrophysics Data System (ADS)
Work, Paul R.
1991-12-01
This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high-level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high-level and detailed designs specified in a high-level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that an efficient parallel application can be achieved for a serial computational electromagnetics program, with the characteristics of the algorithm and the target architecture critically influencing the development of such an implementation.
NASA Astrophysics Data System (ADS)
Lohn, Stefan B.; Dong, Xin; Carminati, Federico
2012-12-01
Chip multiprocessors will support massive parallelism through many additional physical and logical cores. Performance gains can no longer be obtained by increasing clock frequency, because the technical limits are almost reached; instead, parallel execution must be used to gain performance. Resources like main memory, the cache hierarchy, and the bandwidth of the memory bus or of links between cores and sockets will not improve as quickly. Hence, parallelism can only yield performance gains if memory usage is optimized and communication between threads is minimized. Moreover, concurrent programming has become a domain for experts: implementing multi-threading is error-prone and labor-intensive, and a full reimplementation of the whole AliRoot source code is unaffordable. This paper describes the effort to evaluate the adaptation of AliRoot to the needs of multi-threading and to provide the capability of parallel processing by using a semi-automatic source-to-source transformation, addressing the problems described above and providing a straightforward path to parallelization with almost no interference between threads. This makes the approach simple and reduces the required manual changes in the code. In a first step, unconditional thread-safety is introduced to bring the original sequential, thread-unaware source code into a position to utilize multi-threading. Further investigation then identifies candidate classes that are useful to share among threads, and in a second step the transformation changes the code to share these classes; finally, it is verified that no invalid interference between threads remains.
NASA Astrophysics Data System (ADS)
Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.
2017-12-01
This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
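The "fictitious cells" exchange pattern described above can be sketched in a few lines: here two subdomains of a 1-D Jacobi solver each carry one ghost cell that is refreshed from the neighbor before every sweep. The toy problem and names are assumptions; the paper performs this exchange between MPI processes on unstructured grids.

```python
# Ghost-cell (fictitious cell) exchange sketch on a 1-D Jacobi sweep.
import numpy as np

N, steps = 16, 400
u = np.zeros(N); u[-1] = 1.0              # global 1-D field, fixed end values

# each subdomain owns N/2 cells and carries one fictitious (ghost) cell
left  = u[:N // 2 + 1].copy()             # owned 0..7  + ghost for global 8
right = u[N // 2 - 1:].copy()             # ghost for global 7 + owned 8..15

for _ in range(steps):
    left[-1] = right[1]                   # exchange: refresh ghost values from
    right[0] = left[-2]                   # the neighbour's boundary cells
    for part in (left, right):            # independent Jacobi sweep per domain
        part[1:-1] = 0.5 * (part[:-2] + part[2:])

u = np.concatenate([left[:-1], right[1:]])
print(np.allclose(u, np.linspace(0, 1, N), atol=1e-3))   # converged profile
```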
Fast parallel algorithm for slicing STL based on pipeline
NASA Astrophysics Data System (ADS)
Ma, Xulong; Lin, Feng; Yao, Bo
2016-05-01
In the field of Additive Manufacturing, current research on data processing mainly focuses on the slicing of large STL files or complicated CAD models. To improve efficiency and reduce slicing time, a parallel algorithm has great advantages. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. In this paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of thread count and layer count are investigated in a series of experiments. The experimental results show that thread count and layer count are two significant factors in the speedup ratio. The tendency of speedup versus thread count shows a positive relationship that agrees closely with Amdahl's law, and the tendency of speedup versus layer count also shows a positive relationship, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. Another parallel algorithm, based on data parallelism, is used in the experiments to show that the pipeline parallel mode is more efficient. A concluding case study demonstrates the strong performance of the new parallel algorithm: compared with the serial slicing algorithm, it makes full use of multi-core CPU hardware and accelerates the slicing process, and compared with the data-parallel slicing algorithm, its pipeline parallel model achieves a much higher speedup ratio and efficiency.
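A minimal sketch of the pipeline parallel mode the paper adopts: slicing stages run as threads connected by bounded queues, so layer k can be processed in one stage while layer k-1 occupies the next. The stage bodies are placeholders, not the authors' STL code.

```python
# Pipeline-parallel sketch: stages as threads linked by bounded queues.
import threading, queue

def stage(fn, inbox, outbox):
    while True:
        item = inbox.get()
        if item is None:                 # poison pill terminates the stage
            if outbox is not None:
                outbox.put(None)
            break
        result = fn(item)
        if outbox is not None:
            outbox.put(result)

layers = range(100)
q1, q2 = queue.Queue(8), queue.Queue(8)  # bounded queues give backpressure
done = []

t1 = threading.Thread(target=stage, args=(lambda z: ("facets", z), q1, q2))
t2 = threading.Thread(target=stage, args=(lambda fz: done.append(fz), q2, None))
t1.start(); t2.start()
for z in layers:
    q1.put(z)                            # producer: feed layer heights
q1.put(None)
t1.join(); t2.join()
print(len(done), "layers sliced")
```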
Object-Oriented Implementation of the NAS Parallel Benchmarks using Charm++
NASA Technical Reports Server (NTRS)
Krishnan, Sanjeev; Bhandarkar, Milind; Kale, Laxmikant V.
1996-01-01
This report describes experiences with implementing the NAS Computational Fluid Dynamics benchmarks using a parallel object-oriented language, Charm++. Our main objective in implementing the NAS CFD kernel benchmarks was to develop a code that could be used to easily experiment with different domain decomposition strategies and dynamic load balancing. We also wished to leverage the object-orientation provided by the Charm++ parallel object-oriented language, to develop reusable abstractions that would simplify the process of developing parallel applications. We first describe the Charm++ parallel programming model and the parallel object array abstraction, then go into detail about each of the Scalar Pentadiagonal (SP) and Lower/Upper Triangular (LU) benchmarks, along with performance results. Finally we conclude with an evaluation of the methodology used.
Fox, W.; Sciortino, F.; v. Stechow, A.; ...
2017-03-21
We report detailed laboratory observations of the structure of a reconnection current sheet in a two-fluid plasma regime with a guide magnetic field. We observe and quantitatively analyze the quadrupolar electron pressure variation in the ion-diffusion region, as originally predicted by extended magnetohydrodynamics simulations. The projection of the electron pressure gradient parallel to the magnetic field contributes significantly to balancing the parallel electric field, and the resulting cross-field electron jets in the reconnection layer are diamagnetic in origin. Furthermore, these results demonstrate how parallel and perpendicular force balance are coupled in guide field reconnection and confirm basic theoretical models of the importance of electron pressure gradients for obtaining fast magnetic reconnection.
Dharmaraj, Christopher D; Thadikonda, Kishan; Fletcher, Anthony R; Doan, Phuc N; Devasahayam, Nallathamby; Matsumoto, Shingo; Johnson, Calvin A; Cook, John A; Mitchell, James B; Subramanian, Sankaran; Krishna, Murali C
2009-01-01
Three-dimensional Oximetric Electron Paramagnetic Resonance Imaging using the Single Point Imaging modality generates unpaired spin density and oxygen images that can readily distinguish between normal and tumor tissues in small animals. With fast imaging it is also possible to track changes in tissue oxygenation in response to the oxygen content of the breathing air. However, this involves dealing with gigabytes of data for each 3D oximetric imaging experiment, involving digital band-pass filtering and background noise subtraction followed by 3D Fourier reconstruction; this process is rather slow on a conventional uniprocessor system. This paper presents a parallelization framework using OpenMP runtime support and parallel MATLAB to execute such computationally intensive programs. The Intel compiler is used to develop a parallel C++ code based on OpenMP. The code is executed on four dual-core AMD Opteron shared-memory processors to reduce the computational burden of the filtration task significantly. The results show that the parallel filtration code achieved a speedup factor of 46.66 relative to the equivalent serial MATLAB code. In addition, a parallel MATLAB code has been developed to perform 3D Fourier reconstruction; speedup factors of 4.57 and 4.25 were achieved during the reconstruction process and oximetry computation, for a data set with 23 x 23 x 23 gradient steps. The execution time has been computed for both the serial and parallel implementations using different data dimensions and is presented for comparison. The reported system has been designed to be easily accessible even from low-cost personal computers through the local network (NIHnet). The experimental results demonstrate that parallel computing provides the computational power needed to obtain biophysical parameters from 3D EPR oximetric imaging almost in real time.
Automatic recognition of vector and parallel operations in a higher level language
NASA Technical Reports Server (NTRS)
Schneck, P. B.
1971-01-01
A compiler for recognizing statements of a FORTRAN program which are suited for fast execution on a parallel or pipeline machine such as Illiac-4, Star or ASC is described. The technique employs interval analysis to provide flow information to the vector/parallel recognizer. Where profitable the compiler changes scalar variables to subscripted variables. The output of the compiler is an extension to FORTRAN which shows parallel and vector operations explicitly.
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)
1998-01-01
This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.
Computer-intensive simulation of solid-state NMR experiments using SIMPSON.
Tošner, Zdeněk; Andersen, Rasmus; Stevensson, Baltzar; Edén, Mattias; Nielsen, Niels Chr; Vosegaard, Thomas
2014-09-01
Conducting large-scale solid-state NMR simulations requires fast computer software, potentially in combination with efficient computational resources, to complete within a reasonable time frame. Such simulations may involve large spin systems, multiple-parameter fitting of experimental spectra, or multiple-pulse experiment design using parameter scans, non-linear optimization, or optimal control procedures. To accommodate such simulations efficiently, we here present an improved version of the widely distributed open-source SIMPSON NMR simulation software package, adapted to contemporary high-performance hardware setups. The software is optimized for fast performance on standard stand-alone computers, multi-core processors, and large clusters of identical nodes. We describe the novel features for fast computation, including internal matrix manipulations, propagator setups, and acquisition strategies. For efficient calculation of powder averages, we implemented the interpolation method of Alderman, Solum, and Grant, as well as the recently introduced fast Wigner transform interpolation technique. The potential of the optimal control toolbox is greatly enhanced by higher-precision gradients in combination with the efficient optimization algorithm known as limited-memory Broyden-Fletcher-Goldfarb-Shanno. In addition, advanced parallelization can be used in all types of calculations, providing significant time reductions. SIMPSON thus reflects current knowledge in the field of numerical simulation of solid-state NMR experiments. The efficiency and novel features are demonstrated on representative simulations. Copyright © 2014 Elsevier Inc. All rights reserved.
Scalable Unix commands for parallel processors : a high-performance implementation.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ong, E.; Lusk, E.; Gropp, W.
2001-06-22
We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
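The pattern these commands embody can be sketched with mpi4py (an assumption; the originals are MPI programs in C): every rank runs the local operation, and a root rank gathers and prints labelled results, here for a "parallel ls".

```python
# Hedged mpi4py sketch of the Parallel Unix Commands pattern ("parallel ls"):
# local work on every rank, gather at root, print labelled results.
from mpi4py import MPI
import os, socket

comm = MPI.COMM_WORLD
listing = sorted(os.listdir("."))                 # local command on every node
results = comm.gather((socket.gethostname(), listing), root=0)
if comm.rank == 0:
    for host, names in results:
        print(f"{host}: {len(names)} entries")
# run with: mpiexec -n 8 python pls.py
```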
A Multi-Level Parallelization Concept for High-Fidelity Multi-Block Solvers
NASA Technical Reports Server (NTRS)
Hatay, Ferhat F.; Jespersen, Dennis C.; Guruswamy, Guru P.; Rizk, Yehia M.; Byun, Chansup; Gee, Ken; VanDalsem, William R. (Technical Monitor)
1997-01-01
The integration of high-fidelity Computational Fluid Dynamics (CFD) analysis tools with the industrial design process benefits greatly from robust implementations that are transportable across a wide range of computer architectures. In the present work, a hybrid domain-decomposition and parallelization concept was developed and implemented into the widely used NASA multi-block CFD packages ENSAERO and OVERFLOW. The new parallel solver concept, PENS (Parallel Euler Navier-Stokes Solver), employs both fine and coarse granularity in data partitioning as well as data coalescing to obtain the desired load-balance characteristics on the available computer platforms. This multi-level parallelism implementation itself introduces no changes to the numerical results, hence the original fidelity of the packages is identically preserved. The present implementation uses the Message Passing Interface (MPI) library for interprocessor message passing and memory access. By choosing an appropriate combination of the available partitioning and coalescing capabilities only during the execution stage, the PENS solver becomes adaptable to different computer architectures, from shared-memory to distributed-memory platforms with varying degrees of parallelism. The PENS implementation on the IBM SP2 distributed-memory environment at the NASA Ames Research Center attains 85 percent scalable parallel performance using fine-grain partitioning of single-block CFD domains on up to 128 wide computational nodes. Multi-block simulations of complete aircraft achieve 75 percent load-balanced execution using data coalescing and the two levels of parallelism. The SGI PowerChallenge, SGI Origin 2000, and a cluster of workstations are the other platforms on which the robustness of the implementation is tested. The performance behavior on these platforms with a variety of realistic problems will be included as this ongoing study progresses.
Motion streaks in fast motion rivalry cause orientation-selective suppression.
Apthorp, Deborah; Wenderoth, Peter; Alais, David
2009-05-14
We studied binocular rivalry between orthogonally translating arrays of random Gaussian blobs and measured the strength of rivalry suppression for static oriented probes. Suppression depth was quantified by expressing monocular probe thresholds during dominance relative to thresholds during suppression. Rivalry between two fast motions or two slow motions was compared in order to test the suggestion that fast-moving objects leave oriented "motion streaks" due to temporal integration (W. S. Geisler, 1999). If fast motions do produce motion streaks, then fast motion rivalry might also entail rivalry between the orthogonal streak orientations. We tested this using a static oriented probe that was aligned either parallel to the motion trajectory (hence collinear with the "streaks") or was orthogonal to the trajectory, predicting that rivalry suppression would be greater for parallel probes, and only for rivalry between fast motions. Results confirmed that suppression depth did depend on probe orientation for fast motion but not for slow motion. Further experiments showed that threshold elevations for the oriented probe during suppression exhibited clear orientation tuning. However, orientation-tuned elevations were also present during dominance, suggesting within-channel masking as the basis of the extra-deep suppression. In sum, the presence of orientation-dependent suppression in fast motion rivalry is consistent with the "motion streaks" hypothesis.
Novel Optical Processor for Phased Array Antenna.
1992-10-20
A parallel glass slide is inserted into the signal beam optical loop; the parallel glass acts as a variable phase shifter, simulating phase drift in the signal beam. [A table of candidate acousto-optic designs follows in the source but is garbled in extraction; its columns list acoustic velocity, the frequency at which attenuation reaches 100 dB/cm, the bandwidth limit, wavelength, and the figure of merit M2, with a row for the TeO2 longitudinal mode (4.2 μm/ns, about 3 GHz, 1.4 μm, M2 = 34).] Subject to achievable acoustic frequency, the preferred materials are the slow shear wave in TeO2, the fast shear wave in TeO2, or the shear waves in
ERIC Educational Resources Information Center
Gil, Arturo; Peidró, Adrián; Reinoso, Óscar; Marín, José María
2017-01-01
This paper presents a tool, LABEL, oriented to the teaching of parallel robotics. The application, organized as a set of tools developed using Easy Java Simulations, enables the study of the kinematics of parallel robotics. A set of classical parallel structures was implemented such that LABEL can solve the inverse and direct kinematic problem of…
Stanaćević, Milutin; Li, Shuo; Cauwenberghs, Gert
2016-07-01
A parallel micro-power mixed-signal VLSI implementation of independent component analysis (ICA) with reconfigurable outer-product learning rules is presented. With gradient sensing of the acoustic field over a miniature microphone array as a pre-processing method, the proposed ICA implementation can separate and localize up to 3 sources in mildly reverberant environments. The ICA processor is implemented in 0.5 µm CMOS technology and occupies a 3 mm × 3 mm area. At a 16 kHz sampling rate, the ASIC consumes 195 µW from a 3 V supply. The outer-product implementation of the natural gradient and Herault-Jutten ICA update rules demonstrates performance comparable to the benchmark FastICA algorithm in ideal conditions and more robust performance in noisy and reverberant environments. Experiments demonstrate perceptually clear separation and precise localization over a wide range of separation angles for two speech sources presented through speakers positioned 1.5 m from the array on a conference room table. The presented ASIC leads to an extremely small-form-factor, low-power microsystem for the source separation and localization required in applications like intelligent hearing aids and wireless distributed acoustic sensor arrays.
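The outer-product natural-gradient update mentioned above has a compact form, W ← W + μ(I − f(y)yᵀ)W with y = Wx. Below is a floating-point sketch of it separating two synthetic mixtures; the sources, score function f = tanh, and learning rate are illustrative choices, unlike the chip's mixed-signal realization.

```python
# Sketch of the outer-product natural-gradient ICA rule on synthetic mixtures.
import numpy as np

rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 20000))          # two independent super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])
x = A @ s                                 # observed mixtures

W, mu = np.eye(2), 1e-3
for i in range(x.shape[1]):
    y = W @ x[:, i]
    W += mu * (np.eye(2) - np.outer(np.tanh(y), y)) @ W   # outer-product update

y = W @ x                                 # separated (up to scale/permutation)
```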
Property-driven functional verification technique for high-speed vision system-on-chip processor
NASA Astrophysics Data System (ADS)
Nshunguyimfura, Victor; Yang, Jie; Liu, Liyuan; Wu, Nanjian
2017-04-01
The implementation of functional verification in a fast, reliable, and effective manner is a challenging task in the vision chip verification process, mainly because existing functional verification techniques proceed stepwise. This verification complexity is also related to the fact that in most vision chip design cycles, extensive effort is focused on optimizing chip metrics such as performance, power, and area, while design functional verification is not explicitly considered at the earlier stages where the soundest decisions are made. In this paper, we propose a semi-automatic property-driven verification technique in which the implementation of all verification components is based on design properties. We introduce a low-dimensional property space between the specification space and the implementation space. The aim of this technique is to speed up the verification process for high-performance parallel-processing vision chips. Our experimental results show that the proposed technique can improve verification efficiency by up to 20% for a complex vision chip design while reducing the simulation and debugging overheads.
Homemade Buckeye-Pi: A Learning Many-Node Platform for High-Performance Parallel Computing
NASA Astrophysics Data System (ADS)
Amooie, M. A.; Moortgat, J.
2017-12-01
We report on the "Buckeye-Pi" cluster, the supercomputer developed in The Ohio State University School of Earth Sciences from 128 inexpensive Raspberry Pi (RPi) 3 Model B single-board computers. Each RPi is equipped with a fast quad-core 1.2 GHz 64-bit ARMv8 processor, 1 GB of RAM, and a 32 GB microSD card for local storage. The cluster therefore has a total of 128 GB of RAM distributed over the individual nodes, a flash capacity of 4 TB, and 512 processor cores, while benefiting from low power consumption, easy portability, and low total cost. The cluster uses the Message Passing Interface protocol to manage communication between nodes. These features render our platform the most powerful RPi supercomputer to date and suitable for educational applications in high-performance computing (HPC) and the handling of large datasets. In particular, we use the Buckeye-Pi to implement optimized parallel codes in our in-house simulator for subsurface media flows, with the goal of achieving a massively parallelized, scalable code. We present benchmarking results for computational performance across various numbers of RPi nodes. We believe our project could inspire scientists and students to consider the proposed unconventional cluster architecture as a mainstream and feasible learning platform for challenging engineering and scientific problems.
Analyzing and Visualizing Cosmological Simulations with ParaView
NASA Astrophysics Data System (ADS)
Woodring, Jonathan; Heitmann, Katrin; Ahrens, James; Fasel, Patricia; Hsu, Chung-Hsing; Habib, Salman; Pope, Adrian
2011-07-01
The advent of large cosmological sky surveys—ushering in the era of precision cosmology—has been accompanied by ever larger cosmological simulations. The analysis of these simulations, which currently encompass tens of billions of particles and up to a trillion particles in the near future, is often as daunting as carrying out the simulations in the first place. Therefore, the development of very efficient analysis tools combining qualitative and quantitative capabilities is a matter of some urgency. In this paper, we introduce new analysis features implemented within ParaView, a fully parallel, open-source visualization toolkit, to analyze large N-body simulations. A major aspect of ParaView is that it can live and operate on the same machines and utilize the same parallel power as the simulation codes themselves. In addition, data movement is a serious bottleneck now and will become even more of an issue in the future; an interactive visualization and analysis tool that can handle data in situ is fast becoming essential. The new features in ParaView include particle readers and a very efficient halo finder that identifies friends-of-friends halos and determines common halo properties, including spherical overdensity properties. In combination with many other functionalities already existing within ParaView, such as histogram routines or interfaces to programming languages like Python, this enhanced version enables fast, interactive, and convenient analyses of large cosmological simulations. In addition, development paths are available for future extensions.
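As a rough illustration of the friends-of-friends step (not ParaView's implementation), the sketch below links particle pairs closer than a linking length using a k-d tree and takes connected components as halos; the linking length and point counts are made up.

```python
# Minimal friends-of-friends (FOF) sketch: link pairs within b, then take
# connected components as halos. Illustrative parameters throughout.
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(2)
pts = rng.random((5000, 3))
b = 0.02                                   # linking length
tree = cKDTree(pts)
pairs = tree.query_pairs(b, output_type="ndarray")
adj = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                 shape=(len(pts), len(pts)))
n_groups, labels = connected_components(adj, directed=False)
sizes = np.bincount(labels)
print("linked groups (>= 2 members):", int((sizes >= 2).sum()))
```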
Parallel processing implementation for the coupled transport of photons and electrons using OpenMP
NASA Astrophysics Data System (ADS)
Doerner, Edgardo
2016-05-01
In this work the use of OpenMP to implement parallel processing of the Monte Carlo (MC) simulation of coupled photon and electron transport is presented. The implementation was carried out using a modified EGSnrc platform that enables the use of the Microsoft Visual Studio 2013 (VS2013) environment, together with the development tools available in the Intel Parallel Studio XE 2015 (XE2015). A performance study of this new implementation was carried out on a desktop PC with a multi-core CPU, taking the performance of the original platform as a reference. The results were satisfactory in terms of both scalability and parallelization efficiency.
AdiosStMan: Parallelizing Casacore Table Data System using Adaptive IO System
NASA Astrophysics Data System (ADS)
Wang, R.; Harris, C.; Wicenec, A.
2016-07-01
In this paper, we investigate the Casacore Table Data System (CTDS) used in the casacore and CASA libraries, and methods to parallelize it. CTDS provides a storage manager plugin mechanism for third-party developers to design and implement their own CTDS storage managers. With this in mind, we looked into various storage backend techniques that could enable parallel I/O for CTDS through new storage managers. After carrying out benchmarks showing the excellent parallel I/O throughput of the Adaptive IO System (ADIOS), we implemented an ADIOS-based parallel CTDS storage manager. We then applied the CASA MSTransform frequency split task to verify the ADIOS storage manager. We also ran a series of performance tests to examine the I/O throughput in a massively parallel scenario.
Method for implementation of recursive hierarchical segmentation on parallel computers
NASA Technical Reports Server (NTRS)
Tilton, James C. (Inventor)
2005-01-01
A method, computer readable storage, and apparatus for implementing a recursive hierarchical segmentation algorithm on a parallel computing platform. The method includes setting a bottom level of recursion that defines where a recursive division of an image into sections stops dividing, and setting an intermediate level of recursion where the recursive division changes from a parallel implementation into a serial implementation. The segmentation algorithm is implemented according to the set levels. The method can also include setting a convergence check level of recursion with which the first level of recursion communicates with when performing a convergence check.
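A hedged sketch of the claimed two-threshold recursion: quadrant recursion forks workers above a chosen level, switches to a serial implementation at an intermediate level, and stops dividing at the bottom level. All names and the per-section computation are illustrative, not the patented code.

```python
# Recursive image division with a parallel-to-serial switch level.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

BOTTOM_LEVEL, SERIAL_LEVEL = 4, 1         # stop dividing at 4; go serial at 1

def segment(tile):
    return float(tile.mean())             # placeholder per-section computation

def recurse(img, level):
    if level == BOTTOM_LEVEL:             # bottom level: stop dividing
        return [segment(img)]
    h, w = img.shape
    quads = [img[:h//2, :w//2], img[:h//2, w//2:],
             img[h//2:, :w//2], img[h//2:, w//2:]]
    if level < SERIAL_LEVEL:              # parallel phase of the recursion
        with ProcessPoolExecutor(max_workers=4) as ex:
            parts = list(ex.map(recurse, quads, [level + 1] * 4))
    else:                                 # intermediate level: switch to serial
        parts = [recurse(q, level + 1) for q in quads]
    return [r for part in parts for r in part]

if __name__ == "__main__":
    image = np.arange(256 * 256, dtype=float).reshape(256, 256)
    print(len(recurse(image, 0)), "leaf sections")   # 4**BOTTOM_LEVEL = 256
```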
NASA Technical Reports Server (NTRS)
Luke, Edward Allen
1993-01-01
Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
Bit-parallel arithmetic in a massively-parallel associative processor
NASA Technical Reports Server (NTRS)
Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.
1992-01-01
A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m^2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.
Qi, Xin; Xing, Fuyong; Foran, David J.; Yang, Lin
2013-01-01
Background: Automated analysis of imaged histopathology specimens could potentially provide support for improved reliability in detection and classification in a range of investigative and clinical cancer applications. Automated segmentation of cells in digitized tissue microarrays (TMAs) is often a prerequisite for quantitative analysis, but overlapping cells usually pose significant challenges for traditional segmentation algorithms. Objectives: In this paper, we propose a novel automatic algorithm to separate overlapping cells in stained histology specimens acquired using bright-field RGB imaging. Methods: It starts by systematically identifying salient regions of interest throughout the image based upon their underlying visual content. The segmentation algorithm subsequently performs a quick, voting-based seed detection. Finally, the contour of each cell is obtained using a repulsive level-set deformable model initialized with the seeds generated in the previous step. We compared the experimental results with the most current literature, and measured the pixel-wise accuracy between human experts' annotations and those generated using the automatic segmentation algorithm. Results: The method was tested on 100 image patches containing more than 1000 overlapping cells. The overall precision and recall of the developed algorithm are 90% and 78%, respectively. We also implemented the algorithm on GPU; the parallel implementation is 22 times faster than its sequential C/C++ implementation. Conclusion: The proposed overlapping-cell segmentation algorithm can accurately detect the center of each overlapping cell and effectively separate overlapping cells. The GPU is shown to be an efficient parallel platform for overlapping-cell segmentation. PMID:22526139
High Performance Parallel Architectures
NASA Technical Reports Server (NTRS)
El-Ghazawi, Tarek; Kaewpijit, Sinthop
1998-01-01
Traditional remote sensing instruments are multispectral, collecting observations at a few different spectral bands. Recently, many hyperspectral instruments, which can collect observations at hundreds of bands, have become operational. Furthermore, there have been ongoing research efforts on ultraspectral instruments that can produce observations at thousands of spectral bands. While these remote sensing technology developments hold great promise for new findings in Earth and space science, they present many challenges, including the need for faster processing of increased data volumes and methods for data reduction. Dimension reduction is a spectral transformation aimed at concentrating the vital information and discarding redundant data. One such transformation, widely used in remote sensing, is Principal Components Analysis (PCA). This report summarizes our progress on the development of a parallel PCA and its implementation on two Beowulf cluster configurations: one with a fast Ethernet switch and the other with a Myrinet interconnect. Details of the implementation and performance results, for typical sets of multispectral and hyperspectral NASA remote sensing data, are presented and analyzed based on the algorithm requirements and the underlying machine configuration. It is shown that the PCA application is quite challenging and hard to scale on Ethernet-based clusters. However, the measurements also show that a high-performance interconnection network, such as Myrinet, better matches the high communication demand of PCA and can lead to a more efficient PCA execution.
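For reference, the PCA computation being parallelized reduces to a covariance eigendecomposition; in the cluster setting each node can accumulate a partial X^T X over its block of pixels before the small eigenproblem is solved. A serial numpy sketch with illustrative shapes:

```python
# PCA dimension-reduction sketch: band covariance, leading eigenvectors,
# projection. Shapes are illustrative (pixels x spectral bands).
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((10000, 64))              # 10k pixels, 64 spectral bands
Xc = X - X.mean(axis=0)                  # center each band
C = (Xc.T @ Xc) / (len(X) - 1)           # 64x64 covariance: the parallel hotspot
vals, vecs = np.linalg.eigh(C)           # symmetric eigendecomposition
order = np.argsort(vals)[::-1]
k = 8
P = vecs[:, order[:k]]                   # leading principal components
Y = Xc @ P                               # reduced data: 10k x 8
```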
Toward real-time Monte Carlo simulation using a commercial cloud computing infrastructure.
Wang, Henry; Ma, Yunzhi; Pratx, Guillem; Xing, Lei
2011-09-07
Monte Carlo (MC) methods are the gold standard for modeling photon and electron transport in a heterogeneous medium; however, their computational cost prohibits their routine use in the clinic. Cloud computing, wherein computing resources are allocated on-demand from a third party, is a new approach for high performance computing and is implemented to perform ultra-fast MC calculation in radiation therapy. We deployed the EGS5 MC package in a commercial cloud environment. Launched from a single local computer with Internet access, a Python script allocates a remote virtual cluster. A handshaking protocol designates master and worker nodes. The EGS5 binaries and the simulation data are initially loaded onto the master node. The simulation is then distributed among independent worker nodes via the message passing interface, and the results aggregated on the local computer for display and data analysis. The described approach is evaluated for pencil beams and broad beams of high-energy electrons and photons. The output of cloud-based MC simulation is identical to that produced by single-threaded implementation. For 1 million electrons, a simulation that takes 2.58 h on a local computer can be executed in 3.3 min on the cloud with 100 nodes, a 47× speed-up. Simulation time scales inversely with the number of parallel nodes. The parallelization overhead is also negligible for large simulations. Cloud computing represents one of the most important recent advances in supercomputing technology and provides a promising platform for substantially improved MC simulation. In addition to the significant speed up, cloud computing builds a layer of abstraction for high performance parallel computing, which may change the way dose calculations are performed and radiation treatment plans are completed.
Bio-inspired approach for intelligent unattended ground sensors
NASA Astrophysics Data System (ADS)
Hueber, Nicolas; Raymond, Pierre; Hennequin, Christophe; Pichler, Alexander; Perrot, Maxime; Voisin, Philippe; Moeglin, Jean-Pierre
2015-05-01
Improving the surveillance capacity over wide zones requires a set of smart battery-powered Unattended Ground Sensors capable of issuing an alarm to a decision-making center. Only high-level information has to be sent when a relevant suspicious situation occurs. In this paper we propose an innovative bio-inspired approach that mimics the human bi-modal vision mechanism and the parallel processing ability of the human brain. The designed prototype exploits two levels of analysis: a low-level panoramic motion analysis, the peripheral vision, and a high-level event-focused analysis, the foveal vision. By tracking moving objects and fusing multiple criteria (size, speed, trajectory, etc.), the peripheral vision module acts as a fast relevant event detector. The foveal vision module focuses on the detected events to extract more detailed features (texture, color, shape, etc.) in order to improve the recognition efficiency. The implemented recognition core is able to acquire human knowledge and to classify in real-time a huge amount of heterogeneous data thanks to its natively parallel hardware structure. This UGS prototype validates our system approach under laboratory tests. The peripheral analysis module demonstrates a low false alarm rate whereas the foveal vision correctly focuses on the detected events. A parallel FPGA implementation of the recognition core succeeds in fulfilling the embedded application requirements. These results are paving the way of future reconfigurable virtual field agents. By locally processing the data and sending only high-level information, their energy requirements and electromagnetic signature are optimized. Moreover, the embedded Artificial Intelligence core enables these bio-inspired systems to recognize and learn new significant events. By duplicating human expertise in potentially hazardous places, our miniature visual event detector will allow early warning and contribute to better human decision making.
BarraCUDA - a fast short read sequence aligner using graphics processing units
2012-01-01
Background: With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General-purpose computing on graphics processing units (GPGPU) extracts computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy-efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software package based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings: Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computation-intensive alignment component of BWA to the GPU to take advantage of its massive parallelism. As a result, BarraCUDA offers a magnitude of performance boost in alignment throughput when compared to a CPU core, while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate alignment throughput. Conclusions: BarraCUDA is designed to take advantage of the parallelism of GPUs to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline such that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net PMID:22244497
An object-oriented approach to nested data parallelism
NASA Technical Reports Server (NTRS)
Sheffler, Thomas J.; Chatterjee, Siddhartha
1994-01-01
This paper describes an implementation technique for integrating nested data parallelism into an object-oriented language. Data-parallel programming employs sets of data called 'collections' and expresses parallelism as operations performed over the elements of a collection. When the elements of a collection are themselves collections, there is the possibility of 'nested data parallelism.' Few current programming languages, however, support nested data parallelism. In an object-oriented framework, a collection is a single object whose type defines the parallel operations that may be applied to it. Our goal is to design and build an object-oriented data-parallel programming environment supporting nested data parallelism. Our initial approach is built upon three fundamental additions to C++. We add new parallel base types by implementing them as classes, and add a new parallel collection type called a 'vector' that is implemented as a template. Only one new language feature is introduced: the 'foreach' construct, which is the basis for exploiting elementwise parallelism over collections. The strength of the method lies in the compilation strategy, which translates nested data-parallel C++ into ordinary C++. Extracting the potential parallelism in nested 'foreach' constructs is called 'flattening' nested parallelism. We show how to flatten 'foreach' constructs using a simple program transformation. Our prototype system produces vector code which has been successfully run on workstations, a CM-2, and a CM-5.
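The flattening transformation can be pictured with the standard segmented-vector representation: the nested collection becomes one flat values array plus segment lengths, and a nested foreach becomes one flat data-parallel operation. A numpy sketch follows (the representation is the generic segmented-vector idea, not the authors' C++ templates):

```python
# Flattened (segmented) representation of a nested collection.
import numpy as np

nested = [[1, 2, 3], [4], [], [5, 6]]            # conceptual vector<vector<int>>
values = np.array([v for seg in nested for v in seg])
lengths = np.array([len(seg) for seg in nested])

doubled = values * 2                             # one flat op == nested foreach

# a foreach of inner reductions becomes one segmented reduction
starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
nonempty = lengths > 0
sums = np.zeros(len(nested), dtype=values.dtype)
sums[nonempty] = np.add.reduceat(values, starts[nonempty])
print(sums)                                      # [ 6  4  0 11]
```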
Parallel Implementation of a High Order Implicit Collocation Method for the Heat Equation
NASA Technical Reports Server (NTRS)
Kouatchou, Jules; Halem, Milton (Technical Monitor)
2000-01-01
We combine a high-order compact finite difference approximation and collocation techniques to numerically solve the two-dimensional heat equation. The resulting method is implicit and can be parallelized with a strategy that allows parallelization across both time and space. We compare the parallel implementation of the new method with a classical implicit method, namely the Crank-Nicolson method, where the parallelization is done across space only. Numerical experiments are carried out on the SGI Origin 2000.
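For context, the reference scheme named above is the Crank-Nicolson method; here is a minimal 1-D sketch of its time step (the paper treats the 2-D equation; grid and step sizes are illustrative):

```python
# Crank-Nicolson steps for the 1-D heat equation u_t = u_xx with fixed ends.
import numpy as np
from scipy.linalg import solve_banded

n, dx, dt = 100, 0.01, 5e-5
r = dt / (2 * dx**2)
x = np.linspace(0, 1, n)
u = np.sin(np.pi * x)                    # initial condition, u = 0 at both ends

# (I - r*L) u^{n+1} = (I + r*L) u^n on interior points, L = tridiag(1,-2,1)
ab = np.zeros((3, n - 2))                # banded form of I - r*L
ab[0, 1:] = -r; ab[1, :] = 1 + 2 * r; ab[2, :-1] = -r

for _ in range(200):
    rhs = u[1:-1] + r * (u[:-2] - 2 * u[1:-1] + u[2:])
    u[1:-1] = solve_banded((1, 1), ab, rhs)
```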
Data parallel sorting for particle simulation
NASA Technical Reports Server (NTRS)
Dagum, Leonardo
1992-01-01
Sorting on a parallel architecture is a communications-intensive event which can incur a high penalty in applications where it is required. In the case of particle simulation, only integer sorting is necessary, and sequential implementations easily attain the minimum performance bound of O(N) for N particles. Parallel implementations, however, have to cope with the parallel sorting problem which, in addition to incurring a heavy communications cost, can make the minimum performance bound difficult to attain. This paper demonstrates how the sorting problem in a particle simulation can be reduced to a merging problem, and describes an efficient data-parallel algorithm to solve this merging problem in a particle simulation. The new algorithm is shown to be optimal under conditions usual for particle simulation, and its fieldwise implementation on the Connection Machine is analyzed in detail. The new algorithm is about four times faster than a fieldwise implementation of radix sort on the Connection Machine.
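The reduction of sorting to merging exploits the fact that few particles change cells per timestep: keep the unmoved keys (still sorted), sort only the movers, and merge the two runs in O(N). A numpy sketch with an assumed 1% mover fraction:

```python
# Sort-to-merge sketch: sort the few out-of-order keys, merge two sorted runs.
import numpy as np

rng = np.random.default_rng(4)
keys = np.sort(rng.integers(0, 4096, size=100_000))     # sorted cell indices
movers = rng.choice(keys.size, size=1000, replace=False)
keys[movers] = rng.integers(0, 4096, size=movers.size)  # ~1% change cells

stay_mask = np.ones(keys.size, bool); stay_mask[movers] = False
stayed = keys[stay_mask]                 # still sorted
moved = np.sort(keys[movers])            # small O(k log k) sort

# merge in O(N): searchsorted gives each mover's slot in the merged array
out = np.empty_like(keys)
pos = np.searchsorted(stayed, moved) + np.arange(moved.size)
mask = np.zeros(keys.size, bool); mask[pos] = True
out[mask] = moved; out[~mask] = stayed
assert np.all(out[:-1] <= out[1:])
```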
Implementation of DFT application on ternary optical computer
NASA Astrophysics Data System (ADS)
Junjie, Peng; Youyi, Fu; Xiaofeng, Zhang; Shuai, Kong; Xinyu, Wei
2018-03-01
Because of its huge number of data bits and low energy consumption, optical computing may be used in applications such as the DFT, which require a large amount of computation and can be implemented in parallel. Accordingly, DFT implementation methods in full parallel as well as in partial parallel are presented. Based on the resources of a ternary optical computer (TOC), extensive experiments were carried out. Experimental results show that the proposed schemes are correct and feasible. They provide a foundation for further exploration of applications on the TOC that need a large amount of calculation and can be processed in parallel.
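Because the DFT is a matrix-vector product, X_k = Σ_n x_n e^{−2πikn/N}, disjoint blocks of output frequencies can be computed independently, which is the essence of a partial-parallel scheme. A process-pool sketch standing in for the optical hardware (block count and sizes are illustrative):

```python
# Partial-parallel DFT sketch: independent output-frequency blocks.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

N = 1024
x = np.random.default_rng(5).random(N)

def dft_block(args):
    x, k0, k1 = args
    n = np.arange(x.size)
    k = np.arange(k0, k1)[:, None]
    return np.exp(-2j * np.pi * k * n / x.size) @ x

if __name__ == "__main__":
    bounds = np.linspace(0, N, 5, dtype=int)        # 4 output blocks
    jobs = [(x, bounds[i], bounds[i + 1]) for i in range(4)]
    with ProcessPoolExecutor(4) as ex:
        X = np.concatenate(list(ex.map(dft_block, jobs)))
    assert np.allclose(X, np.fft.fft(x))            # matches the FFT result
```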
Smart photodetector arrays for error control in page-oriented optical memory
NASA Astrophysics Data System (ADS)
Schaffer, Maureen Elizabeth
1998-12-01
Page-oriented optical memories (POMs) have been proposed to meet high-speed, high-capacity storage requirements for input/output-intensive computer applications. This technology offers the capability for storage and retrieval of optical data in two-dimensional pages, resulting in high throughput data rates. Since currently measured raw bit error rates for these systems fall several orders of magnitude short of industry requirements for binary data storage, powerful error control codes must be adopted. These codes must be designed to take advantage of the two-dimensional memory output. In addition, POMs require an optoelectronic interface to transfer the optical data pages to one or more electronic host systems. Conventional charge-coupled device (CCD) arrays can receive optical data in parallel, but the relatively slow serial electronic output of these devices creates a system bottleneck, eliminating the POM advantage of high transfer rates. CCD arrays are also "unintelligent" interfaces in that they offer little data processing capability. The optical data page can instead be received by two-dimensional arrays of "smart" photodetector elements that replace conventional CCD arrays. These smart photodetector arrays (SPAs) can perform fast parallel data decoding and error control, thereby providing an efficient optoelectronic interface between the memory and the electronic computer. This approach optimizes the computer memory system by combining the massive parallelism and high speed of optics with the diverse functionality, low cost, and local interconnection efficiency of electronics. In this dissertation we examine the design of smart photodetector arrays for use as the optoelectronic interface for page-oriented optical memory. We review options and technologies for SPA fabrication, develop SPA requirements, and determine SPA scalability constraints with respect to pixel complexity, electrical power dissipation, and optical power limits. Next, we examine data modulation and error correction coding for the purpose of error control in the POM system. These techniques are adapted, where possible, for 2D data and evaluated as to their suitability for a SPA implementation in terms of BER, code rate, decoder time, and pixel complexity. Our analysis shows that differential data modulation combined with relatively simple block codes known as array codes provides a powerful means to achieve the desired data transfer rates while reducing error rates to industry requirements. Finally, we demonstrate the first smart photodetector array designed to perform parallel error correction on an entire page of data and satisfy the sustained data rates of page-oriented optical memories. Our implementation integrates a monolithic PN photodiode array and differential input receiver for optoelectronic signal conversion with a cluster error correction code using 0.35-μm CMOS. This approach provides high sensitivity, low electrical power dissipation, and fast parallel correction of 2 × 2-bit cluster errors in an 8 × 8-bit code block to achieve corrected output data rates scalable to 102 Gbps in the current technology, increasing to 1.88 Tbps in 0.1-μm CMOS.
A High Order, Locally-Adaptive Method for the Navier-Stokes Equations
NASA Astrophysics Data System (ADS)
Chan, Daniel
1998-11-01
I have extended the FOSLS method of Cai, Manteuffel and McCormick (1997) and implemented it within the framework of a spectral element formulation using the Legendre polynomial basis. The FOSLS method solves the Navier-Stokes equations as a system of coupled first-order equations and provides the ellipticity needed for fast iterative matrix solvers like multigrid to operate efficiently. Each element is treated as an object and its properties are self-contained. Only C^0 continuity is imposed across element interfaces; this design allows local grid refinement and coarsening without the burden of an elaborate data structure, since only information along element boundaries is needed. With the FORTRAN 90 programming environment, I can maintain high computational efficiency by employing a hybrid parallel processing model: OpenMP directives provide parallelism at the loop level, executed within a shared-memory SMP, while the MPI protocol allows the distribution of elements to a cluster of SMPs connected via a commodity network. This talk will provide timing results and a comparison with a second-order finite difference method.
NASA Astrophysics Data System (ADS)
Liu, Yuan; Du, Zhihui; Chung, Shin Kee; Hooper, Shaun; Blair, David; Wen, Linqing
2012-12-01
We present a graphics processing unit (GPU)-accelerated time-domain low-latency algorithm to search for gravitational waves (GWs) from coalescing binaries of compact objects based on the summed parallel infinite impulse response (SPIIR) filtering technique. The aim is to facilitate fast detection of GWs with a minimum delay to allow prompt electromagnetic follow-up observations. To maximize the GPU acceleration, we apply an efficient batched parallel computing model that significantly reduces the number of synchronizations in SPIIR and optimizes the usage of the memory and hardware resource. Our code is tested on the CUDA ‘Fermi’ architecture in a GTX 480 graphics card and its performance is compared with a single core of Intel Core i7 920 (2.67 GHz). A 58-fold speedup is achieved while giving results in close agreement with the CPU implementation. Our result indicates that it is possible to conduct a full search for GWs from compact binary coalescence in real time with only one desktop computer equipped with a Fermi GPU card for the initial LIGO detectors which in the past required more than 100 CPUs.
Implementing Shared Memory Parallelism in MCBEND
NASA Astrophysics Data System (ADS)
Bird, Adam; Long, David; Dobson, Geoff
2017-09-01
MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheeler's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To utilise parallel hardware more effectively, OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND, and assesses the performance of the parallel method implemented in MCBEND.
Random number generators for large-scale parallel Monte Carlo simulations on FPGA
NASA Astrophysics Data System (ADS)
Lin, Y.; Wang, F.; Liu, B.
2018-05-01
Through parallelization, field programmable gate arrays (FPGAs) can achieve unprecedented speeds in large-scale parallel Monte Carlo (LPMC) simulations. FPGAs present both new constraints and new opportunities for the implementation of random number generators (RNGs), which are key elements of any Monte Carlo (MC) simulation system. Using empirical and application-based tests, this study evaluates all four of the RNGs used in previous FPGA-based MC studies, together with newly proposed FPGA implementations of two well-known high-quality RNGs that are suitable for LPMC studies on FPGA. One of the newly proposed FPGA implementations, a parallel version of the additive lagged Fibonacci generator (Parallel ALFG), is found to be the best among the evaluated RNGs in fulfilling the needs of LPMC simulations on FPGA.
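For reference, an ALFG computes x_n = x_{n−j} + x_{n−k} mod 2^m from a table of the last k values; parallel streams are commonly obtained by seeding an independent lag table per hardware pipeline. A minimal sketch with the common lags (5, 17) and illustrative seeding:

```python
# Additive lagged Fibonacci generator (ALFG) sketch:
# x[n] = x[n-5] + x[n-17] mod 2^32, one independently seeded stream per core.
class ALFG:
    J, K, MASK = 5, 17, (1 << 32) - 1

    def __init__(self, seed):
        s = seed | 1                      # ensure a non-degenerate lag table
        self.state = []
        for _ in range(self.K):
            s = (1103515245 * s + 12345) & self.MASK   # LCG just for seeding
            self.state.append(s)
        self.i = 0                        # index of the oldest table entry

    def next(self):
        k = self.state                    # negative indices wrap the table
        v = (k[self.i - self.J] + k[self.i - self.K]) & self.MASK
        k[self.i] = v                     # overwrite the oldest value
        self.i = (self.i + 1) % self.K
        return v

streams = [ALFG(seed) for seed in (11, 22, 33, 44)]   # one per parallel pipeline
sample = [g.next() / 2**32 for g in streams]           # uniforms in [0, 1)
```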
DOE Office of Scientific and Technical Information (OSTI.GOV)
Levine, Benjamin G., E-mail: ben.levine@temple.ed; Stone, John E., E-mail: johns@ks.uiuc.ed; Kohlmeyer, Axel, E-mail: akohlmey@temple.ed
2011-05-01
The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.
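The rate-limiting histogramming step has a simple serial form: bin all pairwise distances between two selections for a frame, tiling one selection so each distance block stays small (mirroring the GPU shared-memory tiling). A numpy sketch with illustrative sizes:

```python
# Pair-distance histogram sketch for one trajectory frame (CPU/numpy).
import numpy as np

rng = np.random.default_rng(6)
selA = rng.random((2000, 3)) * 10.0       # two atom selections, one frame
selB = rng.random((2000, 3)) * 10.0
edges = np.linspace(0.0, 10.0, 101)       # 100 distance bins

hist = np.zeros(100, dtype=np.int64)
for tile in np.array_split(selA, 8):      # tiling keeps each distance block small
    d = np.linalg.norm(tile[:, None, :] - selB[None, :, :], axis=2)
    hist += np.histogram(d, bins=edges)[0]
# normalizing hist by shell volumes and density would give g(r)
```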
Gravitational tree-code on graphics processing units: implementation in CUDA
NASA Astrophysics Data System (ADS)
Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon
2010-05-01
We present a new very fast tree-code which runs on massively parallel Graphics Processing Units (GPUs) with the NVIDIA CUDA architecture. The tree construction and calculation of multipole moments is carried out on the host CPU, while the force calculation, which consists of tree walks and evaluation of interaction lists, is carried out on the GPU. In this way we achieve a sustained performance of about 100 GFLOP/s and data transfer rates of about 50 GB/s. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. The code has a convenient user interface and is freely available for use. http://castle.strw.leidenuniv.nl/software/octgrav.html
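The tree walk hinges on the opening-angle test; a minimal sketch, assuming the usual Barnes-Hut criterion s/d < θ with illustrative field names (Octgrav's real CUDA walk is far more elaborate):

```cpp
#include <cmath>

struct Cell {
    double cx, cy, cz;   // cell centre
    double size;         // cell edge length s
};

// A cell is evaluated via its multipole expansion when s/d < theta,
// otherwise its children are opened and tested recursively.
bool accept_multipole(const Cell& c, double px, double py, double pz,
                      double theta = 0.5) {
    double dx = c.cx - px, dy = c.cy - py, dz = c.cz - pz;
    double d = std::sqrt(dx*dx + dy*dy + dz*dz);
    return c.size < theta * d;   // opening-angle criterion
}
```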
Zhao, Ming; Li, Yu; Peng, Leilei
2014-01-01
We present a novel excitation-emission multiplexed fluorescence lifetime microscopy (FLIM) method that surpasses current FLIM techniques in multiplexing capability. The method employs Fourier multiplexing to simultaneously acquire confocal fluorescence lifetime images of multiple excitation wavelength and emission color combinations at 44,000 pixels/sec. The system is built with low-cost CW laser sources and standard PMTs with versatile spectral configuration, and can be implemented as an add-on to commercial confocal microscopes. The Fourier lifetime confocal method allows fast multiplexed FLIM imaging, which makes it possible to monitor multiple biological processes in live cells. The low cost and compatibility with commercial systems could also make multiplexed FLIM more accessible to the biological research community. PMID:24921725
Integrated protocol for reliable and fast quantification and documentation of electrophoresis gels.
Rehbein, Peter; Schwalbe, Harald
2015-06-01
Quantitative analysis of electrophoresis gels is an important part of molecular cloning, as well as of protein expression and purification. Parallel quantifications of yield and purity can be most conveniently obtained from densitometric analysis. This communication reports a comprehensive, reliable and simple protocol for gel quantification and documentation, applicable to single samples and with special features for protein expression screens. As the major component of the protocol, the fully annotated code of a proprietary open source computer program for semi-automatic densitometric quantification of digitized electrophoresis gels is disclosed. The program ("GelQuant") is implemented in the C-based macro language of the widespread integrated development environment IGOR Pro. Copyright © 2014 Elsevier Inc. All rights reserved.
Large-scale trench-perpendicular mantle flow beneath northern Chile
NASA Astrophysics Data System (ADS)
Reiss, M. C.; Rumpker, G.; Woelbern, I.
2017-12-01
We investigate the anisotropic properties of the forearc region of the central Andean margin by analyzing shear-wave splitting from teleseismic and local earthquakes from the Nazca slab. The data stem from the Integrated Plate boundary Observatory Chile (IPOC) located in northern Chile, covering an approximately 120 km wide coastal strip between 17°-25° S with an average station spacing of 60 km. With partly over ten years of data, this data set is uniquely suited to address the long-standing debate about the mantle flow field at the South American margin, and in particular whether the flow field beneath the slab is parallel or perpendicular to the trench. Our measurements yield two distinct anisotropic layers. The teleseismic measurements show fast polarization directions that change from north to south along the trench, ranging from parallel to subparallel to the absolute plate motion and, given the geometry of absolute plate motion and the strike of the trench, mostly perpendicular to the trench. Shear-wave splitting from local earthquakes shows fast polarizations roughly aligned trench-parallel, but with short-scale variations which are indicative of a relatively shallow source. Comparisons between fast polarization directions and the strike of the local fault systems yield good agreement. We use forward modelling to test the influence of the upper layer on the teleseismic measurements. We show that the observed variations of the teleseismic measurements along the trench are caused by the anisotropy in the upper layer. Accordingly, the mantle layer is best characterized by an anisotropic fast axis parallel to the absolute plate motion, which is roughly trench-perpendicular. This anisotropy is likely caused by a combination of crystallographic preferred orientation of the mantle mineral olivine, as fossilized anisotropy in the slab, and entrained flow beneath the slab. We interpret the upper anisotropic layer to be confined to the crust of the overriding continental plate. This is explained by the shape-preferred orientation of micro-cracks in relation to local fault zones, which are oriented parallel to the overall strike of the Andean range. Our results do not provide any evidence for a significant contribution of trench-parallel mantle flow beneath the subducting slab to the measurements.
NASA Astrophysics Data System (ADS)
Lanthaler, S.; Pfefferlé, D.; Graves, J. P.; Cooper, W. A.
2017-04-01
An improved set of guiding-centre equations, expanded to one order higher in Larmor radius than usually written for guiding-centre codes, is derived for curvilinear flux coordinates and implemented in the orbit-following code VENUS-LEVIS. Aside from greatly improving the correspondence between guiding-centre and full particle trajectories, the most important effect of the additional Larmor radius corrections is to modify the definition of the guiding-centre's parallel velocity via the so-called Baños drift. The correct treatment of the guiding-centre push-forward with the Baños term leads to an anisotropic shift in the phase-space distribution of guiding-centres, consistent with the well-known magnetization term. The consequences of these higher-order terms are quantified in three cases where energetic ions are usually followed with standard guiding-centre equations: (1) neutral beam injection in a MAST-like low aspect-ratio spherical equilibrium, where the fast-ion-driven current is significantly larger than in previous calculations; (2) fast ion losses due to resonant magnetic perturbations, where a lower lost fraction and better confinement are confirmed; (3) alpha particles in the ripple field of the European DEMO, where the effect is found to be marginal.
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy
NASA Astrophysics Data System (ADS)
da Silva, Joakim; Ansorge, Richard; Jena, Rajesh
2015-06-01
Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the resulting dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques, where the dose can be monitored as it is being delivered, is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.
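A toy sketch of the scatter-based kernel superposition idea, assuming a Gaussian lateral kernel on a 2D grid; grid sizes, kernel shape, and all names are assumptions, not the paper's engine.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Each pencil-beam spot scatters a precomputed lateral kernel into the
// dose grid; on a GPU this maps naturally to one thread (block) per spot.
void superpose(std::vector<double>& dose, int nx, int ny,
               const std::vector<std::pair<int,int>>& spots,
               double weight, int radius, double sigma) {
    for (auto [sx, sy] : spots)
        for (int dy = -radius; dy <= radius; ++dy)
            for (int dx = -radius; dx <= radius; ++dx) {
                int x = sx + dx, y = sy + dy;
                if (x < 0 || x >= nx || y < 0 || y >= ny) continue;
                double r2 = double(dx*dx + dy*dy);
                dose[y*nx + x] += weight * std::exp(-r2 / (2*sigma*sigma));
            }
}
```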
Feng, Yanqiu; Song, Yanli; Wang, Cong; Xin, Xuegang; Feng, Qianjin; Chen, Wufan
2013-10-01
To develop and test a new algorithm for fast direct Fourier transform (DrFT) reconstruction of MR data on non-Cartesian trajectories composed of lines with equally spaced points. The DrFT, which is normally used as a reference in evaluating the accuracy of other reconstruction methods, can reconstruct images directly from non-Cartesian MR data without interpolation. However, DrFT reconstruction involves substantially intensive computation, which makes the DrFT impractical for routine clinical applications. In this article, the Chirp transform algorithm was introduced to accelerate the DrFT reconstruction of radial and Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction (PROPELLER) MRI data located on trajectories that are composed of lines with equally spaced points. The performance of the proposed Chirp transform algorithm-DrFT was evaluated by using simulation and in vivo MRI data. After implementing the algorithm on a graphics processing unit, the proposed Chirp transform algorithm-DrFT achieved an acceleration of approximately one order of magnitude, and the speed-up factor was further increased to approximately three orders of magnitude compared with the traditional single-thread DrFT reconstruction. Implementing the Chirp transform algorithm-DrFT on the graphics processing unit can efficiently calculate the DrFT reconstruction of radial and PROPELLER MRI data. Copyright © 2012 Wiley Periodicals, Inc.
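For reference, the chirp (Bluestein) identity that lets a DFT-like sum be evaluated as a convolution, which is presumably the basis of the Chirp transform algorithm-DrFT above:

```latex
% Using nk = [n^2 + k^2 - (k-n)^2]/2, the DFT becomes a chirp-modulated
% convolution that an FFT evaluates in O(N log N):
X_k \;=\; \sum_{n=0}^{N-1} x_n\, e^{-2\pi i nk/N}
    \;=\; e^{-\pi i k^2/N} \sum_{n=0}^{N-1}
          \Bigl( x_n\, e^{-\pi i n^2/N} \Bigr)\, e^{\pi i (k-n)^2/N}
```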
NASA Astrophysics Data System (ADS)
Chuthai, T.; Cole, M. O. T.; Wongratanaphisan, T.; Puangmali, P.
2018-01-01
This paper describes a high-precision motion control implementation for a flexure-jointed micromanipulator. A desktop experimental motion platform has been created based on a 3RUU parallel kinematic mechanism, driven by rotary voice coil actuators. The three arms supporting the platform have rigid links with compact flexure joints as integrated parts and are made by single-process 3D printing. The overall mechanism size is approximately 250×250×100 mm. The workspace is relatively large for a flexure-jointed mechanism, being approximately 20×20×6 mm. A servo-control implementation based on pseudo-rigid-body models (PRBM) of kinematic behavior combined with nonlinear PID control has been developed. This is shown to achieve fast response with good noise rejection and platform stability. However, large errors in absolute positioning occur due to deficiencies in the PRBM kinematics, which cannot accurately capture flexure compliance behavior. To overcome this problem, visual servoing is employed, where a digital microscopy system is used to directly measure the platform position by image processing. By adopting nonlinear PID feedback of measured angles for the actuated joints as inner control loops, combined with auxiliary feedback of vision-based measurements, the absolute positioning error can be eliminated. With controller gain tuning, fast dynamic response and low residual vibration of the end platform can be achieved, with absolute positioning accuracy within ±1 micron.
A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.
Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang
2018-01-01
The strategy of Divide-and-Conquer (D&C) is one of the most frequently used programming patterns for designing efficient algorithms in computer science, and it has been parallelized on shared memory systems and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm QuickHull, to serve as a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers develop their own efficient GPU implementations with less effort.
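A compact serial 2D QuickHull, showing the divide-and-conquer structure that the paper maps to the GPU; the GPU version replaces the recursion with work queues in the Tzeng-Owens style, so this sketch is only the algorithmic skeleton.

```cpp
#include <algorithm>
#include <vector>

struct Pt { double x, y; };

double cross(Pt a, Pt b, Pt p) {               // >0 if p is left of a->b
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Emits, in order, the hull vertices strictly between a and b (left side).
void hull_side(Pt a, Pt b, std::vector<Pt> pts, std::vector<Pt>& out) {
    if (pts.empty()) return;
    Pt far = *std::max_element(pts.begin(), pts.end(),
        [&](Pt p, Pt q) { return cross(a, b, p) < cross(a, b, q); });
    std::vector<Pt> left, right;               // "divide" step
    for (Pt p : pts) {
        if (cross(a, far, p) > 0) left.push_back(p);
        else if (cross(far, b, p) > 0) right.push_back(p);
    }
    hull_side(a, far, left, out);              // "conquer" both halves
    out.push_back(far);
    hull_side(far, b, right, out);
}
```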
FPGA-based prototype storage system with phase change memory
NASA Astrophysics Data System (ADS)
Li, Gezi; Chen, Xiaogang; Chen, Bomy; Li, Shunfen; Zhou, Mi; Han, Wenbing; Song, Zhitang
2016-10-01
With the ever-increasing amount of data being stored via social media, mobile telephony base stations, network devices, etc., database systems face severe bandwidth bottlenecks when moving vast amounts of data from storage to the processing nodes. At the same time, Storage Class Memory (SCM) technologies such as Phase Change Memory (PCM), with unique features like fast read access, high density, non-volatility, byte-addressability, positive response to increasing temperature, superior scalability, and zero standby leakage, have changed the landscape of modern computing and storage systems. In such a scenario, we present a storage system called FLEET, which can off-load partial or whole SQL queries from the CPU to the storage engine. FLEET uses an FPGA rather than conventional CPUs to implement the off-load engine due to its highly parallel nature. We have implemented an initial prototype of FLEET with PCM-based storage. The results demonstrate that significant performance and CPU utilization gains can be achieved by pushing selected query processing components inside PCM-based storage.
On the inversion of geodetic integrals defined over the sphere using 1-D FFT
NASA Astrophysics Data System (ADS)
García, R. V.; Alejo, C. A.
2005-08-01
An iterative method is presented which performs inversion of integrals defined over the sphere. The method is based on one-dimensional fast Fourier transform (1-D FFT) inversion and is implemented with the projected Landweber technique, which is used to solve constrained least-squares problems while reducing the associated 1-D cyclic-convolution error. The results obtained are as precise as those of the direct matrix inversion approach, but with better computational efficiency. A case study uses the inversion of Hotine's integral to obtain gravity disturbances from geoid undulations. Numerical convergence is also analyzed and comparisons with respect to the direct matrix inversion method using conjugate gradient (CG) iteration are presented. As with the CG method, the number of iterations needed to reach the optimum (i.e., smallest) error decreases as the measurement noise increases. Nevertheless, for discrete data given over a whole parallel band, the method can be applied directly without the projected Landweber step, since no cyclic-convolution error exists.
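The projected Landweber iteration referred to above has the standard form below; P_C is the projection onto the constraint set, and the step-size bound is the usual convergence condition, stated here as an assumption:

```latex
% Projected Landweber iteration for min_{x in C} ||Ax - b||^2,
% with step size 0 < \tau < 2/\|A\|^2:
x^{(k+1)} \;=\; P_C\!\left( x^{(k)} \;+\; \tau\, A^{\mathsf T}\bigl( b - A\,x^{(k)} \bigr) \right)
```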
Adaptive independent joint control of manipulators - Theory and experiment
NASA Technical Reports Server (NTRS)
Seraji, H.
1988-01-01
The author presents a simple decentralized adaptive control scheme for multijoint robot manipulators based on the independent joint control concept. The proposed control scheme for each joint consists of a PID (proportional, integral, and derivative) feedback controller and a position-velocity-acceleration feedforward controller, both with adjustable gains. The static and dynamic couplings that exist between the joint motions are compensated by the adaptive independent joint controllers while ensuring trajectory tracking. The proposed scheme is implemented on a MicroVAX II computer for motion control of the first three joints of a PUMA 560 arm. Experimental results are presented to demonstrate that trajectory tracking is achieved despite strongly coupled, highly nonlinear joint dynamics. The results confirm that the proposed decentralized adaptive control of manipulators is feasible, in spite of strong interactions between joint motions. The control scheme presented is computationally very fast and is amenable to parallel processing implementation within a distributed computing architecture, where each joint is controlled independently by a simple algorithm on a dedicated microprocessor.
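A minimal sketch of the per-joint controller structure described above, PID feedback plus position-velocity-acceleration feedforward; the adaptation laws that adjust the gains online are omitted, and all names and values are illustrative.

```cpp
// One independent joint controller: torque = PID(e) + feedforward(qd).
struct JointController {
    double kp, ki, kd;        // PID feedback gains (adapted online)
    double f0, f1, f2;        // pos/vel/acc feedforward gains (adapted online)
    double integral = 0.0;

    double torque(double e, double de, double dt,
                  double qd, double qd_dot, double qd_ddot) {
        integral += e * dt;   // e: tracking error, de: its derivative
        double feedback    = kp * e + ki * integral + kd * de;
        double feedforward = f0 * qd + f1 * qd_dot + f2 * qd_ddot;
        return feedback + feedforward;   // each joint computed independently
    }
};
```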
Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ahn, Tae-Hyuk; Chai, Juanjuan; Pan, Chongle
2014-09-29
Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. The algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. Availability and Implementation: Sigma was implemented in C++ with source code and binaries freely available at http://sigma.omicsbio.org.
Development of an integrated electrochemical system for in vitro yeast viability testing.
Adami, Andrea; Ress, Cristina; Collini, Cristian; Pedrotti, Severino; Lorenzelli, Leandro
2013-02-15
This work describes the development and testing of a microfabricated sensor for rapid cell growth monitoring, especially focused on yeast quality assessment for wine applications. The device consists of a NMOS ISFET sensor with Si(3)N(4) gate, able to indirectly monitor extracellular metabolism through pH variation of the medium, and a solid-state reference electrode implemented with PVC membranes doped with lipophilic salts (tetrabutylammonium-tetrabutylborate (TBA-TBB) and Potassium tetrakis(4-chlorphenyl)borate (KTClpB)). The use of a solid state reference electrode enables the implementation of a large number of cell assays in parallel, without the need of external conventional reference electrodes. Microbial growth testing has been performed both in standard culture conditions and on chip at different concentrations of ethanol in order to carry out a commonly used screening of wine yeast strains. Cell growth tests can be performed in few hours, providing a fast, sensitive and low cost analysis with respect to the conventional procedures. Copyright © 2012 Elsevier B.V. All rights reserved.
Wiens, Curtis N.; Artz, Nathan S.; Jang, Hyungseok; McMillan, Alan B.; Reeder, Scott B.
2017-01-01
Purpose: To develop an externally calibrated parallel imaging technique for three-dimensional multispectral imaging (3D-MSI) in the presence of metallic implants. Theory and Methods: A fast, ultrashort echo time (UTE) calibration acquisition is proposed to enable externally calibrated parallel imaging techniques near metallic implants. The proposed calibration acquisition uses a broadband radiofrequency (RF) pulse to excite the off-resonance induced by the metallic implant, fully phase-encoded imaging to prevent in-plane distortions, and UTE to capture rapidly decaying signal. The performance of the externally calibrated parallel imaging reconstructions was assessed using phantoms and in vivo examples. Results: Phantom and in vivo comparisons to self-calibrated parallel imaging acquisitions show that significant reductions in acquisition times can be achieved using externally calibrated parallel imaging with comparable image quality. Acquisition time reductions are particularly large for fully phase-encoded methods such as spectrally resolved fully phase-encoded three-dimensional (3D) fast spin-echo (SR-FPE), for which scan time reductions of up to 8 min were obtained. Conclusion: A fully phase-encoded acquisition with broadband excitation and UTE enabled externally calibrated parallel imaging for 3D-MSI, eliminating the need for repeated calibration regions at each frequency offset. Significant reductions in acquisition time can be achieved, particularly for fully phase-encoded methods like SR-FPE. PMID:27403613
National Combustion Code: Parallel Implementation and Performance
NASA Technical Reports Server (NTRS)
Quealy, A.; Ryder, R.; Norris, A.; Liu, N.-S.
2000-01-01
The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. CORSAIR-CCD is the current baseline reacting flow solver for NCC. This is a parallel, unstructured grid code which uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC flow solver to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This paper describes the parallel implementation of the NCC flow solver and summarizes its current parallel performance on an SGI Origin 2000. Earlier parallel performance results on an IBM SP-2 are also included. The performance improvements which have enabled a turnaround of less than 15 hours for a 1.3 million element fully reacting combustion simulation are described.
Lee, Jae H.; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T.; Seo, Youngho
2014-01-01
The primary goal of this project is to implement an iterative statistical image reconstruction algorithm, in this case maximum-likelihood expectation-maximization (MLEM) as used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to-program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing the MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains, with the goal to eventually make it usable in a clinical setting. PMID:27081299
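For reference, one MLEM iteration in its standard form (notation assumed: a_ij system matrix, y_i measured counts, λ_j voxel activity); the inner sums are the sparse matrix-vector products that map onto Spark/GraphX primitives:

```latex
% One MLEM update; the sums over i and l are forward- and back-projections:
\lambda_j^{(k+1)} \;=\; \frac{\lambda_j^{(k)}}{\sum_i a_{ij}}
    \sum_i a_{ij}\, \frac{y_i}{\sum_l a_{il}\, \lambda_l^{(k)}}
```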
Final report for the Tera Computer TTI CRADA
DOE Office of Scientific and Technical Information (OSTI.GOV)
Davidson, G.S.; Pavlakos, C.; Silva, C.
1997-01-01
Tera Computer and Sandia National Laboratories have completed a CRADA, which examined the Tera Multi-Threaded Architecture (MTA) for use with large codes of importance to industry and DOE. The MTA is an innovative architecture that uses parallelism to mask latency between memories and processors. The physical implementation is a parallel computer with high cross-section bandwidth and GaAs processors designed by Tera, which support many small computation threads and fast, lightweight context switches between them. When any thread blocks while waiting for memory accesses to complete, another thread immediately begins execution so that high CPU utilization is maintained. The Tera MTA parallel computer has a single, global address space, which is appealing when porting existing applications to a parallel computer. This ease of porting is further enabled by compiler technology that helps break computations into parallel threads. DOE and Sandia National Laboratories were interested in working with Tera to further develop this computing concept. While Tera Computer would continue the hardware development and compiler research, Sandia National Laboratories would work with Tera to ensure that their compilers worked well with important Sandia codes, most particularly CTH, a shock physics code used for weapon safety computations. In addition to that important code, Sandia National Laboratories would complete research on a robotic path planning code, SANDROS, which is important in manufacturing applications, and would evaluate the MTA performance on this code. Finally, Sandia would work directly with Tera to develop 3D visualization codes, which would be appropriate for use with the MTA. Each of these tasks has been completed to the extent possible, given that Tera has just completed the MTA hardware. All of the CRADA work had to be done on simulators.
First results of the silicon telescope using an 'artificial retina' for fast track finding
DOE Office of Scientific and Technical Information (OSTI.GOV)
Neri, N.; Abba, A.; Caponio, F.
We present the first results of the prototype of a silicon tracker with trigger capabilities based on a novel approach for fast track finding. The working principle of the 'artificial retina' is inspired by the processing of visual images by the brain and is based on extensive parallelization of data distribution and pattern recognition. The algorithm has been implemented in commercial FPGAs in three main logic modules: a switch for the routing of the detector hits, a pool of engines for the digital processing of the hits, and a block for the calculation of the track parameters. The architecture is fully pipelined and allows real-time reconstruction of tracks with a latency of less than 100 clock cycles, corresponding to 0.25 microseconds at a 400 MHz clock. The silicon telescope consists of 8 layers of single-sided silicon strip detectors with 512 strips each. The detector size is about 10 cm x 10 cm and the strip pitch is 183 μm. The detectors are read out by the Beetle chip, a custom ASIC developed for LHCb, which provides the measurement of the hit position and pulse height of 128 channels. The 'artificial retina' algorithm has been implemented on custom data acquisition boards based on Xilinx Kintex 7 lx160 FPGAs. The parameters of the detected tracks are finally transferred to a host PC via USB 3.0. The boards manage the read-out ASICs and the sampling of the analog channels. The read-out is performed at 40 MHz on 4 channels for each ASIC, which corresponds to decoding the telescope information at 1.1 MHz. We report on the first results of the fast tracking device and compare them with simulations. (authors)
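A toy software version of the retina idea, assuming straight-line tracks y = m x + q and a Gaussian cell response; the FPGA engines evaluate such cells in parallel, and all names and grid parameters here are illustrative.

```cpp
#include <cmath>
#include <vector>

struct Hit { double x, y; };   // hit position in the tracking planes

// Each cell of the (slope m, intercept q) grid accumulates a Gaussian
// weight from every hit according to its residual from the cell's ideal
// track; tracks are then read off as local maxima of the response.
std::vector<double> retina_response(const std::vector<Hit>& hits,
                                    int nm, int nq, double m0, double dm,
                                    double q0, double dq, double sigma) {
    std::vector<double> grid(nm * nq, 0.0);
    for (int im = 0; im < nm; ++im)
        for (int iq = 0; iq < nq; ++iq) {
            double m = m0 + im * dm, q = q0 + iq * dq;
            for (const Hit& h : hits) {
                double r = h.y - (m * h.x + q);       // residual to cell track
                grid[im * nq + iq] += std::exp(-r * r / (2 * sigma * sigma));
            }
        }
    return grid;
}
```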
Parallel fast multipole boundary element method applied to computational homogenization
NASA Astrophysics Data System (ADS)
Ptaszny, Jacek
2018-01-01
In the present work, a fast multipole boundary element method (FMBEM) and a parallel computer code for 3D elasticity problems are developed and applied to the computational homogenization of a solid containing spherical voids. The system of equations is solved by using the GMRES iterative solver. The boundary of the body is discretized by using quadrilateral serendipity elements with an adaptive numerical integration. Operations related to a single GMRES iteration, performed by traversing the corresponding tree structure upwards and downwards, are parallelized by using the OpenMP standard. The assignment of tasks to threads is based on the assumption that the tree nodes at which the moment transformations are initialized can be partitioned into disjoint sets of equal or approximately equal size and assigned to the threads. The achieved speedup as a function of the number of threads is examined.
NASA Astrophysics Data System (ADS)
Lu, San; Artemyev, A. V.; Angelopoulos, V.
2017-11-01
Magnetotail current sheet thinning is a distinctive feature of substorm growth phase, during which magnetic energy is stored in the magnetospheric lobes. Investigation of charged particle dynamics in such thinning current sheets is believed to be important for understanding the substorm energy storage and the current sheet destabilization responsible for substorm expansion phase onset. We use Time History of Events and Macroscale Interactions during Substorms (THEMIS) B and C observations in 2008 and 2009 at 18 - 25 RE to show that during magnetotail current sheet thinning, the electron temperature decreases (cooling), and the parallel temperature decreases faster than the perpendicular temperature, leading to a decrease of the initially strong electron temperature anisotropy (isotropization). This isotropization cannot be explained by pure adiabatic cooling or by pitch angle scattering. We use test particle simulations to explore the mechanism responsible for the cooling and isotropization. We find that during the thinning, a fast decrease of a parallel electric field (directed toward the Earth) can speed up the electron parallel cooling, causing it to exceed the rate of perpendicular cooling, and thus lead to isotropization, consistent with observation. If the parallel electric field is too small or does not change fast enough, the electron parallel cooling is slower than the perpendicular cooling, so the parallel electron anisotropy grows, contrary to observation. The same isotropization can also be accomplished by an increasing parallel electric field directed toward the equatorial plane. Our study reveals the existence of a large-scale parallel electric field, which plays an important role in magnetotail particle dynamics during the current sheet thinning process.
Fast Automatic Segmentation of White Matter Streamlines Based on a Multi-Subject Bundle Atlas.
Labra, Nicole; Guevara, Pamela; Duclap, Delphine; Houenou, Josselin; Poupon, Cyril; Mangin, Jean-François; Figueroa, Miguel
2017-01-01
This paper presents an algorithm for fast segmentation of white matter bundles from massive dMRI tractography datasets using a multisubject atlas. We use a distance metric to compare streamlines in a subject dataset to labeled centroids in the atlas, and label them using a per-bundle configurable threshold. In order to reduce segmentation time, the algorithm first preprocesses the data using a simplified distance metric to rapidly discard candidate streamlines in multiple stages, while guaranteeing that no false negatives are produced. The smaller set of remaining streamlines is then segmented using the original metric, thus eliminating any false positives from the preprocessing stage. As a result, a single-thread implementation of the algorithm can segment a dataset of almost 9 million streamlines in less than 6 minutes. Moreover, parallel versions of our algorithm for multicore processors and graphics processing units further reduce the segmentation time to less than 22 seconds and to 5 seconds, respectively. This performance enables the use of the algorithm in truly interactive applications for visualization, analysis, and segmentation of large white matter tractography datasets.
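A sketch of the multi-stage discard idea, assuming the exact metric is a maximum over corresponding points so that the endpoint distance is a valid lower bound (hence no false negatives, as the paper guarantees); the metrics here are simplified stand-ins for the bundle-atlas distances.

```cpp
#include <array>
#include <cmath>
#include <algorithm>
#include <vector>

using Streamline = std::vector<std::array<float,3>>;

// Cheap lower bound: compare only the two endpoints. If the exact metric
// is a max over corresponding points, this can never exceed it.
float endpoint_dist(const Streamline& s, const Streamline& c) {
    auto d = [](const std::array<float,3>& a, const std::array<float,3>& b) {
        return std::hypot(a[0]-b[0], a[1]-b[1], a[2]-b[2]);
    };
    return std::max(d(s.front(), c.front()), d(s.back(), c.back()));
}

bool matches(const Streamline& s, const Streamline& centroid, float thresh,
             float (*exact)(const Streamline&, const Streamline&)) {
    if (endpoint_dist(s, centroid) > thresh) return false; // fast discard
    return exact(s, centroid) <= thresh;                   // exact check
}
```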
Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores
NASA Astrophysics Data System (ADS)
Kegel, Philipp; Schellmann, Maraike; Gorlatch, Sergei
We compare two parallel programming approaches for multi-core systems: the well-known OpenMP and the recently introduced Threading Building Blocks (TBB) library by Intel®. The comparison is made using the parallelization of a real-world numerical algorithm for medical imaging. We develop several parallel implementations, and compare them w.r.t. programming effort, programming style and abstraction, and runtime performance. We show that TBB requires a considerable program re-design, whereas with OpenMP simple compiler directives are sufficient. While TBB appears to be less appropriate for parallelizing existing implementations, it fosters a good programming style and higher abstraction level for newly developed parallel programs. Our experimental measurements on a dual quad-core system demonstrate that OpenMP slightly outperforms TBB in our implementation.
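To fix ideas, the same placeholder pixel loop written both ways, an OpenMP directive versus a TBB parallel_for; process_pixel is an invented stand-in for the imaging kernel, not the paper's algorithm.

```cpp
#include <vector>
#include <tbb/parallel_for.h>

void process_pixel(std::vector<float>& img, std::size_t i) {
    img[i] = img[i] * 0.5f + 1.0f;              // placeholder kernel
}

void reconstruct_openmp(std::vector<float>& img) {
    #pragma omp parallel for                    // one directive, loop unchanged
    for (long i = 0; i < (long)img.size(); ++i)
        process_pixel(img, i);
}

void reconstruct_tbb(std::vector<float>& img) {
    tbb::parallel_for(std::size_t(0), img.size(),   // loop re-cast as a lambda
        [&](std::size_t i) { process_pixel(img, i); });
}
```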
Small-tip fast recovery imaging using non-slice-selective tailored tip-up pulses and RF-spoiling
Nielsen, Jon-Fredrik; Yoon, Daehyun; Noll, Douglas C.
2012-01-01
Small-tip fast recovery (STFR) imaging is a new steady-state imaging sequence that is a potential alternative to balanced steady-state free precession (bSSFP). Under ideal imaging conditions, STFR may provide comparable signal-to-noise ratio (SNR) and image contrast as bSSFP, but without signal variations due to resonance offset. STFR relies on a tailored “tip-up”, or “fast recovery”, RF pulse to align the spins with the longitudinal axis after each data readout segment. The design of the tip-up pulse is based on the acquisition of a separate off-resonance (B0) map. Unfortunately, the design of fast (a few ms) slice- or slab-selective RF pulses that accurately tailor the excitation pattern to the local B0 inhomogeneity over the entire imaging volume remains a challenging and unsolved problem. We introduce a novel implementation of STFR imaging based on non-slice-selective tip-up pulses, which simplifies the RF design problem significantly. Out-of-slice magnetization pathways are suppressed using RF-spoiling. Brain images obtained with this technique show excellent gray/white matter contrast, and point to the possibility of rapid steady-state T2/T1-weighted imaging with intrinsic suppression of cerebrospinal fluid, through-plane vessel signal, and off-resonance artifacts. In the future we expect STFR imaging to benefit significantly from parallel excitation hardware and high-order gradient shim systems. PMID:22511367
The Mercury System: Embedding Computation into Disk Drives
2004-08-20
enabling technologies to build extremely fast data search engines. We do this by moving the search closer to the data and performing it in hardware ... engine searches in parallel across a disk or disk surface. 2. System Parallelism: searching is off-loaded to search engines and the main processor can ...
NASA Astrophysics Data System (ADS)
Samaké, Abdoulaye; Rampal, Pierre; Bouillon, Sylvain; Ólason, Einar
2017-12-01
We present a parallel implementation framework for a new dynamic/thermodynamic sea-ice model, called neXtSIM, based on the Elasto-Brittle rheology and using an adaptive mesh. The spatial discretisation of the model is done using the finite-element method. The temporal discretisation is semi-implicit and the advection is achieved using either a pure Lagrangian scheme or an Arbitrary Lagrangian Eulerian scheme (ALE). The parallel implementation presented here focuses on the distributed-memory approach using the message-passing library MPI. The efficiency and the scalability of the parallel algorithms are illustrated by numerical experiments performed using up to 500 processor cores of a cluster computing system. The performance obtained by the proposed parallel implementation of the neXtSIM code is shown to be sufficient to perform simulations for state-of-the-art sea ice forecasting and geophysical process studies over a geographical domain of several million square kilometres, such as the Arctic region.
MPI_XSTAR: MPI-based Parallelization of the XSTAR Photoionization Program
NASA Astrophysics Data System (ADS)
Danehkar, Ashkbiz; Nowak, Michael A.; Lee, Julia C.; Smith, Randall K.
2018-02-01
We describe a program for the parallel implementation of multiple runs of XSTAR, a photoionization code that is used to predict the physical properties of an ionized gas from its emission and/or absorption lines. The parallelization program, called MPI_XSTAR, has been developed and implemented in the C++ language by using the Message Passing Interface (MPI) protocol, a conventional standard of parallel computing. We have benchmarked parallel multiprocessing executions of XSTAR, using MPI_XSTAR, against a serial execution of XSTAR, in terms of the parallelization speedup and the computing resource efficiency. Our experience indicates that the parallel execution runs significantly faster than the serial execution; however, the efficiency in terms of computing resource usage decreases as the number of processors used in the parallel computing increases.
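A minimal sketch of the MPI pattern such a driver can use, a static round-robin split of independent runs across ranks; run_model is a placeholder, not the MPI_XSTAR interface, which drives the real xstar executable.

```cpp
#include <mpi.h>
#include <cstdio>

void run_model(int job) { std::printf("job %d done\n", job); } // placeholder

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size, njobs = 100;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int job = rank; job < njobs; job += size)  // static round-robin split
        run_model(job);
    MPI_Barrier(MPI_COMM_WORLD);                    // wait for all runs
    MPI_Finalize();
}
```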
Efficient Implementation of Multigrid Solvers on Message-Passing Parallel Systems
NASA Technical Reports Server (NTRS)
Lou, John
1994-01-01
We discuss our implementation strategies for finite difference multigrid partial differential equation (PDE) solvers on message-passing systems. Our target parallel architectures are Intel parallel computers: the Delta and Paragon systems.
Fast Confocal Raman Imaging Using a 2-D Multifocal Array for Parallel Hyperspectral Detection.
Kong, Lingbo; Navas-Moreno, Maria; Chan, James W
2016-01-19
We present the development of a novel confocal hyperspectral Raman microscope capable of imaging at speeds up to 100 times faster than conventional point-scan Raman microscopy under high noise conditions. The microscope utilizes scanning galvomirrors to generate a two-dimensional (2-D) multifocal array at the sample plane, generating Raman signals simultaneously at each focus of the array pattern. The signals are combined into a single beam and delivered through a confocal pinhole before being focused through the slit of a spectrometer. To separate the signals from each row of the array, a synchronized scan mirror placed in front of the spectrometer slit positions the Raman signals onto different pixel rows of the detector. We devised an approach to deconvolve the superimposed signals and retrieve the individual spectra at each focal position within a given row. The galvomirrors were programmed to scan different focal arrays following Hadamard encoding patterns. A key feature of the Hadamard detection is the reconstruction of individual spectra with improved signal-to-noise ratio. Using polystyrene beads as test samples, we demonstrated not only that our system images faster than a conventional point-scan method but that it is especially advantageous under noisy conditions, such as when the CCD detector operates at fast read-out rates and high temperatures. This is the first demonstration of multifocal confocal Raman imaging in which parallel spectral detection is implemented along both axes of the CCD detector chip. We envision this novel 2-D multifocal spectral detection technique can be used to develop faster imaging spontaneous Raman microscopes with lower cost detectors.
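For reference, the exact demultiplexing step that Hadamard encoding permits, stated under the assumption of an order-n Hadamard matrix H with entries ±1 and H Hᵀ = n I:

```latex
% Superimposed measurements y = H x are inverted exactly by
x \;=\; \tfrac{1}{n}\, H^{\mathsf T} y
% which spreads detector noise over all n measurements, the
% signal-to-noise advantage noted in the abstract.
```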
Language Classification using N-grams Accelerated by FPGA-based Bloom Filters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jacob, A; Gokhale, M
N-gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (the HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85× the speed of comparable software and 1.45× that of the competing hardware design.
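A software sketch of the Bloom filter membership test the FPGA design realizes in on-chip RAM; the sizes and the salted hash are illustrative, and on hardware the k probes (and several classifiers) run concurrently rather than in this loop.

```cpp
#include <bitset>
#include <functional>
#include <string>

class BloomFilter {
    static constexpr std::size_t M = 1 << 20;   // bit array (on-chip RAM on FPGA)
    static constexpr std::size_t K = 4;         // hash probes, parallel in hardware
    std::bitset<M> bits;
    static std::size_t hash(const std::string& s, std::size_t salt) {
        return std::hash<std::string>{}(s + char('a' + salt)) % M;
    }
public:
    void insert(const std::string& ngram) {
        for (std::size_t i = 0; i < K; ++i) bits.set(hash(ngram, i));
    }
    bool maybe_contains(const std::string& ngram) const {
        for (std::size_t i = 0; i < K; ++i)
            if (!bits.test(hash(ngram, i))) return false;  // definitely absent
        return true;                                       // probably present
    }
};
```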
NASA Technical Reports Server (NTRS)
Juang, Hann-Ming Henry; Tao, Wei-Kuo; Zeng, Xi-Ping; Shie, Chung-Lin; Simpson, Joanne; Lang, Steve
2004-01-01
The capability for massively parallel programming (MPP) using a message passing interface (MPI) has been implemented into a three-dimensional version of the Goddard Cumulus Ensemble (GCE) model. The design for the MPP with MPI uses the concept of maintaining similar code structure between the whole domain and the portions after decomposition. Hence the model follows the same integration for single and multiple tasks (CPUs). It also requires minimal changes to the original code, so it is easily modified and/or managed by the model developers and users who have little knowledge of MPP. The entire model domain can be sliced into a one- or two-dimensional decomposition with a halo regime, which is overlaid on the partial domains. The halo regime requires that no data be fetched across tasks during the computational stage, but it must be updated before the next computational stage through data exchange via MPI. For reproducibility, transposing data among tasks is required for the spectral transform (Fast Fourier Transform, FFT), which is used in the anelastic version of the model for solving the pressure equation. The performance of the MPI-implemented codes (i.e., the compressible and anelastic versions) was tested on three different computing platforms. The major results are: 1) both versions show near-ideal speedup (about 99% efficiency) up to 256 tasks but not for 512 tasks; 2) the anelastic version has better speedup and efficiency because it requires more computation than the compressible version; 3) equal or approximately equal numbers of slices between the x- and y-directions provide the fastest integration due to fewer data exchanges; and 4) one-dimensional slices in the x-direction result in the slowest integration due to the need for more memory relocation for computation.
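The halo update described above has the following one-dimensional shape, a ghost-cell swap with each neighbour via MPI_Sendrecv; the real code exchanges multi-variable 3D slabs, so this is only the communication pattern.

```cpp
#include <mpi.h>
#include <vector>

// u[0] and u[n-1] are halo (ghost) cells; u[1] and u[n-2] are the owned
// boundary cells that the neighbours need before the next compute stage.
void exchange_halo(std::vector<double>& u, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    int n = (int)u.size();
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                 &u[n - 1], 1, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[n - 2], 1, MPI_DOUBLE, right, 1,
                 &u[0],     1, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);
}
```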
Hardware Architectures for Data-Intensive Computing Problems: A Case Study for String Matching
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel
DNA analysis is an emerging application of high-performance bioinformatics. Modern sequencing machines are able to provide, in a few hours, large input streams of data, which need to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns quickly and effectively may allow extending the scale and the reach of the investigations performed by biologists. Aho-Corasick is an exact, multiple-pattern matching algorithm often at the base of this application. High performance systems are a promising platform to accelerate this algorithm, which is computationally intensive but also inherently parallel. Nowadays, high performance systems also include heterogeneous processing elements, such as Graphics Processing Units (GPUs), to further accelerate parallel algorithms. Unfortunately, the Aho-Corasick algorithm exhibits large performance variability, depending on the size of the input streams, on the number of patterns to search and on the number of matches, and poses significant challenges for current high performance software and hardware implementations. An adequate mapping of the algorithm onto the target architecture, coping with the limits of the underlying hardware, is required to reach the desired high throughputs. In this paper, we discuss the implementation of the Aho-Corasick algorithm for GPU-accelerated high performance systems. We present an optimized implementation of Aho-Corasick for GPUs and discuss its tradeoffs on the Tesla T10 and the new Tesla T20 (codename Fermi) GPUs. We then integrate the optimized GPU code, respectively, in an MPI-based and in a pthreads-based load balancer to enable execution of the algorithm on clusters and large shared-memory multiprocessors (SMPs) accelerated with multiple GPUs.
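For concreteness, a compact serial Aho-Corasick automaton (goto/fail construction plus a streaming match count) over a byte alphabet; this shows the algorithm itself, not the paper's GPU mapping.

```cpp
#include <array>
#include <queue>
#include <string>
#include <vector>

struct AhoCorasick {
    struct Node {
        std::array<int,256> next; int fail = 0; int hits = 0;
        Node() { next.fill(-1); }
    };
    std::vector<Node> t = std::vector<Node>(1);   // node 0 is the root

    void add(const std::string& pattern) {        // build the trie
        int v = 0;
        for (unsigned char c : pattern) {
            if (t[v].next[c] < 0) { t[v].next[c] = (int)t.size(); t.emplace_back(); }
            v = t[v].next[c];
        }
        ++t[v].hits;
    }
    void build() {                                // BFS fills failure links
        std::queue<int> q;
        for (int c = 0; c < 256; ++c) {
            int u = t[0].next[c];
            if (u < 0) t[0].next[c] = 0; else { t[u].fail = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            t[v].hits += t[t[v].fail].hits;       // inherit matches via fail link
            for (int c = 0; c < 256; ++c) {
                int u = t[v].next[c];
                if (u < 0) t[v].next[c] = t[t[v].fail].next[c];
                else { t[u].fail = t[t[v].fail].next[c]; q.push(u); }
            }
        }
    }
    long match_count(const std::string& text) const {  // one state step per byte
        long n = 0; int v = 0;
        for (unsigned char c : text) { v = t[v].next[c]; n += t[v].hits; }
        return n;
    }
};
```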
A framework for plasticity implementation on the SpiNNaker neural architecture.
Galluppi, Francesco; Lagorce, Xavier; Stromatias, Evangelos; Pfeiffer, Michael; Plana, Luis A; Furber, Steve B; Benosman, Ryad B
2014-01-01
Many of the precise biological mechanisms of synaptic plasticity remain elusive, but simulations of neural networks have greatly enhanced our understanding of how specific global functions arise from the massively parallel computation of neurons and local Hebbian or spike-timing dependent plasticity rules. For simulating large portions of neural tissue, this has created an increasingly strong need for large scale simulations of plastic neural networks on special purpose hardware platforms, because synaptic transmissions and updates are badly matched to the computing style supported by current architectures. Because of the great diversity of biological plasticity phenomena and the corresponding diversity of models, there is a great need for testing various hypotheses about plasticity before committing to one hardware implementation. Here we present a novel framework for investigating different plasticity approaches on the SpiNNaker distributed digital neural simulation platform. The key innovation of the proposed architecture is to exploit the reconfigurability of the ARM processors inside SpiNNaker, dedicating a subset of them exclusively to process synaptic plasticity updates, while the rest perform the usual neural and synaptic simulations. We demonstrate the flexibility of the proposed approach by showing the implementation of a variety of spike- and rate-based learning rules, including standard Spike-Timing dependent plasticity (STDP), voltage-dependent STDP, and the rate-based BCM rule. We analyze their performance and validate them by running classical learning experiments in real time on a 4-chip SpiNNaker board. The result is an efficient, modular, flexible and scalable framework, which provides a valuable tool for the fast and easy exploration of learning models of very different kinds on the parallel and reconfigurable SpiNNaker system.
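As one concrete example of a rule such a framework hosts, a pair-based STDP weight update; the constants are illustrative, not SpiNNaker defaults.

```cpp
#include <algorithm>
#include <cmath>

// Potentiate when the presynaptic spike precedes the postsynaptic one,
// depress otherwise; the change decays exponentially with the spike gap.
float stdp_update(float w, float t_pre, float t_post) {
    const float a_plus = 0.01f, a_minus = 0.012f;   // learning rates (assumed)
    const float tau = 20.0f;                        // time constant in ms (assumed)
    const float w_max = 1.0f;
    float dt = t_post - t_pre;
    float dw = (dt >= 0) ?  a_plus  * std::exp(-dt / tau)   // pre before post
                         : -a_minus * std::exp( dt / tau);  // post before pre
    return std::clamp(w + dw, 0.0f, w_max);
}
```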
Martínez-Zarzuela, Mario; Gómez, Carlos; Díaz-Pernas, Francisco Javier; Fernández, Alberto; Hornero, Roberto
2013-10-01
Cross-Approximate Entropy (Cross-ApEn) is a useful measure to quantify the statistical dissimilarity of two time series. In spite of the advantage of Cross-ApEn over its one-dimensional counterpart (Approximate Entropy), only a few studies have applied it to biomedical signals, mainly due to its high computational cost. In this paper, we propose a fast GPU-based implementation of Cross-ApEn that makes its use over large amounts of multidimensional data feasible. The scheme followed is fully scalable and thus maximizes the use of the GPU regardless of the number of neural signals being processed. The approach consists of processing many trials or epochs simultaneously, independently of their origin. In the case of MEG data, these trials can come from different input channels or subjects. The proposed implementation achieves an average speedup greater than 250× against a CPU parallel version running on a processor containing six cores. A dataset of 30 subjects containing 148 MEG channels (49 epochs of 1024 samples per channel) can be analyzed using our development in about 30 min. The same processing takes 5 days on six cores and 15 days when running on a single core. The speedup is much larger if compared to a basic sequential Matlab® implementation, which would need 58 days per subject. To our knowledge, this is the first contribution of Cross-ApEn measure computation using GPUs. This study demonstrates that this hardware is, to date, the best option for the signal processing of biomedical data with Cross-ApEn. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
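For reference, the Cross-ApEn definition being computed per channel pair and epoch (template length m, tolerance r, N samples), in its standard form:

```latex
% C_i^m(r)(v||u) = #{ j : d[x_m(i), y_m(j)] <= r } / (N - m + 1)
\Phi^m(r)(v\|u) \;=\; \frac{1}{N-m+1}\sum_{i=1}^{N-m+1}\ln C_i^m(r)(v\|u),
\qquad
\text{Cross-ApEn}(m,r,N)(v\|u) \;=\; \Phi^m(r)(v\|u) - \Phi^{m+1}(r)(v\|u)
```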
A sweep algorithm for massively parallel simulation of circuit-switched networks
NASA Technical Reports Server (NTRS)
Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.
1992-01-01
A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384-processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel iPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.
4P: fast computing of population genetics statistics from large DNA polymorphism panels
Benazzo, Andrea; Panziera, Alex; Bertorelle, Giorgio
2015-01-01
Massive DNA sequencing has significantly increased the amount of data available for population genetics and molecular ecology studies. However, the parallel computation of simple statistics within and between populations from large panels of polymorphic sites is not yet available, making the exploratory analysis of a set or subset of data a very laborious task. Here, we present 4P (parallel processing of polymorphism panels), a stand-alone software program for the rapid computation of genetic variation statistics (including the joint frequency spectrum) from millions of DNA variants in multiple individuals and multiple populations. It handles a standard input file format commonly used to store DNA variation from empirical or simulation experiments. The computational performance of 4P was evaluated using large SNP (single nucleotide polymorphism) datasets from human genomes or obtained by simulations. 4P was faster or much faster than other comparable programs, and the impact of parallel computing using multicore computers or servers was evident. 4P is a useful tool for biologists who need a simple and rapid computer program to run exploratory population genetics analyses on large panels of genomic data. It is also particularly suitable for analyzing multiple data sets produced in simulation studies. Unix, Windows, and MacOS versions are provided, as well as the source code for easier pipeline implementations. PMID:25628874
Parallel implementation of approximate atomistic models of the AMOEBA polarizable model
NASA Astrophysics Data System (ADS)
Demerdash, Omar; Head-Gordon, Teresa
2016-11-01
In this work we present a replicated-data hybrid OpenMP/MPI implementation of a hierarchical progression of approximate classical polarizable models that yields speedups of up to ∼10 compared to the standard OpenMP implementation of the exact parent AMOEBA polarizable model. In addition, our parallel implementation exhibits reasonable weak and strong scaling. The resulting parallel software will prove useful for those who are interested in how molecular properties converge in the condensed phase with respect to the many-body expansion (MBE); it provides a fruitful test bed for exploring different electrostatic embedding schemes, and offers an interesting possibility for future exascale computing paradigms.
Box schemes and their implementation on the iPSC/860
NASA Technical Reports Server (NTRS)
Chattot, J. J.; Merriam, M. L.
1991-01-01
Research on algoriths for efficiently solving fluid flow problems on massively parallel computers is continued in the present paper. Attention is given to the implementation of a box scheme on the iPSC/860, a massively parallel computer with a peak speed of 10 Gflops and a memory of 128 Mwords. A domain decomposition approach to parallelism is used.
A fast, parallel algorithm for distance-dependent calculation of crystal properties
NASA Astrophysics Data System (ADS)
Stein, Matthew
2017-12-01
A fast, parallel algorithm for distance-dependent calculation and simulation of crystal properties is presented along with speedup results and methods of application. An illustrative example is used to compute the Lennard-Jones lattice constants up to 32 significant figures for 4 ≤ p ≤ 30 in the simple cubic, face-centered cubic, body-centered cubic, hexagonal-close-pack, and diamond lattices. In most cases, the known precision of these constants is more than doubled, and in some cases, corrected from previously published figures. The tools and strategies to make this computation possible are detailed along with application to other potentials, including those that model defects.
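For illustration, the constants in question are lattice sums of inverse powers of site distances, L_p = Σ' |r|^(-p) over all nonzero lattice vectors r. The direct truncated sum below, sketched in Python for the simple cubic lattice, conveys the quantity being computed; it is emphatically not the paper's method, which uses far faster summation strategies and arbitrary-precision arithmetic to reach 32 significant figures.

    # Naive truncated lattice sum for the simple cubic Lennard-Jones
    # constant L_p = sum over nonzero sites (i,j,k) of (i^2+j^2+k^2)^(-p/2).
    # Illustrative only: convergence is slow and limited to double precision.
    def lj_lattice_constant_sc(p, shells=40):
        total = 0.0
        for i in range(-shells, shells + 1):
            for j in range(-shells, shells + 1):
                for k in range(-shells, shells + 1):
                    if i == j == k == 0:
                        continue
                    total += (i * i + j * j + k * k) ** (-p / 2.0)
        return total

    print(lj_lattice_constant_sc(12))   # approx. 6.2021 for the simple cubic lattice

The triple loop is embarrassingly parallel over lattice shells, which is the kind of structure a parallel summation algorithm can exploit.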
Effects of ATC automation on precision approaches to closely spaced parallel runways
NASA Technical Reports Server (NTRS)
Slattery, R.; Lee, K.; Sanford, B.
1995-01-01
Improved navigational technology (such as the Microwave Landing System and the Global Positioning System) installed in modern aircraft will enable air traffic controllers to better utilize available airspace. Consequently, arrival traffic can fly approaches to parallel runways separated by smaller distances than are currently allowed. Previous simulation studies of advanced navigation approaches have found that controller workload is increased when there is a combination of aircraft that are capable of following advanced navigation routes and aircraft that are not. Research into Air Traffic Control automation at Ames Research Center has led to the development of the Center-TRACON Automation System (CTAS). The Final Approach Spacing Tool (FAST) is the component of the CTAS used in the TRACON area. The work in this paper examines, via simulation, the effects of FAST used for aircraft landing on closely spaced parallel runways. The simulation contained various combinations of aircraft, equipped and unequipped with advanced navigation systems. A set of simulations was run both manually and with an augmented set of FAST advisories to sequence aircraft, assign runways, and avoid conflicts. The results of the simulations are analyzed, measuring the airport throughput, aircraft delay, loss of separation, and controller workload.
Distributed Function Mining for Gene Expression Programming Based on Fast Reduction.
Deng, Song; Yue, Dong; Yang, Le-chan; Fu, Xiong; Feng, Ya-zhou
2016-01-01
For high-dimensional and massive data sets, traditional centralized gene expression programming (GEP) or improved algorithms lead to increased run-time and decreased prediction accuracy. To solve this problem, this paper proposes a new improved algorithm called distributed function mining for gene expression programming based on fast reduction (DFMGEP-FR). In DFMGEP-FR, fast attribution reduction in binary search algorithms (FAR-BSA) is proposed to quickly find the optimal attribution set, and the function consistency replacement algorithm is given to solve integration of the local function model. Thorough comparative experiments for DFMGEP-FR, centralized GEP and the parallel gene expression programming algorithm based on simulated annealing (parallel GEPSA) are included in this paper. For the waveform, mushroom, connect-4 and musk datasets, the comparative results show that the average time-consumption of DFMGEP-FR drops by 89.09%, 88.85%, 85.79% and 93.06%, respectively, in contrast to centralized GEP and by 12.5%, 8.42%, 9.62% and 13.75%, respectively, compared with parallel GEPSA. Six well-studied UCI test data sets demonstrate the efficiency and capability of our proposed DFMGEP-FR algorithm for distributed function mining.
Analysis and selection of optimal function implementations in massively parallel computer
Archer, Charles Jens [Rochester, MN; Peters, Amanda [Rochester, MN; Ratterman, Joseph D [Rochester, MN
2011-05-31
An apparatus, program product and method optimize the operation of a parallel computer system by, in part, collecting performance data for a set of implementations of a function capable of being executed on the parallel computer system based upon the execution of the set of implementations under varying input parameters in a plurality of input dimensions. The collected performance data may be used to generate selection program code that is configured to call selected implementations of the function in response to a call to the function under varying input parameters. The collected performance data may be used to perform more detailed analysis to ascertain the comparative performance of the set of implementations of the function under the varying input parameters.
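The selection mechanism described here resembles empirical autotuning. A minimal Python sketch of the idea follows, with all names hypothetical and only input size standing in for the "input dimensions" (the patented method generates selection program code over several dimensions of input parameters):

    import math, time

    # Benchmark each candidate implementation on sampled inputs, then
    # dispatch future calls to the fastest candidate for the nearest
    # sampled input size. A toy stand-in for the described mechanism.
    def build_selector(implementations, sample_inputs):
        best = {}
        for x in sample_inputs:
            timings = []
            for impl in implementations:
                t0 = time.perf_counter()
                impl(x)
                timings.append((time.perf_counter() - t0, impl))
            best[len(x)] = min(timings, key=lambda t: t[0])[1]

        def selector(x):
            size = min(best, key=lambda s: abs(s - len(x)))
            return best[size](x)
        return selector

    # Two interchangeable "implementations" of summation, selected by size.
    select_sum = build_selector([sum, math.fsum],
                                [list(range(n)) for n in (10, 1000, 100000)])
    print(select_sum(list(range(500))))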
Fast Particle Methods for Multiscale Phenomena Simulations
NASA Technical Reports Server (NTRS)
Koumoutsakos, P.; Wray, A.; Shariff, K.; Pohorille, Andrew
2000-01-01
We are developing particle methods oriented at improving computational modeling capabilities of multiscale physical phenomena in: (i) high Reynolds number unsteady vortical flows, (ii) particle laden and interfacial flows, (iii) molecular dynamics studies of nanoscale droplets and studies of the structure, functions, and evolution of the earliest living cell. The unifying computational approach involves particle methods implemented in parallel computer architectures. The inherent adaptivity, robustness and efficiency of particle methods make them a multidisciplinary computational tool capable of bridging the gap between micro-scale and continuum flow simulations. Using efficient tree data structures, multipole expansion algorithms, and improved particle-grid interpolation, particle methods allow for simulations using millions of computational elements, making possible the resolution of a wide range of length and time scales of these important physical phenomena. The current challenges in these simulations are: (i) the proper formulation of particle methods at the molecular and continuum levels for the discretization of the governing equations; (ii) the resolution of the wide range of time and length scales governing the phenomena under investigation; (iii) the minimization of numerical artifacts that may interfere with the physics of the systems under consideration; (iv) the parallelization of processes such as tree traversal and grid-particle interpolations. We are conducting simulations using vortex methods, molecular dynamics and smooth particle hydrodynamics, exploiting their unifying concepts such as: the solution of the N-body problem in parallel computers, highly accurate particle-particle and grid-particle interpolations, parallel FFTs and the formulation of processes such as diffusion in the context of particle methods. This approach enables us to transcend among seemingly unrelated areas of research.
Wilkinson, Karl A; Hine, Nicholas D M; Skylaris, Chris-Kriton
2014-11-11
We present a hybrid MPI-OpenMP implementation of Linear-Scaling Density Functional Theory within the ONETEP code. We illustrate its performance on a range of high performance computing (HPC) platforms comprising shared-memory nodes with fast interconnect. Our work has focused on applying OpenMP parallelism to the routines which dominate the computational load, attempting where possible to parallelize different loops from those already parallelized within MPI. This includes 3D FFT box operations, sparse matrix algebra operations, calculation of integrals, and Ewald summation. While the underlying numerical methods are unchanged, these developments represent significant changes to the algorithms used within ONETEP to distribute the workload across CPU cores. The new hybrid code exhibits much-improved strong scaling relative to the MPI-only code and permits calculations with a much higher ratio of cores to atoms. These developments result in a significantly shorter time to solution than was possible using MPI alone and facilitate the application of the ONETEP code to systems larger than previously feasible. We illustrate this with benchmark calculations from an amyloid fibril trimer containing 41,907 atoms. We use the code to study the mechanism of delamination of cellulose nanofibrils when undergoing sonication, a process which is controlled by a large number of interactions that collectively determine the structural properties of the fibrils. Many energy evaluations were needed for these simulations, and as these systems comprise up to 21,276 atoms this would not have been feasible without the developments described here.
NDL-v2.0: A new version of the numerical differentiation library for parallel architectures
NASA Astrophysics Data System (ADS)
Hadjidoukas, P. E.; Angelikopoulos, P.; Voglis, C.; Papageorgiou, D. G.; Lagaris, I. E.
2014-07-01
We present a new version of the numerical differentiation library (NDL) used for the numerical estimation of first and second order partial derivatives of a function by finite differencing. In this version we have restructured the serial implementation of the code so as to achieve optimal task-based parallelization. The pure shared-memory parallelization of the library has been based on the lightweight OpenMP tasking model allowing for the full extraction of the available parallelism and efficient scheduling of multiple concurrent library calls. On multicore clusters, parallelism is exploited by means of TORC, an MPI-based multi-threaded tasking library. The new MPI implementation of NDL provides optimal performance in terms of function calls and, furthermore, supports asynchronous execution of multiple library calls within legacy MPI programs. In addition, a Python interface has been implemented for all cases, exporting the functionality of our library to sequential Python codes. Catalog identifier: AEDG_v2_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDG_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 63036 No. of bytes in distributed program, including test data, etc.: 801872 Distribution format: tar.gz Programming language: ANSI Fortran-77, ANSI C, Python. Computer: Distributed systems (clusters), shared memory systems. Operating system: Linux, Unix. Has the code been vectorized or parallelized?: Yes. RAM: The library uses O(N) internal storage, N being the dimension of the problem. It can use up to O(N^2) internal storage for Hessian calculations, if a task throttling factor has not been set by the user. Classification: 4.9, 4.14, 6.5. Catalog identifier of previous version: AEDG_v1_0 Journal reference of previous version: Comput. Phys. Comm. 180 (2009) 1404 Does the new version supersede the previous version?: Yes Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, and sensitivity analysis. For a large number of scientific and engineering applications, the underlying functions correspond to simulation codes for which analytical estimation of derivatives is difficult or almost impossible. A parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems. Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries. Reasons for new version: The updated version was motivated by our endeavors to extend a parallel Bayesian uncertainty quantification framework [1], by incorporating higher order derivative information as in most state-of-the-art stochastic simulation methods such as Stochastic Newton MCMC [2] and Riemannian Manifold Hamiltonian MC [3]. The function evaluations are simulations with significant time-to-solution, which also varies with the input parameters, as in [1, 4]. The runtime of an N-body-type problem, for example, changes considerably with the introduction of a longer cut-off between the bodies.
In the first version of the library, the OpenMP-parallel subroutines spawn a new team of threads and distribute the function evaluations with a PARALLEL DO directive. This limits the functionality of the library as multiple concurrent calls require nested parallelism support from the OpenMP environment. Therefore, either their function evaluations will be serialized or processor oversubscription is likely to occur due to the increased number of OpenMP threads. In addition, the Hessian calculations include two explicit parallel regions that compute first the diagonal and then the off-diagonal elements of the array. Due to the barrier between the two regions, the parallelism of the calculations is not fully exploited. These issues have been addressed in the new version by first restructuring the serial code and then running the function evaluations in parallel using OpenMP tasks. Although the MPI-parallel implementation of the first version is capable of fully exploiting the task parallelism of the PNDL routines, it does not utilize the caching mechanism of the serial code and, therefore, performs some redundant function evaluations in the Hessian and Jacobian calculations. This can lead to: (a) higher execution times if the number of available processors is lower than the total number of tasks, and (b) significant energy consumption due to wasted processor cycles. Overcoming these drawbacks, which become critical as the time of a single function evaluation increases, was the primary goal of this new version. Due to the code restructure, the MPI-parallel implementation (and the OpenMP-parallel in accordance) avoids redundant calls, providing optimal performance in terms of the number of function evaluations. Another limitation of the library was that the library subroutines were collective and synchronous calls. In the new version, each MPI process can issue any number of subroutines for asynchronous execution. We introduce two library calls that provide global and local task synchronizations, similarly to the BARRIER and TASKWAIT directives of OpenMP. The new MPI-implementation is based on TORC, a new tasking library for multicore clusters [5-7]. TORC improves the portability of the software, as it relies exclusively on the POSIX-Threads and MPI programming interfaces. It allows MPI processes to utilize multiple worker threads, offering a hybrid programming and execution environment similar to MPI+OpenMP, in a completely transparent way. Finally, to further improve the usability of our software, a Python interface has been implemented on top of both the OpenMP and MPI versions of the library. This allows sequential Python codes to exploit shared and distributed memory systems. Summary of revisions: The revised code improves the performance of both parallel (OpenMP and MPI) implementations. The functionality and the user-interface of the MPI-parallel version have been extended to support the asynchronous execution of multiple PNDL calls, issued by one or multiple MPI processes. A new underlying tasking library increases portability and allows MPI processes to have multiple worker threads. For both implementations, an interface to the Python programming language has been added. Restrictions: The library uses only double precision arithmetic. The MPI implementation assumes the homogeneity of the execution environment provided by the operating system. 
Specifically, the processes of a single MPI application must have identical address spaces, so that a user function resides at the same virtual address. In addition, address space layout randomization should not be used for the application. Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed. Running time: Running time depends on the function's complexity. The test run took 23 ms for the serial distribution, 25 ms for the OpenMP version with 2 threads, and 53 ms and 1.01 s for the MPI parallel distribution using 2 threads and 2 processes respectively, with the yield-time for idle workers equal to 10 ms. References: [1] P. Angelikopoulos, C. Papadimitriou, P. Koumoutsakos, Bayesian uncertainty quantification and propagation in molecular dynamics simulations: a high performance computing framework, J. Chem. Phys. 137 (14). [2] H.P. Flath, L.C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations, SIAM J. Sci. Comput. 33 (1) (2011) 407-432. [3] M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73 (2) (2011) 123-214. [4] P. Angelikopoulos, C. Papadimitriou, P. Koumoutsakos, Data driven, predictive molecular dynamics for nanoscale flow simulations under uncertainty, J. Phys. Chem. B 117 (47) (2013) 14808-14816. [5] P.E. Hadjidoukas, E. Lappas, V.V. Dimakopoulos, A runtime library for platform-independent task parallelism, in: PDP, IEEE, 2012, pp. 229-236. [6] C. Voglis, P.E. Hadjidoukas, D.G. Papageorgiou, I. Lagaris, A parallel hybrid optimization algorithm for fitting interatomic potentials, Appl. Soft Comput. 13 (12) (2013) 4481-4492. [7] P.E. Hadjidoukas, C. Voglis, V.V. Dimakopoulos, I. Lagaris, D.G. Papageorgiou, Supporting adaptive and irregular parallelism for non-linear numerical optimization, Appl. Math. Comput. 231 (2014) 544-559.
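For readers unfamiliar with the numerics underlying the library, the essential ingredient is a finite-difference formula whose step balances truncation error against round-off error. A minimal serial sketch in Python, illustrative only (NDL chooses formulas per requested accuracy level, respects bound constraints, caches evaluations, and runs the function calls in parallel):

    import math

    # Central-difference gradient with the classic step choice
    # h ~ cbrt(eps) * max(1, |x_i|), balancing the O(h^2) truncation
    # error against the O(eps/h) round-off error.
    def gradient(f, x, eps=2.2e-16):
        g = []
        for i in range(len(x)):
            h = eps ** (1.0 / 3.0) * max(1.0, abs(x[i]))
            xp = list(x); xp[i] += h
            xm = list(x); xm[i] -= h
            g.append((f(xp) - f(xm)) / (2.0 * h))
        return g

    print(gradient(lambda v: v[0] ** 2 + math.sin(v[1]), [1.0, 0.5]))

The 2N function evaluations are mutually independent, which is what makes the task-parallel execution described above so effective when each evaluation is an expensive simulation.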
Parallel integer sorting with medium and fine-scale parallelism
NASA Technical Reports Server (NTRS)
Dagum, Leonardo
1993-01-01
Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort is designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128 processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray Y-MP.
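The barrel-sort idea reduces to hashing keys into contiguous key ranges, one per processor, then sorting each range independently; the expensive part on real hardware is the batched all-to-all scatter, which this serial Python sketch elides:

    # Serial skeleton of barrel-sort: distribute integer keys into
    # contiguous ranges ("barrels", one per processor), then sort each
    # barrel locally. On a message-passing machine the scatter step
    # becomes batched communication between nodes.
    def barrel_sort(keys, n_barrels, key_min, key_max):
        width = (key_max - key_min) // n_barrels + 1
        barrels = [[] for _ in range(n_barrels)]
        for k in keys:
            barrels[(k - key_min) // width].append(k)
        out = []
        for b in barrels:
            out.extend(sorted(b))   # each node would sort its barrel in parallel
        return out

    print(barrel_sort([7, 3, 9, 1, 14, 0], 4, 0, 15))   # [0, 1, 3, 7, 9, 14]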
NASA Astrophysics Data System (ADS)
Qiang, Ji
2017-10-01
A three-dimensional (3D) Poisson solver with longitudinal periodic and transverse open boundary conditions can have important applications in beam physics of particle accelerators. In this paper, we present a fast efficient method to solve the Poisson equation using a spectral finite-difference method. This method uses a computational domain that contains the charged particle beam only and has a computational complexity of O(N_u log(N_mode)), where N_u is the total number of unknowns and N_mode is the maximum number of longitudinal or azimuthal modes. This saves both the computational time and the memory required when using an artificial boundary condition in a large extended computational domain. The new 3D Poisson solver is parallelized using a message passing interface (MPI) on multi-processor computers and shows a reasonable parallel performance up to hundreds of processor cores.
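To illustrate the spectral flavor of such solvers, though not this paper's mixed spectral finite-difference scheme or its open transverse boundaries, a fully periodic Poisson problem -∇²φ = ρ diagonalizes under the FFT; a numpy sketch, assuming ρ is sampled on a uniform square grid:

    import numpy as np

    # FFT-based solve of -laplacian(phi) = rho with periodic boundary
    # conditions in both directions (a simplification; the paper treats
    # longitudinal periodic plus transverse *open* boundaries).
    def poisson_periodic_2d(rho, L=1.0):
        n = rho.shape[0]
        k = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)
        kx, ky = np.meshgrid(k, k, indexing="ij")
        k2 = kx ** 2 + ky ** 2
        k2[0, 0] = 1.0                    # avoid dividing the zero mode
        phi_hat = np.fft.fft2(rho) / k2
        phi_hat[0, 0] = 0.0               # pin the mean of the potential
        return np.fft.ifft2(phi_hat).real

    n = 64
    x = np.linspace(0.0, 1.0, n, endpoint=False)
    rho = np.sin(2.0 * np.pi * x)[:, None] * np.ones(n)
    phi = poisson_periodic_2d(rho)
    print(np.allclose(phi, rho / (2.0 * np.pi) ** 2))   # True: single-mode check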
Fast data reconstruction method of Fourier transform imaging spectrometer based on multi-core CPU
NASA Astrophysics Data System (ADS)
Yu, Chunchao; Du, Debiao; Xia, Zongze; Song, Li; Zheng, Weijian; Yan, Min; Lei, Zhenggang
2017-10-01
An imaging spectrometer acquires a two-dimensional spatial image and a one-dimensional spectrum at the same time, which is highly useful in color and spectral measurements, true-color image synthesis, military reconnaissance and so on. To realize fast reconstruction of Fourier transform imaging spectrometer data, this paper presents an optimized reconstruction algorithm based on OpenMP parallel computing, applied to data from the HyperSpectral Imager of the Chinese `HJ-1' satellite. The results show that the method, based on multi-core parallel computing, makes effective use of multi-core CPU hardware resources and significantly improves the efficiency of spectrum reconstruction. Applied to workstations with more cores, the technique should make it possible to complete real-time processing of Fourier transform imaging spectrometer data on a single computer.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, J.; Alpan, F. A.; Fischer, G.A.
2011-07-01
Traditional two-dimensional (2D)/one-dimensional (1D) SYNTHESIS methodology has been widely used to calculate fast neutron (>1.0 MeV) fluence exposure to the reactor pressure vessel in the belt-line region. However, it is expected that this methodology cannot provide accurate fast neutron fluence calculations at elevations far above or below the active core region. A three-dimensional (3D) parallel discrete ordinates calculation for ex-vessel neutron dosimetry on a Westinghouse 4-Loop XL Pressurized Water Reactor has been done. It shows good agreement between the calculated results and measured results. Furthermore, the results show very different fast neutron flux values at some of the former plate locations and elevations above and below the active core than those calculated by a 2D/1D SYNTHESIS method. This indicates that for certain irregular reactor internal structures, where the fast neutron flux has a very strong local effect, it is required to use a 3D transport method to calculate accurate fast neutron exposure. (authors)
Implementation of a Synchronized Oscillator Circuit for Fast Sensing and Labeling of Image Objects
Kowalski, Jacek; Strzelecki, Michal; Kim, Hyongsuk
2011-01-01
We present an application-specific integrated circuit (ASIC) CMOS chip that implements a synchronized oscillator cellular neural network with a matrix size of 32 × 32 for object sensing and labeling in binary images. Networks of synchronized oscillators are a recently developed tool for image segmentation and analysis. Its parallel network operation is based on a “temporal correlation” theory that attempts to describe scene recognition as if performed by the human brain. The synchronized oscillations of neuron groups attract a person’s attention if he or she is focused on a coherent stimulus (image object). For more than one perceived stimulus, these synchronized patterns switch in time between different neuron groups, thus forming temporal maps that code several features of the analyzed scene. In this paper, a new oscillator circuit based on a mathematical model is proposed, and the network architecture and chip functional blocks are presented and discussed. The proposed chip is implemented in AMIS 0.35 μm C035M-D 5M/1P technology. An application of the proposed network chip for the segmentation of insulin-producing pancreatic islets in magnetic resonance liver images is presented. PMID:22163803
Real-Time Data Streaming and Storing Structure for the LHD's Fusion Plasma Experiments
NASA Astrophysics Data System (ADS)
Nakanishi, Hideya; Ohsuna, Masaki; Kojima, Mamoru; Imazu, Setsuo; Nonomura, Miki; Emoto, Masahiko; Yoshida, Masanobu; Iwata, Chie; Ida, Katsumi
2016-02-01
The LHD data acquisition and archiving system, i.e., LABCOM system, has been fully equipped with high-speed real-time acquisition, streaming, and storage capabilities. To deal with more than 100 MB/s continuously generated data at each data acquisition (DAQ) node, DAQ tasks have been implemented as multitasking and multithreaded ones in which the shared memory plays the most important role for inter-process fast and massive data handling. By introducing a 10-second time chunk named “subshot,” endless data streams can be stored into a consecutive series of fixed length data blocks so that they will soon become readable by other processes even while the write process is continuing. Real-time device and environmental monitoring are also implemented in the same way with further sparse resampling. The central data storage has been separated into two layers to be capable of receiving multiple 100 MB/s inflows in parallel. For the frontend layer, high-speed SSD arrays are used as the GlusterFS distributed filesystem which can provide max. 2 GB/s throughput. Those design optimizations would be informative for implementing the next-generation data archiving system in big physics, such as ITER.
Wiens, Curtis N; Artz, Nathan S; Jang, Hyungseok; McMillan, Alan B; Reeder, Scott B
2017-06-01
To develop an externally calibrated parallel imaging technique for three-dimensional multispectral imaging (3D-MSI) in the presence of metallic implants. A fast, ultrashort echo time (UTE) calibration acquisition is proposed to enable externally calibrated parallel imaging techniques near metallic implants. The proposed calibration acquisition uses a broadband radiofrequency (RF) pulse to excite the off-resonance induced by the metallic implant, fully phase-encoded imaging to prevent in-plane distortions, and UTE to capture rapidly decaying signal. The performance of the externally calibrated parallel imaging reconstructions was assessed using phantoms and in vivo examples. Phantom and in vivo comparisons to self-calibrated parallel imaging acquisitions show that significant reductions in acquisition times can be achieved using externally calibrated parallel imaging with comparable image quality. Acquisition time reductions are particularly large for fully phase-encoded methods such as spectrally resolved fully phase-encoded three-dimensional (3D) fast spin-echo (SR-FPE), in which scan time reductions of up to 8 min were obtained. A fully phase-encoded acquisition with broadband excitation and UTE enabled externally calibrated parallel imaging for 3D-MSI, eliminating the need for repeated calibration regions at each frequency offset. Significant reductions in acquisition time can be achieved, particularly for fully phase-encoded methods like SR-FPE. Magn Reson Med 77:2303-2309, 2017. © 2016 International Society for Magnetic Resonance in Medicine.
A compact linear accelerator based on a scalable microelectromechanical-system RF-structure
NASA Astrophysics Data System (ADS)
Persaud, A.; Ji, Q.; Feinberg, E.; Seidl, P. A.; Waldron, W. L.; Schenkel, T.; Lal, A.; Vinayakumar, K. B.; Ardanuc, S.; Hammer, D. A.
2017-06-01
A new approach for a compact radio-frequency (RF) accelerator structure is presented. The new accelerator architecture is based on the Multiple Electrostatic Quadrupole Array Linear Accelerator (MEQALAC) structure that was first developed in the 1980s. The MEQALAC utilized RF resonators producing the accelerating fields and providing for higher beam currents through parallel beamlets focused using arrays of electrostatic quadrupoles (ESQs). While the early work obtained ESQs with lateral dimensions on the order of a few centimeters, using a printed circuit board (PCB), we reduce the characteristic dimension to the millimeter regime, while massively scaling up the potential number of parallel beamlets. Using Microelectromechanical systems scalable fabrication approaches, we are working on further reducing the characteristic dimension to the sub-millimeter regime. The technology is based on RF-acceleration components and ESQs implemented in the PCB or silicon wafers where each beamlet passes through beam apertures in the wafer. The complete accelerator is then assembled by stacking these wafers. This approach has the potential for fast and inexpensive batch fabrication of the components and flexibility in system design for application specific beam energies and currents. For prototyping the accelerator architecture, the components have been fabricated using the PCB. In this paper, we present proof of concept results of the principal components using the PCB: RF acceleration and ESQ focusing. Ongoing developments on implementing components in silicon and scaling of the accelerator technology to high currents and beam energies are discussed.
MEvoLib v1.0: the first molecular evolution library for Python.
Álvarez-Jarreta, Jorge; Ruiz-Pesini, Eduardo
2016-10-28
Molecular evolution studies involve many different hard computational problems solved, in most cases, with heuristic algorithms that provide a nearly optimal solution. Hence, diverse software tools exist for the different stages involved in a molecular evolution workflow. We present MEvoLib, the first molecular evolution library for Python, providing a framework to work with different tools and methods involved in the common tasks of molecular evolution workflows. In contrast with already existing bioinformatics libraries, MEvoLib is focused on the stages involved in molecular evolution studies, enclosing the set of tools with a common purpose in a single high-level interface with fast access to their frequent parameterizations. The gene clustering from partial or complete sequences has been improved with a new method that integrates accessible external information (e.g. GenBank's features data). Moreover, MEvoLib adjusts the fetching process from NCBI databases to optimize the download bandwidth usage. In addition, it has been implemented using parallelization techniques to cope with even large-scale scenarios. MEvoLib is the first library for Python designed to facilitate molecular evolution research for both expert and novice users. Its unique interface for each common task comprises several tools with their most used parameterizations. It also includes a method to take advantage of biological knowledge to improve the gene partition of sequence datasets. Additionally, its implementation incorporates parallelization techniques to reduce computational costs when handling very large input datasets.
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak
1999-01-01
The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
Costello, Carol M
2016-10-01
The American Society of Anesthesiology (ASA) guidelines for pediatric preoperative fasting have been a standard for well over a decade. However, use of protocols involving an excessive fasting duration exposes patients to the physiological impacts of fluid volume loss. The current project aimed to improve fluid supplementation during presurgical fasting in pediatric patients at an academic medical center. Specific objectives were to increase clinical staff knowledge regarding ASA fasting standards and implement them in specific pediatric patient populations. The Joanna Briggs Institute Practical Application of Clinical Evidence System and Getting Research into Practice tools were used. A baseline audit assessed compliance with best practice criteria regarding staff education, patient/family instruction and preoperative fasting times. Intervention outcomes were evaluated in a post-implementation criteria audit. Although compliance with fasting less than 12 hours more than doubled, only half of these patients/parents adhered to the guidelines. No excessive fasting events were attributed to a language barrier. There were no insufficient fasting events. Moderate success with fasting compliance was demonstrated when patients/parents were taught the multi-step ASA nil per os (nothing by mouth) instructions. This complexity may have contributed to non-compliance and pointed to the need for enhanced teaching strategies. No operative start delays related to insufficient fasting indicated surgical scheduling flexibility was not at risk, and anesthesia providers had adopted the guidelines. Interdisciplinary engagement in this project was significantly impacted by director-level communication, which will be a key strategy for future implementations.
Fast non-overlapping Schwarz domain decomposition methods for solving the neutron diffusion equation
NASA Astrophysics Data System (ADS)
Jamelot, Erell; Ciarlet, Patrick
2013-05-01
Studying numerically the steady state of a nuclear core reactor is expensive, in terms of memory storage and computational time. In order to address both requirements, one can use a domain decomposition method, implemented on a parallel computer. We present here such a method for the mixed neutron diffusion equations, discretized with Raviart-Thomas-Nédélec finite elements. This method is based on the Schwarz iterative algorithm with Robin interface conditions to handle communications. We analyse this method from the continuous level down to the discrete level, and we give some numerical results in a realistic, highly heterogeneous 3D configuration. Computations are carried out with the MINOS solver of the APOLLO3® neutronics code. APOLLO3 is a registered trademark in France.
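As a toy illustration of the Schwarz iteration, the sketch below solves -u'' = f on (0,1) with two subdomains, using overlapping subdomains and Dirichlet transmission for brevity; the paper instead uses non-overlapping subdomains with Robin interface conditions, which removes the need for overlap:

    import numpy as np

    # Alternating Schwarz for -u'' = f, u(0) = u(1) = 0, on two
    # overlapping subdomains; each sweep solves one subdomain with the
    # neighbor's current values as interface (Dirichlet) data.
    def tridiag_solve(f, left, right, h):
        n = len(f)
        A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
             - np.diag(np.ones(n - 1), -1))
        b = h * h * np.asarray(f, float)
        b[0] += left
        b[-1] += right
        return np.linalg.solve(A, b)

    n, h = 99, 1.0 / 100.0       # interior points x_i = (i+1)h
    f = np.ones(n)
    u = np.zeros(n)
    lo, hi = 39, 59              # subdomains [0,hi) and [lo,n) overlap by 20 points
    for _ in range(50):
        u[:hi] = tridiag_solve(f[:hi], 0.0, u[hi], h)       # left solve
        u[lo:] = tridiag_solve(f[lo:], u[lo - 1], 0.0, h)   # right solve
    print(u[n // 2])             # approx. 0.125, the exact u(1/2) for f = 1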
Explicit integration with GPU acceleration for large kinetic networks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brock, Benjamin; Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830; Belt, Andrew
2015-12-01
We demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve on the order of 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. This orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
Adaptive correlation filter-based video stabilization without accumulative global motion estimation
NASA Astrophysics Data System (ADS)
Koh, Eunjin; Lee, Chanyong; Jeong, Dong Gil
2014-12-01
We present a digital video stabilization approach that provides both robustness and efficiency for practical applications. In this approach, we adopt a stabilization model that efficiently maintains spatio-temporal information from past input frames and can track the original stabilization position. Because of this stabilization model, the proposed method does not need accumulative global motion estimation and can recover the original position even if interframe motion estimation fails. It can also intelligently handle damaged or interrupted video sequences. Moreover, because it is simple and well suited to parallel schemes, we readily implemented it on a commercial field programmable gate array and on a graphics processing unit board with the compute unified device architecture. Experimental results show that the proposed approach is both fast and robust.
Conjugate gradient based projection - A new explicit methodology for frictional contact
NASA Technical Reports Server (NTRS)
Tamma, Kumar K.; Li, Maocheng; Sha, Desong
1993-01-01
With special attention towards applicability to parallel computation or vectorization, a new and effective explicit approach for linear complementarity formulations involving a conjugate gradient based projection methodology is proposed in this study for contact problems with Coulomb friction. The overall objectives are focussed towards providing an explicit methodology of computation for the complete contact problem with friction. In this regard, the primary idea for solving the linear complementarity formulations stems from an established search direction which is projected onto a feasible region determined by the non-negative constraint condition; this direction is then applied to the Fletcher-Reeves conjugate gradient method, resulting in a powerful explicit methodology which possesses high accuracy, excellent convergence characteristics, fast computational speed and is relatively simple to implement for contact problems involving Coulomb friction.
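A hedged numerical sketch of the projection idea on a bound-constrained quadratic (min 0.5 x'Ax - b'x subject to x >= 0), rather than the paper's full frictional-contact formulation: a Fletcher-Reeves conjugate direction is built, a step is taken, and the iterate is projected back onto the feasible region.

    import numpy as np

    # Projected Fletcher-Reeves iteration; the projection enforces the
    # non-negativity constraint after each step, and the direction is
    # periodically restarted to the steepest descent direction.
    def projected_fr(A, b, iters=200):
        x = np.zeros(len(b))
        g = A @ x - b
        d = -g
        for k in range(iters):
            alpha = (g @ g) / max(d @ A @ d, 1e-30)
            x = np.maximum(x + alpha * d, 0.0)     # project onto x >= 0
            g_new = A @ x - b
            if k % len(b) == len(b) - 1:
                d = -g_new                          # periodic restart
            else:
                beta = (g_new @ g_new) / max(g @ g, 1e-30)   # Fletcher-Reeves
                d = -g_new + beta * d
            g = g_new
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, -2.0])
    print(projected_fr(A, b))   # approx. [0.25, 0]: the bound is active in x2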
A phase-based stereo vision system-on-a-chip.
Díaz, Javier; Ros, Eduardo; Sabatini, Silvio P; Solari, Fabio; Mota, Sonia
2007-02-01
A simple and fast technique for depth estimation based on phase measurement has been adopted for the implementation of a real-time stereo system with sub-pixel resolution on an FPGA device. The technique avoids the attendant problem of phase wrapping. The designed system takes full advantage of the inherent processing parallelism and segmentation capabilities of FPGA devices to achieve a computation speed of 65 megapixels/s, which can be arranged with a customized frame-grabber module to process 211 frames/s at a size of 640x480 pixels. The processing speed achieved is higher than conventional camera frame rates, thus allowing the system to extract multiple estimations and be used as a platform to evaluate integration schemes of a population of neurons without increasing hardware resource demands.
Integration of perception and reasoning in fast neural modules
NASA Technical Reports Server (NTRS)
Fritz, David G.
1989-01-01
Artificial neural systems promise to integrate symbolic and sub-symbolic processing to achieve real time control of physical systems. Two potential alternatives exist. In one, neural nets can be used to front-end expert systems. The expert systems, in turn, are developed with varying degrees of parallelism, including their implementation in neural nets. In the other, rule-based reasoning and sensor data can be integrated within a single hybrid neural system. The hybrid system reacts as a unit to provide decisions (problem solutions) based on the simultaneous evaluation of data and rules. Discussed here is a model hybrid system based on the fuzzy cognitive map (FCM). The operation of the model is illustrated with the control of a hypothetical satellite that intelligently alters its attitude in space in response to an intersecting micrometeorite shower.
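The FCM inference step itself is compact: concept activations are multiplied by a signed causal weight matrix and passed through a squashing function until a fixed point (the decision) is reached. A minimal Python sketch with an invented three-concept weight matrix, purely illustrative of the update rule rather than of the paper's satellite model:

    import numpy as np

    # One synchronous FCM update: new state = squash(W @ state).
    def fcm_step(state, W, lam=1.0):
        return 1.0 / (1.0 + np.exp(-lam * (W @ state)))   # logistic squashing

    W = np.array([[0.0, -0.7, 0.4],    # hypothetical causal edges among three
                  [0.6,  0.0, 0.0],    # concepts (e.g. shower detected,
                  [0.0,  0.5, 0.0]])   # attitude error, reorient command)
    s = np.array([1.0, 0.0, 0.0])
    for _ in range(10):
        s = fcm_step(s, W)
    print(s)   # settles toward a fixed point encoding the inferred response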
Collisionless stellar hydrodynamics as an efficient alternative to N-body methods
NASA Astrophysics Data System (ADS)
Mitchell, Nigel L.; Vorobyov, Eduard I.; Hensler, Gerhard
2013-01-01
The dominant constituents of the Universe's matter are believed to be collisionless in nature and thus their modelling in any self-consistent simulation is extremely important. For simulations that deal only with dark matter or stellar systems, the conventional N-body technique is fast, memory efficient and relatively simple to implement. However when extending simulations to include the effects of gas physics, mesh codes are at a distinct disadvantage compared to Smooth Particle Hydrodynamics (SPH) codes. Whereas implementing the N-body approach into SPH codes is fairly trivial, the particle-mesh technique used in mesh codes to couple collisionless stars and dark matter to the gas on the mesh has a series of significant scientific and technical limitations. These include spurious entropy generation resulting from discreteness effects, poor load balancing and increased communication overhead which spoil the excellent scaling in massively parallel grid codes. In this paper we propose the use of the collisionless Boltzmann moment equations as a means to model the collisionless material as a fluid on the mesh, implementing it into the massively parallel FLASH Adaptive Mesh Refinement (AMR) code. This approach which we term `collisionless stellar hydrodynamics' enables us to do away with the particle-mesh approach and since the parallelization scheme is identical to that used for the hydrodynamics, it preserves the excellent scaling of the FLASH code already demonstrated on peta-flop machines. We find that the classic hydrodynamic equations and the Boltzmann moment equations can be reconciled under specific conditions, allowing us to generate analytic solutions for collisionless systems using conventional test problems. We confirm the validity of our approach using a suite of demanding test problems, including the use of a modified Sod shock test. By deriving the relevant eigenvalues and eigenvectors of the Boltzmann moment equations, we are able to use high order accurate characteristic tracing methods with Riemann solvers to generate numerical solutions which show excellent agreement with our analytic solutions. We conclude by demonstrating the ability of our code to model complex phenomena by simulating the evolution of a two-armed spiral galaxy whose properties agree with those predicted by the swing amplification theory.
NASA Astrophysics Data System (ADS)
Olive, Jean-Arthur; Pearce, Frederick; Rondenay, Stéphane; Behn, Mark D.
2014-04-01
Many subduction zones exhibit significant retrograde motion of their arc and trench. The observation of fast shear-wave velocities parallel to the trench in such settings has been inferred to represent trench-parallel mantle flow beneath a retreating slab. Here, we investigate this process by measuring seismic anisotropy in the shallow Aegean mantle. We carry out shear-wave splitting analysis on a dense array of seismometers across the Western Hellenic Subduction Zone, and find a pronounced zonation of anisotropy at the scale of the subduction zone. Fast SKS splitting directions subparallel to the trench-retreat direction dominate the region nearest to the trench. Fast splitting directions abruptly transition to trench-parallel above the corner of the mantle wedge, and rotate back to trench-normal over the back-arc. We argue that the trench-normal anisotropy near the trench is explained by entrainment of an asthenospheric layer beneath the shallow-dipping portion of the slab. Toward the volcanic arc this signature is overprinted by trench-parallel anisotropy in the mantle wedge, likely caused by a layer of strained serpentine immediately above the slab. Arcward steepening of the slab and horizontal divergence of mantle flow due to rollback may generate an additional component of sub-slab trench-parallel anisotropy in this region. Poloidal flow above the retreating slab is likely the dominant source of back-arc trench-normal anisotropy. We hypothesize that trench-normal anisotropy associated with significant entrainment of the asthenospheric mantle near the trench may be widespread but only observable at shallow-dipping subduction zones where stations nearest the trench do not overlie the mantle wedge.
NASA Astrophysics Data System (ADS)
Bellerby, Tim
2014-05-01
Model Integration System (MIST) is an open-source environmental modelling programming language that directly incorporates data parallelism. The language is designed to enable straightforward programming structures, such as nested loops and conditional statements, to be directly translated into sequences of whole-array (or more generally whole-data-structure) operations. MIST thus enables the programmer to use well-understood constructs, directly relating to the mathematical structure of the model, without having to explicitly vectorize code or worry about details of parallelization. A range of common modelling operations are supported by dedicated language structures operating on cell neighbourhoods rather than individual cells (e.g. the 3x3 local neighbourhood needed to implement an averaging image filter can be simply accessed from within a simple loop traversing all image pixels). This facility hides details of inter-process communication behind more mathematically relevant descriptions of model dynamics. The MIST automatic vectorization/parallelization process serves both to distribute work among available nodes and separately to control storage requirements for intermediate expressions, enabling operations on very large domains for which memory availability may be an issue. MIST is designed to facilitate efficient interpreter-based implementations. A prototype open-source interpreter is available, coded in standard FORTRAN 95, with tools to rapidly integrate existing FORTRAN 77 or 95 code libraries. The language is formally specified and thus not limited to FORTRAN implementation or to an interpreter-based approach. A MIST-to-FORTRAN compiler is under development and volunteers are sought to create an ANSI-C implementation. Parallel processing is currently implemented using OpenMP. However, the parallelization code is fully modularised and could be replaced with implementations using other libraries. GPU implementation is potentially possible.
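The abstract's averaging-filter example shows the translation MIST automates; it can be imitated in numpy (illustrative, not MIST syntax), writing the filter once as nested pixel loops and once as a single whole-array expression:

    import numpy as np

    img = np.random.rand(256, 256)

    # Nested loops over interior pixels: the style the modeller writes.
    out = np.empty((254, 254))
    for i in range(1, 255):
        for j in range(1, 255):
            out[i - 1, j - 1] = img[i - 1:i + 2, j - 1:j + 2].mean()

    # Equivalent whole-array form: the style MIST translates to, which
    # vectorizes and parallelizes naturally (average of 9 shifted views).
    shifted = [img[1 + di:255 + di, 1 + dj:255 + dj]
               for di in (-1, 0, 1) for dj in (-1, 0, 1)]
    out_vec = sum(shifted) / 9.0
    print(np.allclose(out, out_vec))   # True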
Engine-start Control Strategy of P2 Parallel Hybrid Electric Vehicle
NASA Astrophysics Data System (ADS)
Xiangyang, Xu; Siqi, Zhao; Peng, Dong
2017-12-01
A smooth and fast engine-start process is important to parallel hybrid electric vehicles with an electric motor mounted in front of the transmission. However, there are some challenges during engine-start control. First, the electric motor must simultaneously provide a stable driving torque to ensure drivability and a compensating torque to drag the engine before ignition. Second, engine-start time is a trade-off control objective because both fast start and smooth start have to be considered. To solve these problems, this paper first analyzed the resistance of the engine-start process and established a physical model in MATLAB/Simulink. Then a model-based coordinated control strategy among engine, motor and clutch was developed. Two basic control strategies, one for the fast-start process and one for the smooth-start process, were studied. Simulation results showed that the control objectives were realized by applying the given control strategies, which can meet different driver requirements.
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF), making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning-fast Big Data technology called SciSpark based on Apache Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based Apache Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd-generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesoscale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URLs (i.e. using OPeNDAP model.nc?varname, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dim. array libraries (scala breeze, java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems. The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDFs, and first metrics quantifying parallel speedups and memory & disk usage.
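The ingest-and-reduce pipeline described above maps naturally onto Spark's RDD API. A minimal pyspark sketch, assuming a working Spark installation and using trivial stand-in functions in place of real HDF/netCDF loading and model-to-observation comparison (this is not SciSpark's sRDD code):

    from pyspark import SparkContext

    # Partition time-sorted chunk identifiers across the cluster, load
    # each chunk into node memory once, compute a per-chunk statistic,
    # and collect the results on the driver.
    def load_chunk(t):
        return [float(t + i) for i in range(100)]   # stand-in for an array pull

    def chunk_mean(arr):
        return sum(arr) / len(arr)

    sc = SparkContext(appName="scispark-sketch")
    means = (sc.parallelize(range(1000), 100)       # 100 partitions of chunks
               .map(load_chunk)
               .map(chunk_mean)
               .collect())
    print(len(means))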
Segmentation of remotely sensed data using parallel region growing
NASA Technical Reports Server (NTRS)
Tilton, J. C.; Cox, S. C.
1983-01-01
The improved spatial resolution of the new earth resources satellites will increase the need for effective utilization of spatial information in machine processing of remotely sensed data. One promising technique is scene segmentation by region growing. Region growing can use spatial information in two ways: only spatially adjacent regions merge together, and merging criteria can be based on region-wide spatial features. A simple region growing approach is described in which the similarity criterion is based on region mean and variance (a simple spatial feature). An effective way to implement region growing for remote sensing is as an iterative parallel process on a large parallel processor. A straightforward parallel pixel-based implementation of the algorithm is explored and its efficiency is compared with sequential pixel-based, sequential region-based, and parallel region-based implementations. Experimental results from an aircraft scanner data set are presented, as is a discussion of proposed improvements to the segmentation algorithm.
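A toy version of the stated similarity criterion, with the threshold and pooling rule invented for illustration: two adjacent regions merge when their means agree relative to the pooled standard deviation.

    import numpy as np

    # Merge test for two adjacent regions based on region mean and a
    # simple spatial feature (variance), per the criterion sketched above.
    def should_merge(pix_a, pix_b, t=2.0):
        a, b = np.asarray(pix_a, float), np.asarray(pix_b, float)
        pooled = np.sqrt((a.var() * a.size + b.var() * b.size) / (a.size + b.size))
        return abs(a.mean() - b.mean()) < t * max(pooled, 1e-12)

    print(should_merge([10, 11, 12], [11, 12, 13]))   # True: similar regions
    print(should_merge([10, 11, 12], [40, 41, 42]))   # False: distinct regions

In the parallel pixel-based formulation, a test of this kind is evaluated for many adjacent region pairs simultaneously on the parallel processor.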
NASA Technical Reports Server (NTRS)
Hall, Lawrence O.; Bennett, Bonnie H.; Tello, Ivan
1994-01-01
A parallel version of CLIPS 5.1 has been developed to run on Intel Hypercubes. The user interface is the same as that for CLIPS, with some added commands to allow for parallel calls. A complete version of CLIPS runs on each node of the hypercube. The system has been instrumented to display the time spent in the match, recognize, and act cycles on each node. Only rule-level parallelism is supported. Parallel commands enable the assertion and retraction of facts to/from remote nodes' working memory. Parallel CLIPS was used to implement a knowledge-based command, control, communications, and intelligence (C(sup 3)I) system to demonstrate the fusion of high-level, disparate sources. We discuss the nature of the information fusion problem, our approach, and implementation. Parallel CLIPS has also been used to run several benchmark parallel knowledge bases, such as one to set up a cafeteria. Results from running Parallel CLIPS with parallel knowledge-base partitions indicate that significant speed increases, including superlinear speedups in some cases, are possible.
Performance evaluation of canny edge detection on a tiled multicore architecture
NASA Astrophysics Data System (ADS)
Brethorst, Andrew Z.; Desai, Nehal; Enright, Douglas P.; Scrofano, Ronald
2011-01-01
In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. Included in these strategies are different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP, and it is capable of parallelization across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that, for the same number of threads, programmer-implemented domain decomposition exhibits higher speed-ups than the compiler-managed loop-level parallelism implemented with OpenMP.
Dynamic overset grid communication on distributed memory parallel processors
NASA Technical Reports Server (NTRS)
Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.
1993-01-01
A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.
Optimisation of a parallel ocean general circulation model
NASA Astrophysics Data System (ADS)
Beare, M. I.; Stevens, D. P.
1997-10-01
This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.
Adaptive multiple super fast simulated annealing for stochastic microstructure reconstruction
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ryu, Seun; Lin, Guang; Sun, Xin
2013-01-01
Fast image reconstruction from statistical information is critical in image fusion from multimodality chemical imaging instrumentation to create a high-resolution image over a large domain. Stochastic methods have been used widely in image reconstruction from the two-point correlation function. The main challenge is to increase the efficiency of reconstruction. A novel simulated annealing method is proposed for fast solution of image reconstruction. Combining the advantages of very fast cooling schedules, dynamic adaptation, and parallelization, the new simulated annealing algorithm increases efficiency by several orders of magnitude, making large-domain image fusion feasible.
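A generic sketch of simulated annealing with an exponential ("very fast") cooling schedule follows; the energy and proposal functions, and the fixed decay factor (the paper adapts its schedule dynamically), are placeholders rather than the authors' method.

    import math, random

    def anneal(energy, propose, state, t0=1.0, decay=0.999, steps=100_000):
        """Metropolis acceptance with exponentially decaying temperature;
        an adaptive variant would tune `decay` from the acceptance rate."""
        e = energy(state)
        t = t0
        for _ in range(steps):
            cand = propose(state)
            de = energy(cand) - e
            if de < 0 or random.random() < math.exp(-de / t):
                state, e = cand, e + de
            t *= decay
        return state, e

Running several such chains with different seeds and keeping the best result is the simplest form of the parallelization the abstract mentions.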
ERIC Educational Resources Information Center
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
Parallel tempering for the traveling salesman problem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Percus, Allon; Wang, Richard; Hyman, Jeffrey
We explore the potential of parallel tempering as a combinatorial optimization method, applying it to the traveling salesman problem. We compare simulation results of parallel tempering with a benchmark implementation of simulated annealing, and study how different choices of parameters affect the relative performance of the two methods. We find that a straightforward implementation of parallel tempering can outperform simulated annealing in several crucial respects. When parameters are chosen appropriately, both methods yield close approximation to the actual minimum distance for an instance with 200 nodes. However, parallel tempering yields more consistently accurate results when a series of independent simulations are performed. Our results suggest that parallel tempering might offer a simple but powerful alternative to simulated annealing for combinatorial optimization problems.
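A compact sketch of replica-exchange (parallel tempering) for the TSP is given below; the move set (2-opt reversals), temperature ladder, and swap frequency are illustrative choices, not the authors' parameters.

    import math, random

    def tour_length(tour, dist):
        return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
                   for i in range(len(tour)))

    def parallel_tempering(dist, temps, sweeps=2000):
        """One tour per temperature; Metropolis moves within each
        replica, plus occasional swaps between neighboring levels."""
        n = len(dist)
        tours = [random.sample(range(n), n) for _ in temps]
        lens = [tour_length(t, dist) for t in tours]
        for _ in range(sweeps):
            for k, t in enumerate(temps):
                i, j = sorted(random.sample(range(n), 2))
                cand = tours[k][:i] + tours[k][i:j][::-1] + tours[k][j:]
                dl = tour_length(cand, dist) - lens[k]
                if dl < 0 or random.random() < math.exp(-dl / t):
                    tours[k], lens[k] = cand, lens[k] + dl
            k = random.randrange(len(temps) - 1)   # try one replica swap
            d = (1 / temps[k] - 1 / temps[k + 1]) * (lens[k] - lens[k + 1])
            if d >= 0 or random.random() < math.exp(d):
                tours[k], tours[k + 1] = tours[k + 1], tours[k]
                lens[k], lens[k + 1] = lens[k + 1], lens[k]
        best = min(range(len(temps)), key=lambda k: lens[k])
        return tours[best], lens[best]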
Are Fast Radio Bursts the Birthmark of Magnetars?
NASA Astrophysics Data System (ADS)
Lieu, Richard
2017-01-01
A model of fast radio bursts, which enlists young, short-period extragalactic magnetars satisfying B/P > 2 × 10¹⁶ G s⁻¹ (1 G = 1 statvolt cm⁻¹) as the source, is proposed. When the parallel component E∥ of the surface electric field (under the scenario of a vacuum magnetosphere) of such pulsars approaches 5% of the critical field E_c = m_e²c³/(eℏ) in strength, the field can readily decay via the Schwinger mechanism into electron-positron pairs, the back reaction of which causes E∥ to oscillate on a characteristic timescale smaller than the development of a spark gap. Thus, under this scenario, the open field line region of the pulsar magnetosphere is controlled by Schwinger pairs, and their large creation and acceleration rates enable the escaping pairs to coherently emit radio waves directly from the polar cap. The majority of the energy is emitted at frequencies ≲1 GHz, where the coherent radiation has the highest yield, at a rate large enough to cause the magnetar to lose spin significantly over a timescale of a few × 10⁻³ s, the duration of a fast radio burst. Owing to the circumstellar environment of a young magnetar, however, the ≲1 GHz radiation is likely to be absorbed or reflected by the overlying matter. It is shown that the brightness of the remaining (observable) frequencies of ≈1 GHz and above is on a par with a typical fast radio burst. Unless some spin-up mechanism is available to recover the original high rotation rate that triggered the Schwinger mechanism, the fast radio burst will not recur in the same magnetar.
Components of action potential repolarization in cerebellar parallel fibres.
Pekala, Dobromila; Baginskas, Armantas; Szkudlarek, Hanna J; Raastad, Morten
2014-11-15
Repolarization of the presynaptic action potential is essential for transmitter release, excitability and energy expenditure. Little is known about repolarization in thin, unmyelinated axons forming en passant synapses, which represent the most common type of axons in the mammalian brain's grey matter. We used rat cerebellar parallel fibres, an example of typical grey matter axons, to investigate the effects of K⁺ channel blockers on repolarization. We show that repolarization is composed of a fast tetraethylammonium (TEA)-sensitive component, determining the width and amplitude of the spike, and a slow margatoxin (MgTX)-sensitive depolarized after-potential (DAP). These two components could be recorded at the granule cell soma as antidromic action potentials and from the axons with a newly developed miniaturized grease-gap method. A considerable proportion of fast repolarization remained in the presence of TEA, MgTX, or both. This residual was abolished by the addition of quinine. The importance of proper control of fast repolarization was demonstrated by somatic recordings of antidromic action potentials. In these experiments, the relatively broad K⁺ channel blocker 4-aminopyridine reduced the fast repolarization, resulting in bursts of action potentials forming on top of the DAP. We conclude that repolarization of the action potential in parallel fibres is supported by at least three groups of K⁺ channels. Differences in their temporal profiles allow relatively independent control of the spike and the DAP, whereas overlap of their temporal profiles provides robust control of axonal bursting properties.
Efficient Scalable Median Filtering Using Histogram-Based Operations.
Green, Oded
2018-05-01
Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted values. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow. This makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is non-sorting-based. The new algorithm uses efficient histogram-based operations. These reduce the computational requirements of the new algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation has near-perfect linear scaling on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, comparison-based approaches are preferable as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.
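The core idea, a sliding histogram in the style of Huang's classic algorithm, can be sketched for 8-bit images as follows; this serial version only illustrates why no sorting is needed and is not the paper's parallel algorithm.

    import numpy as np

    def median_filter_hist(img, k=5):
        """k x k median filter via a 256-bin sliding histogram:
        moving one column only removes one pixel column and adds
        another, so no window is ever sorted. Assumes uint8 input."""
        r = k // 2
        pad = np.pad(img, r, mode="edge")
        out = np.empty_like(img)
        rank = (k * k) // 2 + 1              # rank of the median value
        for y in range(img.shape[0]):
            hist = np.zeros(256, dtype=int)
            np.add.at(hist, pad[y:y + k, 0:k].ravel(), 1)
            out[y, 0] = np.searchsorted(hist.cumsum(), rank)
            for x in range(1, img.shape[1]):
                np.add.at(hist, pad[y:y + k, x - 1], -1)     # column out
                np.add.at(hist, pad[y:y + k, x + k - 1], 1)  # column in
                out[y, x] = np.searchsorted(hist.cumsum(), rank)
        return out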
Dröge, J.; Gregor, I.; McHardy, A. C.
2015-01-01
Motivation: Metagenomics characterizes microbial communities by random shotgun sequencing of DNA isolated directly from an environment of interest. An essential step in computational metagenome analysis is taxonomic sequence assignment, which allows identifying the sequenced community members and reconstructing taxonomic bins with sequence data for the individual taxa. For the massive datasets generated by next-generation sequencing technologies, this cannot be performed with de-novo phylogenetic inference methods. We describe an algorithm and the accompanying software, taxator-tk, which performs taxonomic sequence assignment by fast approximate determination of evolutionary neighbors from sequence similarities. Results: Taxator-tk was precise in its taxonomic assignment across all ranks and taxa for a range of evolutionary distances and for short as well as for long sequences. In addition to the taxonomic binning of metagenomes, it is well suited for profiling microbial communities from metagenome samples because it identifies bacterial, archaeal and eukaryotic community members without being affected by varying primer binding strengths, as in marker gene amplification, or copy number variations of marker genes across different taxa. Taxator-tk has an efficient, parallelized implementation that allows the assignment of 6 Gb of sequence data per day on a standard multiprocessor system with 10 CPU cores and microbial RefSeq as the genomic reference data. Availability and implementation: Taxator-tk source and binary program files are publicly available at http://algbio.cs.uni-duesseldorf.de/software/. Contact: Alice.McHardy@uni-duesseldorf.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25388150
Parallel processing in the honeybee olfactory pathway: structure, function, and evolution.
Rössler, Wolfgang; Brill, Martin F
2013-11-01
Animals face highly complex and dynamic olfactory stimuli in their natural environments, which require fast and reliable olfactory processing. Parallel processing is a common principle of sensory systems supporting this task, for example in visual and auditory systems, but its role in olfaction remained unclear. Studies in the honeybee focused on a dual olfactory pathway. Two sets of projection neurons connect glomeruli in two antennal-lobe hemilobes via lateral and medial tracts in opposite sequence with the mushroom bodies and lateral horn. Comparative studies suggest that this dual-tract circuit represents a unique adaptation in Hymenoptera. Imaging studies indicate that glomeruli in both hemilobes receive redundant sensory input. Recent simultaneous multi-unit recordings from projection neurons of both tracts revealed widely overlapping response profiles strongly indicating parallel olfactory processing. Whereas lateral-tract neurons respond fast with broad (generalistic) profiles, medial-tract neurons are odorant specific and respond slower. In analogy to "what-" and "where" subsystems in visual pathways, this suggests two parallel olfactory subsystems providing "what-" (quality) and "when" (temporal) information. Temporal response properties may support across-tract coincidence coding in higher centers. Parallel olfactory processing likely enhances perception of complex odorant mixtures to decode the diverse and dynamic olfactory world of a social insect.
An 81.6 μW FastICA processor for epileptic seizure detection.
Yang, Chia-Hsiang; Shih, Yi-Hsin; Chiueh, Herming
2015-02-01
To improve the performance of epileptic seizure detection, independent component analysis (ICA) is applied to multi-channel signals to separate artifacts and signals of interest. FastICA is an efficient algorithm to compute ICA. To reduce the energy dissipation, eigenvalue decomposition (EVD) is utilized in the preprocessing stage to reduce the convergence time of iterative calculation of ICA components. EVD is computed efficiently through an array structure of processing elements running in parallel. An area-efficient EVD architecture is realized by leveraging the approximate Jacobi algorithm, leading to a 77.2% area reduction. By choosing a proper memory element and reduced wordlength, the power and area of the storage memory are reduced by 95.6% and 51.7%, respectively. The chip area is minimized through fixed-point implementation and architectural transformations. Given a latency constraint of 0.1 s, an 86.5% area reduction is achieved compared to the direct-mapped architecture. Fabricated in 90 nm CMOS, the core area of the chip is 0.40 mm(2). The FastICA processor, part of an integrated epileptic control SoC, dissipates 81.6 μW at 0.32 V. The computation delay of a frame of 256 samples for 8 channels is 84.2 ms. Compared to prior work, 0.5% power dissipation, 26.7% silicon area, and a 3.4× computation speedup are achieved. The performance of the chip was verified with a human dataset.
2015-06-01
...efficient parallel code for applying the operator. Our method constructs a polynomial preconditioner using a nonlinear least squares (NLLS) algorithm. Such a preconditioner can be very attractive in scenarios where one must repeatedly solve a large system of linear equations for which one has an extremely fast parallel code for applying the underlying fixed linear operator.
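Whatever the construction, applying a polynomial preconditioner only requires repeated operator application; the following generic Horner-style sketch (not the NLLS construction from the abstract) shows why a fast parallel apply-operator routine is the only ingredient needed.

    import numpy as np

    def apply_poly_preconditioner(apply_A, coeffs, v):
        """Evaluate p(A) v by Horner's rule, where p approximates
        A^{-1}; `coeffs` run from highest to lowest degree and
        `apply_A` is the user's fast (possibly parallel) operator."""
        y = coeffs[0] * v
        for c in coeffs[1:]:
            y = apply_A(y) + c * v
        return y

Wrapped in a scipy.sparse.linalg.LinearOperator, such a p(A) can serve as the preconditioner argument of an iterative Krylov solver.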
Chen, Weiliang; De Schutter, Erik
2017-01-01
Stochastic, spatial reaction-diffusion simulations have been widely used in systems biology and computational neuroscience. However, the increasing scale and complexity of models and morphologies have exceeded the capacity of any serial implementation. This led to the development of parallel solutions that benefit from the boost in performance of modern supercomputers. In this paper, we describe an MPI-based, parallel operator-splitting implementation for stochastic spatial reaction-diffusion simulations with irregular tetrahedral meshes. The performance of our implementation is first examined and analyzed with simulations of a simple model. We then demonstrate its application to real-world research by simulating the reaction-diffusion components of a published calcium burst model in both Purkinje neuron sub-branch and full dendrite morphologies. Simulation results indicate that our implementation is capable of achieving super-linear speedup for balanced loading simulations with reasonable molecule density and mesh quality. In the best scenario, a parallel simulation with 2,000 processes runs more than 3,600 times faster than its serial SSA counterpart, and achieves more than 20-fold speedup relative to parallel simulation with 100 processes. In a more realistic scenario with dynamic calcium influx and data recording, the parallel simulation with 1,000 processes and no load balancing is still 500 times faster than the conventional serial SSA simulation. PMID:28239346
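For orientation, a deterministic toy version of operator splitting for a 1-D reaction-diffusion equation is sketched below; the paper's solver instead applies stochastic (SSA-style) reaction and diffusion operators on tetrahedral meshes, with one mesh partition per MPI process.

    import numpy as np

    def split_step(u, dt, dx, D, react):
        """First-order splitting for du/dt = D*u_xx + R(u):
        an explicit diffusion substep, then a pointwise reaction
        substep (periodic boundaries for brevity)."""
        lap = (np.roll(u, 1) - 2 * u + np.roll(u, -1)) / dx**2
        u = u + dt * D * lap          # diffusion operator
        u = u + dt * react(u)         # reaction operator
        return u

    u = np.random.rand(256)
    for _ in range(1000):
        u = split_step(u, dt=1e-4, dx=0.01, D=0.1, react=lambda v: v * (1 - v))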
PISCES: An environment for parallel scientific computation
NASA Technical Reports Server (NTRS)
Pratt, T. W.
1985-01-01
The parallel implementation of scientific computing environment (PISCES) is a project to provide high-level programming environments for parallel MIMD computers. Pisces 1, the first of these environments, is a FORTRAN 77 based environment which runs under the UNIX operating system. The Pisces 1 user programs in Pisces FORTRAN, an extension of FORTRAN 77 for parallel processing. The major emphasis in the Pisces 1 design is on providing a carefully specified virtual machine that defines the run-time environment within which Pisces FORTRAN programs are executed. Each implementation then provides the same virtual machine, regardless of differences in the underlying architecture. The design is intended to be portable to a variety of architectures. Currently Pisces 1 is implemented on a network of Apollo workstations and on a DEC VAX uniprocessor via simulation of the task-level parallelism. An implementation for the Flexible Computing Corp. FLEX/32 is under construction. An introduction to the Pisces 1 virtual computer and the FORTRAN 77 extensions is presented. An example of an algorithm for the iterative solution of a system of equations is given. The most notable features of the design are the provision for several granularities of parallelism in programs and the provision of a window mechanism for distributed access to large arrays of data.
NASA Astrophysics Data System (ADS)
Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.
2017-07-01
Calculation of matrix-vector multiplication in real-world problems often involves large matrices of arbitrary size. Therefore, parallelization is needed to speed up the calculation process, which usually takes a long time. Graph partitioning techniques that have been discussed in previous studies cannot be used to parallelize matrix-vector multiplication for matrices of arbitrary size, because graph partitioning assumes a square, symmetric matrix. Hypergraph partitioning techniques overcome this shortcoming of graph partitioning. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented on the GPU (graphics processing unit).
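The kernel being partitioned is the sparse matrix-vector product; a serial CSR sketch is shown below, with names and layout assumed rather than taken from the paper. On a GPU, the row loop is distributed across threads, and a hypergraph partitioner assigns rows so that the x-vector entries each partition touches, and hence the communication, are minimized.

    import numpy as np

    def spmv_csr(indptr, indices, data, x):
        """y = A @ x for a CSR matrix of any (possibly non-square)
        shape; each row's dot product is independent work."""
        y = np.zeros(len(indptr) - 1, dtype=data.dtype)
        for row in range(len(y)):
            s, e = indptr[row], indptr[row + 1]
            y[row] = np.dot(data[s:e], x[indices[s:e]])
        return y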
Portable parallel portfolio optimization in the Aurora Financial Management System
NASA Astrophysics Data System (ADS)
Laure, Erwin; Moritsch, Hans
2001-07-01
Financial planning problems are formulated as large scale, stochastic, multiperiod, tree structured optimization problems. An efficient technique for solving this kind of problem is the nested Benders decomposition method. In this paper we present a parallel, portable, asynchronous implementation of this technique. To achieve our portability goals we elected the programming language Java for our implementation and used a high-level Java based framework, called OpusJava, for expressing the parallelism potential as well as synchronization constraints. Our implementation is embedded within a modular decision support tool for portfolio and asset liability management, the Aurora Financial Management System.
Fast Numerical Solution of the Plasma Response Matrix for Real-time Ideal MHD Control
DOE Office of Scientific and Technical Information (OSTI.GOV)
Glasser, Alexander; Kolemen, Egemen; Glasser, Alan H.
2018-03-26
To help effectuate near real-time feedback control of ideal MHD instabilities in tokamak geometries, a parallelized version of A.H. Glasser's DCON (Direct Criterion of Newcomb) code is developed. To motivate the numerical implementation, we first solve DCON's δW formulation with a Hamilton-Jacobi theory, elucidating analytical and numerical features of the ideal MHD stability problem. The plasma response matrix is demonstrated to be the solution of an ideal MHD Riccati equation. We then describe our adaptation of DCON with numerical methods natural to solutions of the Riccati equation, parallelizing it to enable its operation in near real-time. We replace DCON's serial integration of perturbed modes—which satisfy a singular Euler-Lagrange equation—with a domain-decomposed integration of state transition matrices. Output is shown to match results from DCON with high accuracy, and with computation time < 1 s. Such computational speed may enable active feedback ideal MHD stability control, especially in plasmas whose ideal MHD equilibria evolve with inductive timescale τ ≳ 1 s—as in ITER. Further potential applications of this theory are discussed.
NASA Astrophysics Data System (ADS)
von Zanthier, Christoph; Holl, Peter; Kemmer, Josef; Lechner, Peter; Maier, B.; Soltau, Heike; Stoetter, R.; Braeuninger, Heinrich W.; Dennerl, Konrad; Haberl, Frank; Hartmann, R.; Hartner, Gisela D.; Hippmann, H.; Kastelic, E.; Kink, W.; Krause, N.; Meidinger, Norbert; Metzner, G.; Pfeffermann, Elmar; Popp, M.; Reppin, Claus; Stoetter, Diana; Strueder, Lothar; Truemper, Joachim; Weber, U.; Carathanassis, D.; Engelhard, S.; Gebhart, Th.; Hauff, D.; Lutz, G.; Richter, R. H.; Seitz, H.; Solc, P.; Bihler, Edgar; Boettcher, H.; Kendziorra, Eckhard; Kraemer, J.; Pflueger, Bernhard; Staubert, Ruediger
1998-04-01
The concept and performance of the fully depleted pn-junction CCD system, developed for the European XMM and the German ABRIXAS satellite missions for soft x-ray imaging and spectroscopy in the 0.1 keV to 15 keV photon range, is presented. The 58 mm × 60 mm pn-CCD array uses pn-junctions for registers and for the backside instead of MOS registers. This concept naturally allows the detector volume to be fully depleted, making it an efficient detector for photons with energies up to 15 keV. For high detection efficiency in the soft x-ray region down to 100 eV, an ultrathin pn-CCD backside deadlayer has been realized. Each pn-CCD channel is equipped with an on-chip JFET amplifier which, in combination with the CAMEX-amplifier and multiplexing chip, facilitates parallel readout with a pixel read rate of 3 MHz and an electronic noise floor of ENC < e-. With the complete parallel readout, very fast pn-CCD readout modes can be implemented in the system which allow for high resolution photon spectroscopy of even the brightest x-ray sources in the sky.
A parallel and modular deformable cell Car-Parrinello code
NASA Astrophysics Data System (ADS)
Cavazzoni, Carlo; Chiarotti, Guido L.
1999-12-01
We have developed a modular parallel code implementing the Car-Parrinello [Phys. Rev. Lett. 55 (1985) 2471] algorithm including the variable cell dynamics [Europhys. Lett. 36 (1994) 345; J. Phys. Chem. Solids 56 (1995) 510]. Our code is written in Fortran 90, and makes use of some new programming concepts like encapsulation, data abstraction and data hiding. The code has a multi-layer hierarchical structure with tree-like dependencies among modules. The modules include not only the variables but also the methods acting on them, in an object-oriented fashion. The modular structure allows easier code maintenance, development and debugging, and is suitable for a developer team. The layer structure permits high portability. The code displays an almost linear speed-up over a wide range of processor counts, independently of the architecture. Super-linear speed-up is obtained with a "smart" Fast Fourier Transform (FFT) that uses the available memory on the single node (increasing for a fixed problem with the number of processing elements) as a temporary buffer to store wave function transforms. This code has been used to simulate water and ammonia at giant planet conditions for systems as large as 64 molecules for ~50 ps.
PELE web server: atomistic study of biomolecular systems at your fingertips.
Madadkar-Sobhani, Armin; Guallar, Victor
2013-07-01
PELE, Protein Energy Landscape Exploration, our novel technology based on protein structure prediction algorithms and a Monte Carlo sampling, is capable of modelling the all-atom protein-ligand dynamical interactions in an efficient and fast manner, with two orders of magnitude reduced computational cost when compared with traditional molecular dynamics techniques. PELE's heuristic approach generates trial moves based on protein and ligand perturbations followed by side chain sampling and global/local minimization. The collection of accepted steps forms a stochastic trajectory. Furthermore, several processors may be run in parallel towards a collective goal or defining several independent trajectories; the whole procedure has been parallelized using the Message Passing Interface. Here, we introduce the PELE web server, designed to make the whole process of running simulations easier and more practical by minimizing input file demand, providing user-friendly interface and producing abstract outputs (e.g. interactive graphs and tables). The web server has been implemented in C++ using Wt (http://www.webtoolkit.eu) and MySQL (http://www.mysql.com). The PELE web server, accessible at http://pele.bsc.es, is free and open to all users with no login requirement.
Architecture of the parallel hierarchical network for fast image recognition
NASA Astrophysics Data System (ADS)
Timchenko, Leonid; Wójcik, Waldemar; Kokriatskaia, Natalia; Kutaev, Yuriy; Ivasyuk, Igor; Kotyra, Andrzej; Smailova, Saule
2016-09-01
Multistage integration of visual information in the brain allows humans to respond quickly to most significant stimuli while maintaining the ability to recognize small details in the image. Implementation of this principle in technical systems can lead to more efficient processing procedures. The multistage approach to image processing includes the main types of cortical multistage convergence. The input images are mapped into a flexible hierarchy that reflects the complexity of the image data. Procedures of temporal image decomposition and hierarchy formation are described in mathematical expressions. The multistage system highlights spatial regularities, which are passed through a number of transformational levels to generate a coded representation of the image that encapsulates structure on different hierarchical levels. At each processing stage a single output result is computed to allow a quick response of the system. The result is presented as an activity pattern, which can be compared with previously computed patterns on the basis of the closest match. The idea of the forecasting method is as follows: in the results-synchronization block, network-processed data arrive at a database, from which a sample of the most correlated data is drawn using service parameters of the parallel-hierarchical network.
An object-oriented, coprocessor-accelerated model for ice sheet simulations
NASA Astrophysics Data System (ADS)
Seddik, H.; Greve, R.
2013-12-01
Recently, numerous models capable of modeling the thermodynamics of ice sheets have been developed within the ice sheet modeling community. Their capabilities have been characterized by a wide range of features with different numerical methods (finite difference or finite element), different implementations of the ice flow mechanics (shallow-ice, higher-order, full Stokes) and different treatments for the basal and coastal areas (basal hydrology, basal sliding, ice shelves). Shallow-ice models (SICOPOLIS, IcIES, PISM, etc) have been widely used for modeling whole ice sheets (Greenland and Antarctica) due to the relatively low computational cost of the shallow-ice approximation, but higher-order (ISSM, AIF) and full Stokes (Elmer/Ice) models have recently been used to model the Greenland ice sheet. The advance in processor speed and the decrease in cost for accessing large amounts of memory and storage have undoubtedly been the driving force in the commoditization of models with higher capabilities, and the popularity of Elmer/Ice (http://elmerice.elmerfem.com) with an active user base is a notable representation of this trend. Elmer/Ice is a full Stokes model built on top of the multi-physics package Elmer (http://www.csc.fi/english/pages/elmer) which provides the full machinery for the complex finite element procedure and is fully parallel (mesh partitioning with OpenMPI communication). Elmer is mainly written in Fortran 90 and targets essentially traditional processors, as the code base was not initially written to run on modern coprocessors (yet adding support for the recently introduced x86 based coprocessors is possible). Furthermore, a truly modular and object-oriented implementation is required for quick adaptation to fast evolving capabilities in hardware (Fortran 2003 provides an object-oriented programming model, but it is not clean and would require a tricky refactoring of the Elmer code). In this work, the object-oriented, coprocessor-accelerated finite element code Sainou is introduced. Sainou is an Elmer fork which is reimplemented in Objective C and used for experimenting with ice sheet models running on coprocessors, essentially GPU devices. GPUs are highly parallel processors that provide opportunities for fine-grained parallelization of the full Stokes problem using the standard OpenCL language (http://www.khronos.org/opencl/) to access the device. Sainou is built upon a collection of Objective C base classes that service a modular kernel (itself a base class) which provides the core methods to solve the finite element problem. An early implementation of Sainou will be presented with emphasis on the object architecture and the parallelization strategies. The computation of a simple heat conduction problem is used to test the implementation, which also provides experimental support for running the global matrix assembly on the GPU.
Fully Parallel MHD Stability Analysis Tool
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2014-10-01
Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both the fluid and kinetic plasma models already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iteration algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.
Parallel processing and expert systems
NASA Technical Reports Server (NTRS)
Lau, Sonie; Yan, Jerry C.
1991-01-01
Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited.
Xyce parallel electronic simulator users guide, version 6.1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors), including support for most popular parallel and serial computers; a differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms and allows one to develop new types of analysis without requiring the implementation of analysis-specific device models; device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase: a message-passing parallel implementation that allows it to run efficiently on a wide range of computing platforms, including serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Xyce parallel electronic simulator users' guide, Version 6.0.1.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors), including support for most popular parallel and serial computers; a differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms and allows one to develop new types of analysis without requiring the implementation of analysis-specific device models; device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase: a message-passing parallel implementation that allows it to run efficiently on a wide range of computing platforms, including serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation
Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan
2014-01-01
Through reorganizing the execution order and optimizing the data structure, we propose an efficient parallel framework for an H.264/AVC encoder based on a massively parallel architecture. We implemented the proposed framework with CUDA on NVIDIA's GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose serial optimization methods, including a multiresolution multiwindow for motion estimation, a multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation outperforms the serial program by a speedup ratio of 20 times and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR is from 0.14 dB to 0.77 dB when keeping the same bitrate. Through analysis of the kernels, we found that the speedup ratios of the compute-intensive algorithms are proportional to the computation power of the GPU. However, the performance of the control-intensive parts (CAVLC) is much related to the memory bandwidth, which gives insight into new architecture design. PMID:24757432
Xyce parallel electronic simulator users guide, version 6.0.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors), including support for most popular parallel and serial computers; a differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms and allows one to develop new types of analysis without requiring the implementation of analysis-specific device models; device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase: a message-passing parallel implementation that allows it to run efficiently on a wide range of computing platforms, including serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
[Metabolic study of the initial period of fasting in the king penguin chick].
Cherel, Y; Le Maho, Y
1985-01-01
There is an 80% decrease in the specific daily change in body mass (dm/(m·dt)) during the first 5-6 days of fasting in king penguin chicks, which characterizes period I of fasting. Parallel decreases in plasma alanine and uric acid concentrations suggest an important reduction in protein degradation. Plasma concentrations of beta-hydroxybutyrate and glucose are high, respectively 1.3 and 12.5 mmol l⁻¹, and do not change significantly.
PARAVT: Parallel Voronoi tessellation code
NASA Astrophysics Data System (ADS)
González, R. E.
2016-10-01
In this study, we present a new open source code for massively parallel computation of Voronoi tessellations (VT hereafter) in large data sets. The code is aimed at astrophysical applications, where VT densities and neighbors are widely used. There are several serial Voronoi tessellation codes; however, no open-source, parallel implementation is available to handle the large number of particles/galaxies in current N-body simulations and sky surveys. Parallelization is implemented under MPI, and the VT is computed using the Qhull library. Domain decomposition takes into account consistent boundary computation between tasks, and includes periodic conditions. In addition, the code computes the neighbors list, Voronoi density, Voronoi cell volume, and density gradient for each particle, as well as densities on a regular grid. Code implementation and user guide are publicly available at https://github.com/regonzar/paravt.
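In Python, the per-domain computation that PARAVT distributes over MPI tasks can be mimicked serially with scipy's Qhull bindings; this sketch (toy data, no domain decomposition) just shows how neighbor lists fall out of the tessellation.

    import numpy as np
    from scipy.spatial import Voronoi

    def voronoi_neighbors(points):
        """Neighbor lists from the Voronoi ridges: each ridge
        separates exactly two generating points."""
        vor = Voronoi(points)
        neighbors = {i: set() for i in range(len(points))}
        for i, j in vor.ridge_points:
            neighbors[i].add(j)
            neighbors[j].add(i)
        return neighbors

    pts = np.random.rand(1000, 3)      # toy particle positions
    nbrs = voronoi_neighbors(pts)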
Parallel Monotonic Basin Hopping for Low Thrust Trajectory Optimization
NASA Technical Reports Server (NTRS)
McCarty, Steven L.; McGuire, Melissa L.
2018-01-01
Monotonic Basin Hopping has been shown to be an effective method of solving low thrust trajectory optimization problems. This paper outlines an extension to the common serial implementation by parallelizing it over any number of available compute cores. The Parallel Monotonic Basin Hopping algorithm described herein is shown to be an effective way to more quickly locate feasible solutions, and improve locally optimal solutions in an automated way without requiring a feasible initial guess. The increased speed achieved through parallelization enables the algorithm to be applied to more complex problems that would otherwise be impractical for a serial implementation. Low thrust cislunar transfers and a hybrid Mars example case demonstrate the effectiveness of the algorithm. Finally, a preliminary scaling study quantifies the expected decrease in solve time compared to a serial implementation.
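A minimal flavor of the approach, independent basin hoppers distributed over cores, can be sketched with scipy; the toy objective and hopper count are assumptions, and the paper's algorithm coordinates hops rather than simply taking the best of independent runs.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor
    from scipy.optimize import basinhopping

    def objective(x):
        """Toy multimodal stand-in for a trajectory cost function."""
        return float(np.sum(x**2 + 10.0 * np.sin(3.0 * x) ** 2))

    def one_hopper(seed):
        rng = np.random.default_rng(seed)
        x0 = rng.uniform(-5, 5, size=4)       # random initial guess
        res = basinhopping(objective, x0, niter=200, seed=int(seed))
        return res.fun, list(res.x)

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:   # one hopper per core
            best = min(pool.map(one_hopper, range(8)), key=lambda r: r[0])
        print("best cost:", best[0])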
Parallel grid generation algorithm for distributed memory computers
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Moitra, Anutosh
1994-01-01
A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levels of parallelism on multiple instruction multiple data machines is indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations based on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
NASA Astrophysics Data System (ADS)
Attanasi, F.; Knopf, A.; Parodi, K.; Paganetti, H.; Bortfeld, T.; Rosso, V.; Del Guerra, A.
2011-08-01
The interest in positron emission tomography (PET) as a tool for treatment verification in proton therapy has become widespread in recent years, and several research groups worldwide are currently investigating the clinical implementation. After the first off-line investigation with a PET/CT scanner at MGH (Boston, USA), attention is now focused on an in-room PET application immediately after treatment in order to also detect shorter-lived isotopes, such as O15 and N13, minimizing isotope washout and avoiding patient repositioning errors. Clinical trials are being conducted by means of commercially available PET systems, and other tests are planned using application-dedicated tomographs. Parallel to the experimental investigation and new hardware development, great interest has been shown in the development of fast procedures to provide feedback regarding the delivered dose from reconstructed PET images. Since the thresholds of inelastic nuclear reactions leading to tissue β+-activation fall within the energy range of 15-20 MeV, the distal activity fall-off is correlated, but not directly matched, to the distal fall-off of the dose distribution. Moreover, the physical interactions leading to β+-activation and energy deposition are of a different nature. All these facts make it essential to further develop accurate and fast methodologies capable of predicting, on the basis of the planned dose distribution, expected PET images to be compared with actual PET measurements, thus providing clinical feedback on the correctness of the dose delivery and of the irradiation field position. The aim of this study has been to validate an analytical model and to implement and evaluate it in a fast and flexible framework able to locally predict such activity distributions directly taking the reference planning CT and planned dose as inputs. The results achieved in this study for phantoms and clinical cases highlighted the potential of the implemented method to predict expected activity distributions with great accuracy. Thus, the analytical model can be used as a powerful substitute method to the sensitive and time-consuming Monte Carlo approach.
Parallelization of elliptic solver for solving 1D Boussinesq model
NASA Astrophysics Data System (ADS)
Tarwidi, D.; Adytia, D.
2018-03-01
In this paper, a parallel implementation of an elliptic solver for the 1D Boussinesq model is presented. The numerical solution of the Boussinesq model is obtained by applying a staggered grid scheme to the continuity, momentum, and elliptic equations of the model. The tridiagonal system emerging from the numerical scheme of the elliptic equation is solved by the cyclic reduction algorithm. The parallel implementation of cyclic reduction is executed on multicore processors with shared memory architectures using OpenMP. To measure the performance of the parallel program, the number of grid points is varied from 2⁸ to 2¹⁴. Two test cases of numerical experiment, i.e. propagation of a solitary and a standing wave, are proposed to evaluate the parallel program. The numerical results are verified against the analytical solutions of the solitary and standing waves. The best speedup of the solitary and standing wave test cases is about 2.07 with 2¹⁴ grid points and 1.86 with 2¹³ grid points, respectively, both obtained using 8 threads. Moreover, the best efficiency of the parallel program is 76.2% and 73.5% for the solitary and standing wave test cases, respectively.
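For reference, a compact cyclic reduction solver is sketched below (assuming n = 2**m - 1 unknowns and zeroed boundary coefficients a[0] and c[-1]); within each level every update is independent, which is exactly the loop the OpenMP version parallelizes.

    import numpy as np

    def cyclic_reduction(a, b, c, d):
        """Solve the tridiagonal system (sub a, diag b, super c) x = d."""
        a, b, c, d = (np.array(v, dtype=float) for v in (a, b, c, d))
        n = len(b)
        m = int(np.log2(n + 1))
        assert n == 2**m - 1, "sketch assumes n = 2**m - 1"
        x = np.zeros(n)
        s = 1
        for _ in range(m - 1):                    # elimination levels
            idx = np.arange(2 * s - 1, n, 2 * s)  # rows kept this level
            al = a[idx] / b[idx - s]
            ga = c[idx] / b[idx + s]
            d[idx] -= al * d[idx - s] + ga * d[idx + s]
            b[idx] -= al * c[idx - s] + ga * a[idx + s]
            a[idx] = -al * a[idx - s]
            c[idx] = -ga * c[idx + s]
            s *= 2
        x[n // 2] = d[n // 2] / b[n // 2]         # lone remaining unknown
        while s > 1:                              # back-substitution levels
            s //= 2
            idx = np.arange(s - 1, n, 2 * s)
            left = np.where(idx >= s, x[np.clip(idx - s, 0, n - 1)], 0.0)
            right = np.where(idx + s < n, x[np.clip(idx + s, 0, n - 1)], 0.0)
            x[idx] = (d[idx] - a[idx] * left - c[idx] * right) / b[idx]
        return x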
GPU accelerated dynamic functional connectivity analysis for functional MRI data.
Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu
2015-07-01
Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize the CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding-windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. The multicore implementation using OpenMP on an 8-core processor provides up to a 7.7× speed-up. The GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. The proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerate the DFC analyses significantly. The developed algorithms make DFC analyses more practical for multi-subject studies with more dynamic analyses. Copyright © 2015 Elsevier Ltd. All rights reserved.
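The basic sliding-window computation that both approaches accelerate can be written in a few lines; the window length and toy data below are illustrative, and each window (the unit parallelized across threads or CUDA blocks) is independent of the others.

    import numpy as np

    def sliding_window_fc(ts, win=40, step=1):
        """Dynamic functional connectivity: one region-by-region
        correlation matrix per sliding window of `ts` (time x regions)."""
        T, _ = ts.shape
        starts = range(0, T - win + 1, step)
        return np.stack([np.corrcoef(ts[s:s + win].T) for s in starts])

    fmri = np.random.randn(200, 10)        # 200 volumes, 10 regions (toy)
    fc = sliding_window_fc(fmri)           # shape (161, 10, 10)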
Plana-Ruiz, S; Portillo, J; Estradé, S; Peiró, F; Kolb, Ute; Nicolopoulos, S
2018-06-06
A general method for setting illumination conditions with selectable beam convergence and probe size is presented in this work for transmission electron microscopes (TEM) fitted with µs/pixel fast beam-scanning control, (S)TEM, and an annular dark-field detector. The combination of beam convergence and probe size that enables diffraction-pattern indexation is then used as a starting point to add 100 Hz precession to the beam while imaging the specimen at a fast rate and keeping the projector system in diffraction mode. The described systematic alignment method for the adjustment of beam precession on the specimen plane while scanning at fast rates is mainly based on the sharpness of the precessed STEM image. The complete alignment method for parallel condition and precession, Quasi-Parallel PED-STEM, is presented as a block diagram, as it has been tested on a variety of instruments. The immediate application of this methodology is that it renders the TEM column ready for the acquisition of Precessed Electron Diffraction Tomographies (EDT) as well as for the acquisition of slow Precessed Scanning Nanometer Electron Diffraction (SNED). Examples of the quality of the Precessed Electron Diffraction (PED) patterns and PED-STEM alignment images are presented with corresponding probe sizes and convergence angles. Copyright © 2018. Published by Elsevier B.V.
Graphics applications utilizing parallel processing
NASA Technical Reports Server (NTRS)
Rice, John R.
1990-01-01
The results are presented of research conducted to develop a parallel graphics application algorithm to depict the numerical solution of the 1-D wave equation, the vibrating string. The research was conducted on a Flexible Flex/32 multiprocessor and a Sequent Balance 21000 multiprocessor. The wave equation is implemented using the finite difference method. The synchronization issues that arose from the parallel implementation and the strategies used to alleviate the effects of the synchronization overhead are discussed.
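The underlying computation is the classic leapfrog update; a serial sketch follows (grid size and initial pluck are arbitrary choices), where every interior point in the spatial update can be computed independently, which is the parallelism the graphics application exploits.

    import numpy as np

    def vibrating_string(n=100, steps=500, c=1.0, dx=0.01, dt=0.005):
        """Finite-difference solution of u_tt = c^2 u_xx with fixed
        ends; stable while (c*dt/dx)**2 <= 1."""
        r2 = (c * dt / dx) ** 2
        x = np.linspace(0.0, n * dx, n + 1)
        u_prev = np.sin(np.pi * x / x[-1])    # initial displacement
        u = u_prev.copy()                     # zero initial velocity
        for _ in range(steps):
            u_next = np.zeros_like(u)
            u_next[1:-1] = (2 * u[1:-1] - u_prev[1:-1]
                            + r2 * (u[2:] - 2 * u[1:-1] + u[:-2]))
            u_prev, u = u, u_next             # endpoints stay pinned
        return x, u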
Toward real-time Monte Carlo simulation using a commercial cloud computing infrastructure
NASA Astrophysics Data System (ADS)
Wang, Henry; Ma, Yunzhi; Pratx, Guillem; Xing, Lei
2011-09-01
Monte Carlo (MC) methods are the gold standard for modeling photon and electron transport in a heterogeneous medium; however, their computational cost prohibits their routine use in the clinic. Cloud computing, wherein computing resources are allocated on-demand from a third party, is a new approach for high performance computing and is implemented to perform ultra-fast MC calculation in radiation therapy. We deployed the EGS5 MC package in a commercial cloud environment. Launched from a single local computer with Internet access, a Python script allocates a remote virtual cluster. A handshaking protocol designates master and worker nodes. The EGS5 binaries and the simulation data are initially loaded onto the master node. The simulation is then distributed among independent worker nodes via the message passing interface, and the results aggregated on the local computer for display and data analysis. The described approach is evaluated for pencil beams and broad beams of high-energy electrons and photons. The output of cloud-based MC simulation is identical to that produced by single-threaded implementation. For 1 million electrons, a simulation that takes 2.58 h on a local computer can be executed in 3.3 min on the cloud with 100 nodes, a 47× speed-up. Simulation time scales inversely with the number of parallel nodes. The parallelization overhead is also negligible for large simulations. Cloud computing represents one of the most important recent advances in supercomputing technology and provides a promising platform for substantially improved MC simulation. In addition to the significant speed up, cloud computing builds a layer of abstraction for high performance parallel computing, which may change the way dose calculations are performed and radiation treatment plans are completed. This work was presented in part at the 2010 Annual Meeting of the American Association of Physicists in Medicine (AAPM), Philadelphia, PA.
NASA Astrophysics Data System (ADS)
Schultz, A.
2010-12-01
3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provide the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleaved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
Partitioning in parallel processing of production systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oflazer, K.
1987-01-01
This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpreter with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.
Genetic algorithms using SISAL parallel programming language
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tejada, S.
1994-05-06
Genetic algorithms are a mathematical optimization technique developed by John Holland at the University of Michigan [1]. The SISAL programming language possesses many of the characteristics desired to implement genetic algorithms. SISAL is a deterministic, functional programming language which is inherently parallel. Because SISAL is functional and based on mathematical concepts, genetic algorithms can be efficiently translated into the language. Several of the steps involved in genetic algorithms, such as mutation, crossover, and fitness evaluation, can be parallelized using SISAL. In this paper I will discuss the implementation and performance of parallel genetic algorithms in SISAL.
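Since SISAL compilers are no longer widely available, the sketch below shows the same parallel pattern (concurrent fitness evaluation, crossover, mutation) in Python with a process pool; it is a stand-in for the paper's SISAL code, with a toy bit-counting objective.

```python
# Toy GA with parallel fitness evaluation; the steps mirror those named in
# the abstract (fitness, crossover, mutation), not any SISAL implementation.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

GENES, POP, GENERATIONS = 32, 64, 40

def fitness(bits):
    """Toy objective: maximise the number of 1-bits."""
    return int(bits.sum())

def evolve():
    rng = np.random.default_rng(0)
    pop = rng.integers(0, 2, (POP, GENES))
    with ProcessPoolExecutor() as pool:
        for _ in range(GENERATIONS):
            scores = list(pool.map(fitness, pop))      # parallel evaluation
            order = np.argsort(scores)[::-1]
            parents = pop[order[: POP // 2]]           # truncation selection
            cuts = rng.integers(1, GENES, POP // 2)    # one-point crossover
            kids = np.array([np.concatenate((parents[i][:c],
                                             parents[(i + 1) % (POP // 2)][c:]))
                             for i, c in enumerate(cuts)])
            flip = rng.random(kids.shape) < 0.01       # bit-flip mutation
            kids[flip] ^= 1
            pop = np.vstack((parents, kids))
    return pop[np.argmax([fitness(p) for p in pop])]

if __name__ == "__main__":     # guard required when using process pools
    print(evolve())
```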
Knowledge representation into Ada parallel processing
NASA Technical Reports Server (NTRS)
Masotto, Tom; Babikyan, Carol; Harper, Richard
1990-01-01
The Knowledge Representation into Ada Parallel Processing project is a joint NASA and Air Force funded project to demonstrate the execution of intelligent systems in Ada on the Charles Stark Draper Laboratory fault-tolerant parallel processor (FTPP). Two applications were demonstrated - a portion of the adaptive tactical navigator and a real-time controller. Both systems are implemented as Activation Framework Objects on the Activation Framework intelligent scheduling mechanism developed by Worcester Polytechnic Institute. The implementations and the results of performance analyses, which show speedup due to parallelism and initial efficiency improvements, are detailed, and further areas for performance improvement are suggested.
Friedrich, Joachim; Coriani, Sonia; Helgaker, Trygve; Dolg, Michael
2009-10-21
A fully automated parallelized implementation of the incremental scheme for coupled-cluster singles-and-doubles (CCSD) energies has been extended to treat molecular (unrelaxed) first-order one-electron properties such as the electric dipole and quadrupole moments. The convergence and accuracy of the incremental approach for the dipole and quadrupole moments have been studied for a variety of chemically interesting systems. It is found that the electric dipole moment can be obtained to within 5% and 0.5% accuracy with respect to the exact CCSD value at the third and fourth orders of the expansion, respectively. Furthermore, we find that the incremental expansion of the quadrupole moment converges to the exact result with increasing order of the expansion: the convergence of nonaromatic compounds is fast, with errors less than 16 mau and less than 1 mau at third and fourth orders, respectively (1 mau = 10^-3 e a_0^2); the aromatic compounds converge slowly, with maximum absolute deviations of 174 and 72 mau at third and fourth orders, respectively.
Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent
De Sa, Christopher; Feldman, Matthew; Ré, Christopher; Olukotun, Kunle
2018-01-01
Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems. PMID:29391770
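The core idea of a low-precision SGD step can be sketched briefly. The toy below keeps an int8 model with a fixed scale and applies updates with unbiased stochastic rounding, in the spirit of Buckwild! but not the paper's actual kernels; the scale, learning rate, and problem are invented.

```python
# Toy low-precision SGD: float gradients, int8 model, stochastic rounding.
import numpy as np

rng = np.random.default_rng(0)
SCALE = 1 / 128.0                     # value represented by one int8 LSB

def stochastic_round(x):
    """Unbiased rounding: round up with probability equal to the fraction."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

def sgd_step_int8(w_int8, grad, lr=0.05):
    delta = -lr * grad / SCALE        # update expressed in LSB units
    w = w_int8 + stochastic_round(delta)
    return np.clip(w, -128, 127).astype(np.int8)

# Toy least-squares problem: minimise ||Xw - y||^2 with an int8 model.
X = rng.standard_normal((256, 8))
w_true = rng.uniform(-0.9, 0.9, 8)
y = X @ w_true
w_q = np.zeros(8, dtype=np.int8)
for i in rng.integers(0, 256, 3000):  # one sampled row per step
    grad = 2 * (X[i] @ (w_q * SCALE) - y[i]) * X[i]
    w_q = sgd_step_int8(w_q, grad)
print("recovered:", np.round(w_q * SCALE, 2))
```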
A spiking neural network model of 3D perception for event-based neuromorphic stereo vision systems
Osswald, Marc; Ieng, Sio-Hoi; Benosman, Ryad; Indiveri, Giacomo
2017-01-01
Stereo vision is an important feature that enables machine vision systems to perceive their environment in 3D. While machine vision has spawned a variety of software algorithms to solve the stereo-correspondence problem, their implementation and integration in small, fast, and efficient hardware vision systems remains a difficult challenge. Recent advances made in neuromorphic engineering offer a possible solution to this problem, with the use of a new class of event-based vision sensors and neural processing devices inspired by the organizing principles of the brain. Here we propose a radically novel model that solves the stereo-correspondence problem with a spiking neural network that can be directly implemented with massively parallel, compact, low-latency and low-power neuromorphic engineering devices. We validate the model with experimental results, highlighting features that are in agreement with both computational neuroscience stereo vision theories and experimental findings. We demonstrate its features with a prototype neuromorphic hardware system and provide testable predictions on the role of spike-based representations and temporal dynamics in biological stereo vision processing systems. PMID:28079187
Implementation of ADI: Schemes on MIMD parallel computers
NASA Technical Reports Server (NTRS)
Vanderwijngaart, Rob F.
1993-01-01
In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPU's) by having many off-the-shelf CPU's work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI-algorithm for the three dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPU's in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.
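For concreteness, a minimal sketch of one ADI (Peaceman-Rachford) step for the 2D heat equation is shown below; the grid, time step, and zero Dirichlet boundaries are illustrative assumptions, and this is not the project's curvilinear-grid implementation.

```python
# One Peaceman-Rachford ADI step for u_t = a*Laplacian(u): an implicit
# tridiagonal sweep along x, then along y, each paired with an explicit term.
import numpy as np
from scipy.linalg import solve_banded

N, a, dt = 64, 1.0, 1e-3
h = 1.0 / (N + 1)
r = a * dt / (2 * h**2)

def implicit_sweep(u):
    """Solve (I - r*D2) u_new = u along axis 0 for every column at once."""
    ab = np.zeros((3, N))                 # banded storage of I - r*D2
    ab[0, 1:] = -r                        # superdiagonal
    ab[1, :] = 1 + 2 * r                  # main diagonal
    ab[2, :-1] = -r                       # subdiagonal
    return solve_banded((1, 1), ab, u)

def explicit_term(u):
    """(I + r*D2) u along axis 1, zero Dirichlet values outside the grid."""
    out = (1 - 2 * r) * u
    out[:, 1:] += r * u[:, :-1]
    out[:, :-1] += r * u[:, 1:]
    return out

u = np.random.default_rng(0).random((N, N))   # initial temperature field
u = implicit_sweep(explicit_term(u))           # x-implicit half step
u = implicit_sweep(explicit_term(u.T)).T       # y-implicit half step
```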
The design and implementation of a parallel unstructured Euler solver using software primitives
NASA Technical Reports Server (NTRS)
Das, R.; Mavriplis, D. J.; Saltz, J.; Gupta, S.; Ponnusamy, R.
1992-01-01
This paper is concerned with the implementation of a three-dimensional unstructured grid Euler-solver on massively parallel distributed-memory computer architectures. The goal is to minimize solution time by achieving high computational rates with a numerically efficient algorithm. An unstructured multigrid algorithm with an edge-based data structure has been adopted, and a number of optimizations have been devised and implemented in order to accelerate the parallel communication rates. The implementation is carried out by creating a set of software tools, which provide an interface between the parallelization issues and the sequential code, while providing a basis for future automatic run-time compilation support. Large practical unstructured grid problems are solved on the Intel iPSC/860 hypercube and Intel Touchstone Delta machine. The quantitative effects of the various optimizations are demonstrated, and we show that the combined effect of these optimizations leads to roughly a factor of three performance improvement. The overall solution efficiency is compared with that obtained on the CRAY-YMP vector supercomputer.
Parallel implementation of geometrical shock dynamics for two dimensional converging shock waves
NASA Astrophysics Data System (ADS)
Qiu, Shi; Liu, Kuang; Eliasson, Veronica
2016-10-01
Geometrical shock dynamics (GSD) theory is an appealing method to predict shock motion in the sense that it is more computationally efficient than solving the traditional Euler equations, especially for converging shock waves. However, to solve and optimize large scale configurations, the main bottleneck is the computational cost. Among the existing numerical GSD schemes, only one has been implemented on parallel computers, with the purpose of analyzing detonation waves. To extend the computational advantage of the GSD theory to more general applications such as converging shock waves, a numerical implementation using a spatial decomposition method has been coupled with a front tracking approach on parallel computers. In addition, an efficient tridiagonal system solver for massively parallel computers has been applied to resolve the most expensive function in this implementation, resulting in an efficiency of 0.93 while using 32 HPCC cores. Moreover, symmetric boundary conditions have been developed to further reduce the computational cost, achieving a speedup of 19.26 for a 12-sided polygonal converging shock.
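The tridiagonal solves at the heart of such schemes build on the serial Thomas algorithm, sketched below; parallel variants partition the system into per-processor blocks plus a small interface system, which this toy omits.

```python
# Thomas algorithm: O(n) direct solve of a tridiagonal system.
import numpy as np

def thomas(a, b, c, d):
    """Solve tridiagonal system: a = sub-, b = main, c = super-diagonal."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                 # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Quick check against a dense solve on a random diagonally dominant system.
n = 8
rng = np.random.default_rng(1)
a = np.r_[0.0, rng.random(n - 1)]
b = 2 + rng.random(n)
c = np.r_[rng.random(n - 1), 0.0]
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
d = rng.random(n)
assert np.allclose(thomas(a, b, c, d), np.linalg.solve(A, d))
```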
NASA Astrophysics Data System (ADS)
Liska, Sebastian; Colonius, Tim
2017-02-01
A new parallel, computationally efficient immersed boundary method for solving three-dimensional, viscous, incompressible flows on unbounded domains is presented. Immersed surfaces with prescribed motions are generated using the interpolation and regularization operators obtained from the discrete delta function approach of the original (Peskin's) immersed boundary method. Unlike Peskin's method, boundary forces are regarded as Lagrange multipliers that are used to satisfy the no-slip condition. The incompressible Navier-Stokes equations are discretized on an unbounded staggered Cartesian grid and are solved in a finite number of operations using lattice Green's function techniques. These techniques are used to automatically enforce the natural free-space boundary conditions and to implement a novel block-wise adaptive grid that significantly reduces the run-time cost of solutions by limiting operations to grid cells in the immediate vicinity and near-wake region of the immersed surface. These techniques also enable the construction of practical discrete viscous integrating factors that are used in combination with specialized half-explicit Runge-Kutta schemes to accurately and efficiently solve the differential algebraic equations describing the discrete momentum equation, incompressibility constraint, and no-slip constraint. Linear systems of equations resulting from the time integration scheme are efficiently solved using an approximation-free nested projection technique. The algebraic properties of the discrete operators are used to reduce projection steps to simple discrete elliptic problems, e.g. discrete Poisson problems, that are compatible with recent parallel fast multipole methods for difference equations. Numerical experiments on low-aspect-ratio flat plates and spheres at Reynolds numbers up to 3700 are used to verify the accuracy and physical fidelity of the formulation.
Fast Simulation of Dynamic Ultrasound Images Using the GPU.
Storve, Sigurd; Torp, Hans
2017-10-01
Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
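The convolution-based approach can be sketched compactly. The toy below bins scatterer amplitudes onto a fast-time grid, applies the subsample delay as an IQ phase rotation as described, and convolves with a pulse; all signal parameters are invented, and this is not the authors' GPU kernel.

```python
# Toy convolution-based scan-line simulator in the spirit of COLE.
import numpy as np

fs, f0, c = 40e6, 2.5e6, 1540.0            # sampling rate, centre freq, m/s
t = np.arange(-2e-6, 2e-6, 1 / fs)
pulse = np.exp(-(t * f0 * 2) ** 2) * np.exp(2j * np.pi * f0 * t)  # IQ pulse

rng = np.random.default_rng(0)
depths = rng.uniform(0.01, 0.04, 500)      # scatterer depths (m)
amps = rng.standard_normal(500)            # scatterer amplitudes

n_samples = 4096
line = np.zeros(n_samples, dtype=complex)
delay = 2 * depths / c * fs                # round-trip delay in samples
idx = delay.astype(int)
frac = delay - idx                         # subsample part of the delay
# Phase-based subsample delay: rotate the IQ phase instead of resampling.
np.add.at(line, idx, amps * np.exp(-2j * np.pi * f0 * frac / fs))

rf_iq = np.convolve(line, pulse, mode="same")
envelope = np.abs(rf_iq)                   # B-mode envelope of one scan line
```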
Genotypic tropism testing by massively parallel sequencing: qualitative and quantitative analysis.
Däumer, Martin; Kaiser, Rolf; Klein, Rolf; Lengauer, Thomas; Thiele, Bernhard; Thielen, Alexander
2011-05-13
Inferring viral tropism from genotype is a fast and inexpensive alternative to phenotypic testing. While highly predictive when performed on clonal samples, the sensitivity of predicting CXCR4-using (X4) variants drops substantially in clinical isolates. This is mainly attributed to minor variants not detected by standard bulk sequencing. Massively parallel sequencing (MPS) detects single clones and is thereby much more sensitive. Using this technology, we wanted to improve the genotypic prediction of coreceptor usage. Plasma samples from 55 antiretroviral-treated patients tested for coreceptor usage with the Monogram Trofile Assay were sequenced with standard population-based approaches. Fourteen of these samples were selected for further analysis with MPS. Tropism was predicted from each sequence with geno2pheno[coreceptor]. Prediction based on bulk sequencing yielded 59.1% sensitivity and 90.9% specificity compared to the Trofile assay. With MPS, 7600 reads were generated on average per isolate. Minorities of sequences with high confidence in CXCR4 usage were found in all samples, irrespective of phenotype. When using the default false-positive rate of geno2pheno[coreceptor] (10%), and defining a minority cutoff of 5%, the results were concordant in all but one isolate. The combination of MPS and coreceptor usage prediction results in a fast and accurate alternative to phenotypic assays. The detection of X4 viruses in all isolates suggests that coreceptor usage as well as fitness of minorities is important for therapy outcome. The high sensitivity of this technology in combination with a quantitative description of the viral population may allow implementing meaningful cutoffs for predicting response to CCR5 antagonists in the presence of X4 minorities.
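The calling rule described above (a 10% false-positive-rate cutoff per read, a 5% minority cutoff per sample) reduces to a few lines; the sketch below uses synthetic per-read scores in place of geno2pheno[coreceptor] output.

```python
# Sample-level tropism call from per-read false-positive-rate (FPR) scores.
import random

def call_sample_tropism(read_fprs, fpr_cutoff=0.10, minority_cutoff=0.05):
    """A read is called X4 when its FPR falls below the cutoff; the sample
    is called X4 when X4 reads exceed the minority threshold."""
    x4_reads = sum(1 for fpr in read_fprs if fpr < fpr_cutoff)
    fraction = x4_reads / len(read_fprs)
    return ("X4" if fraction >= minority_cutoff else "R5", fraction)

# Example: 7600 reads, roughly 6% of them scored as X4 by the predictor.
random.seed(0)
fprs = [random.uniform(0.0, 0.09) if random.random() < 0.06
        else random.uniform(0.11, 1.0) for _ in range(7600)]
print(call_sample_tropism(fprs))   # -> ('X4', ~0.06)
```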
A pragmatic cluster randomised trial evaluating three implementation interventions.
Rycroft-Malone, Jo; Seers, Kate; Crichton, Nicola; Chandler, Jackie; Hawkes, Claire A; Allen, Claire; Bullock, Ian; Strunin, Leo
2012-08-30
Implementation research is concerned with bridging the gap between evidence and practice through the study of methods to promote the uptake of research into routine practice. Good quality evidence has been summarised into guideline recommendations to show that peri-operative fasting times could be considerably shorter than patients currently experience. The objective of this trial was to evaluate the effectiveness of three strategies for the implementation of recommendations about peri-operative fasting. A pragmatic cluster randomised trial underpinned by the PARIHS framework was conducted during 2006 to 2009 with a national sample of UK hospitals using time series with mixed methods process evaluation and cost analysis. Hospitals were randomised to one of three interventions: standard dissemination (SD) of a guideline package, SD plus a web-based resource championed by an opinion leader, and SD plus plan-do-study-act (PDSA). The primary outcome was duration of fluid fast prior to induction of anaesthesia. Secondary outcomes included duration of food fast, patients' experiences, and stakeholders' experiences of implementation, including influences. ANOVA was used to test differences over time and interventions. Nineteen acute NHS hospitals participated. Across timepoints, 3,505 duration of fasting observations were recorded. No significant effect of the interventions was observed for either fluid or food fasting times. The effect size was 0.33 for the web-based intervention compared to SD alone for the change in fluid fasting and was 0.12 for PDSA compared to SD alone. The process evaluation showed different types of impact, including changes to practices, policies, and attitudes. A rich picture of the implementation challenges emerged, including inter-professional tensions and a lack of clarity for decision-making authority and responsibility. This was a large, complex study and one of the first national randomised controlled trials conducted within acute care in implementation research. The evidence base for fasting practice was accepted by those participating in this study and the messages from it simple; however, implementation and practical challenges influenced the interventions' impact. A set of conditions for implementation emerges from the findings of this study, which are presented as theoretically transferable propositions that have international relevance. ISRCTN18046709--Peri-operative Implementation Study Evaluation (POISE).
An implementation of cellular automaton model for single-line train working diagram
NASA Astrophysics Data System (ADS)
Hua, Wei; Liu, Jun
2006-04-01
According to the railway transportation system's characteristics, a new cellular automaton model for the single-line railway system is presented in this paper. Based on this model, several simulations were done to imitate train operation under three working diagrams. From a different angle, the results show how the organization of train operation impacts the railway carrying capacity. Under the non-parallel train working diagram, the influence of fast trains on slow trains is found to be the strongest. Many slow trains have to wait in-between neighbouring stations to let the fast train(s) pass through first, so a slow train advances like a wave propagating from the departure station to the arrival station. This also resembles a jammed traffic flow on a highway. Furthermore, the nonuniformity of travel times between sections also greatly limits the railway carrying capacity. After converting the nonuniform sections into sections with uniform travel times, while keeping the total travel time unchanged, all three carrying capacities are improved greatly, as shown by simulation. This also shows that the cellular automaton model is an effective and feasible way to investigate the railway transportation system.
NASA Astrophysics Data System (ADS)
Tolson, B.; Matott, L. S.; Gaffoor, T. A.; Asadzadeh, M.; Shafii, M.; Pomorski, P.; Xu, X.; Jahanpour, M.; Razavi, S.; Haghnegahdar, A.; Craig, J. R.
2015-12-01
We introduce asynchronous parallel implementations of the Dynamically Dimensioned Search (DDS) family of algorithms, including DDS, discrete DDS, PA-DDS and DDS-AU. These parallel algorithms are unique among existing parallel optimization algorithms in the water resources field in that parallel DDS is asynchronous and does not require an entire population (set of candidate solutions) to be evaluated before generating and then sending a new candidate solution for evaluation. One key advance in this study is developing the first parallel PA-DDS multi-objective optimization algorithm. The other key advance is enhancing the computational efficiency of solving optimization problems (such as model calibration) by combining a parallel optimization algorithm with the deterministic model pre-emption concept. These two efficiency techniques can only be combined because of the asynchronous nature of parallel DDS. Model pre-emption terminates simulation model runs early (for example, prior to completely simulating the model calibration period) when intermediate results indicate the candidate solution is so poor that it will definitely have no influence on the generation of further candidate solutions. The computational savings of deterministic model pre-emption available in serial implementations of population-based algorithms (e.g., PSO) disappear in synchronous parallel implementations of these algorithms. In addition to the key advances above, we implement the algorithms across a range of computation platforms (Windows and Unix-based operating systems, from multi-core desktops to a supercomputer system) and package them for future modellers within a model-independent calibration software package called Ostrich, as well as in MATLAB versions. Results across multiple platforms and multiple case studies (from 4 to 64 processors) demonstrate the vast improvement over serial DDS-based algorithms and highlight the important role model pre-emption plays in the performance of parallel, pre-emptable DDS algorithms. Case studies include single- and multiple-objective optimization problems in water resources model calibration, and in many cases linear or near-linear speedups are observed.
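The DDS candidate-generation rule that all of these variants share can be sketched briefly. The toy below shows the shrinking perturbation-probability schedule on a serial loop, with simplified bound handling (canonical DDS reflects at bounds); the parallel versions above farm these candidates out asynchronously to workers.

```python
# One DDS candidate: perturb a shrinking random subset of dimensions.
import numpy as np

rng = np.random.default_rng(0)

def dds_candidate(x_best, lo, hi, i, max_iter, r=0.2):
    n = len(x_best)
    p_select = 1.0 - np.log(i) / np.log(max_iter)   # shrinks over iterations
    mask = rng.random(n) < p_select
    if not mask.any():
        mask[rng.integers(n)] = True                # always perturb one dim
    x = x_best.copy()
    x[mask] += r * (hi[mask] - lo[mask]) * rng.standard_normal(mask.sum())
    return np.clip(x, lo, hi)    # simplified bounds (canonical DDS reflects)

# Toy serial use: minimise a quadratic over [0, 10]^5.
lo, hi = np.zeros(5), np.full(5, 10.0)
f = lambda x: ((x - 3.0) ** 2).sum()
x_best = (lo + hi) / 2
for i in range(1, 201):
    x = dds_candidate(x_best, lo, hi, i, 200)
    if f(x) < f(x_best):
        x_best = x
print(np.round(x_best, 2))       # close to [3, 3, 3, 3, 3]
```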
Real-time processing of radar return on a parallel computer
NASA Technical Reports Server (NTRS)
Aalfs, David D.
1992-01-01
NASA is working with the FAA to demonstrate the feasibility of pulse Doppler radar as a candidate airborne sensor to detect low altitude windshears. The need to provide the pilot with timely information about possible hazards has motivated a demand for real-time processing of a radar return. Investigated here is parallel processing as a means of accommodating the high data rates required. A PC-based parallel computer, built from transputers, is used to investigate issues in real-time concurrent processing of radar signals. A transputer network is made up of an array of single instruction stream processors that can be networked in a variety of ways. They are easily reconfigured and software development is largely independent of the particular network topology. The performance of the transputer is evaluated in light of the computational requirements. A number of algorithms have been implemented on the transputers in OCCAM, a language specially designed for parallel processing. These include signal processing algorithms such as the Fast Fourier Transform (FFT), pulse-pair, and autoregressive modeling, as well as routing software to support concurrency. The most computationally intensive task is estimating the spectrum. Two approaches have been taken on this problem, the first and most conventional of which is to use the FFT. By using table look-ups for the basis function and other optimizing techniques, an algorithm has been developed that is fast enough for real-time use. The other approach is to model the signal as an autoregressive process and estimate the spectrum based on the model coefficients. This technique is attractive because it does not suffer from the spectral leakage problem inherent in the FFT. Benchmark tests indicate that autoregressive modeling is feasible in real time.
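The autoregressive alternative can be sketched in a few lines: fit AR coefficients from the autocorrelation with the Levinson-Durbin recursion and read the spectrum off the model. The signal and model order below are invented, and this is not the OCCAM transputer code.

```python
# AR spectral estimate: Levinson-Durbin fit, then evaluate the model PSD.
import numpy as np

def levinson(rxx, order):
    """Levinson-Durbin recursion: AR coefficients from autocorrelation."""
    a = np.zeros(order + 1); a[0] = 1.0
    err = rxx[0]
    for m in range(1, order + 1):
        k = -(rxx[m] + a[1:m] @ rxx[m - 1:0:-1]) / err
        a[1:m + 1] += k * a[m - 1::-1][:m]
        err *= 1 - k * k
    return a, err

fs = 1000.0
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 123 * t) \
    + 0.5 * np.random.default_rng(0).standard_normal(t.size)

order = 8
rxx = np.correlate(x, x, "full")[x.size - 1:x.size + order] / x.size
a, err = levinson(rxx, order)

freqs = np.linspace(0, fs / 2, 512)
z = np.exp(-2j * np.pi * freqs / fs)
psd = err / np.abs(np.polyval(a[::-1], z)) ** 2   # AR power spectrum
print("peak near", freqs[np.argmax(psd)], "Hz")   # expect ~123 Hz
```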
Solar Wind Proton Temperature Anisotropy: Linear Theory and WIND/SWE Observations
NASA Technical Reports Server (NTRS)
Hellinger, P.; Travnicek, P.; Kasper, J. C.; Lazarus, A. J.
2006-01-01
We present a comparison between WIND/SWE observations (Kasper et al., 2006) of beta parallel to p and T perpendicular to p/T parallel to p (where beta parallel to p is the proton parallel beta, and T perpendicular to p and T parallel to p are the perpendicular and parallel proton temperatures, respectively; here parallel and perpendicular indicate directions with respect to the ambient magnetic field) and predictions of the Vlasov linear theory. In the slow solar wind, the observed proton temperature anisotropy seems to be constrained by oblique instabilities, the mirror instability and the oblique fire hose, contrary to the results of the linear theory, which predicts a dominance of the proton cyclotron instability and the parallel fire hose. The fast solar wind core protons exhibit an anticorrelation between beta parallel to c and T perpendicular to c/T parallel to c (where beta parallel to c is the core proton parallel beta, and T perpendicular to c and T parallel to c are the perpendicular and parallel core proton temperatures, respectively), similar to that observed in the HELIOS data (Marsch et al., 2004).
Integrated Task and Data Parallel Programming
NASA Technical Reports Server (NTRS)
Grimshaw, A. S.
1998-01-01
This research investigates the combination of task and data parallel language constructs within a single programming language. There are a number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments: In February I presented a paper at Frontiers 1995 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda: Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities: During the fall I collaborated with Andrew Grimshaw and Adam Ferrari to write a book chapter which will be included in Parallel Processing in C++ edited by Gregory Wilson. I also finished two courses, Compilers and Advanced Compilers, in 1995. These courses complete my class requirements at the University of Virginia. I have only my dissertation research and defense to complete.
Integrated Task And Data Parallel Programming: Language Design
NASA Technical Reports Server (NTRS)
Grimshaw, Andrew S.; West, Emily A.
1998-01-01
This research investigates the combination of task and data parallel language constructs within a single programming language. There are a number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments: In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda: Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities: During the fall I collaborated with Andrew Grimshaw and Adam Ferrari to write a book chapter which will be included in Parallel Processing in C++ edited by Gregory Wilson. I also finished two courses, Compilers and Advanced Compilers, in 1995. These courses complete my class requirements at the University of Virginia. I have only my dissertation research and defense to complete.
Fast Fourier Transform algorithm design and tradeoffs
NASA Technical Reports Server (NTRS)
Kamin, Ray A., III; Adams, George B., III
1988-01-01
The Fast Fourier Transform (FFT) is a mainstay of certain numerical techniques for solving fluid dynamics problems. The Connection Machine CM-2 is the target for an investigation into the design of multidimensional Single Instruction Stream/Multiple Data (SIMD) parallel FFT algorithms for high performance. Critical algorithm design issues are discussed, necessary machine performance measurements are identified and made, and the performance of the developed FFT programs are measured. Fast Fourier Transform programs are compared to the currently best Cray-2 FFT program.
ColDICE: A parallel Vlasov–Poisson solver using moving adaptive simplicial tessellation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sousbie, Thierry, E-mail: tsousbie@gmail.com; Department of Physics, The University of Tokyo, Tokyo 113-0033; Research Center for the Early Universe, School of Science, The University of Tokyo, Tokyo 113-0033
2016-09-15
Resolving numerically Vlasov–Poisson equations for initially cold systems can be reduced to following the evolution of a three-dimensional sheet evolving in six-dimensional phase-space. We describe a public parallel numerical algorithm consisting in representing the phase-space sheet with a conforming, self-adaptive simplicial tessellation of which the vertices follow the Lagrangian equations of motion. The algorithm is implemented both in six- and four-dimensional phase-space. Refinement of the tessellation mesh is performed using the bisection method and a local representation of the phase-space sheet at second order relying on additional tracers created when needed at runtime. In order to preserve in the best way the Hamiltonian nature of the system, refinement is anisotropic and constrained by measurements of local Poincaré invariants. Resolution of the Poisson equation is performed using the fast Fourier method on a regular rectangular grid, similarly to particle-in-cell codes. To compute the density projected onto this grid, the intersection of the tessellation and the grid is calculated using the method of Franklin and Kankanhalli [65–67] generalised to linear order. As preliminary tests of the code, we study in four-dimensional phase-space the evolution of an initially small patch in a chaotic potential and the cosmological collapse of a fluctuation composed of two sinusoidal waves. We also perform a “warm” dark matter simulation in six-dimensional phase-space that we use to check the parallel scaling of the code.
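The FFT-based Poisson step on a regular grid, common to this code and particle-in-cell methods, is sketched below for periodic boundaries; the grid size and source are illustrative.

```python
# FFT-based Poisson solve: Laplacian(phi) = rho with periodic boundaries,
# done by dividing by -|k|^2 in Fourier space.
import numpy as np

N, L = 64, 1.0
rng = np.random.default_rng(0)
rho = rng.standard_normal((N, N, N))
rho -= rho.mean()                      # periodic BCs need a zero-mean source

k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)
kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
k2 = kx**2 + ky**2 + kz**2
k2[0, 0, 0] = 1.0                      # avoid 0/0 at the k = 0 mode

phi_hat = -np.fft.fftn(rho) / k2       # solve -k^2 * phi_hat = rho_hat
phi_hat[0, 0, 0] = 0.0                 # pin the arbitrary mean of phi
phi = np.real(np.fft.ifftn(phi_hat))

# Residual check: the spectral Laplacian of phi reproduces rho.
residual = np.real(np.fft.ifftn(-k2 * np.fft.fftn(phi))) - rho
print("max |residual|:", np.abs(residual).max())
```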
NASA Astrophysics Data System (ADS)
Li, J.; Zhang, T.; Huang, Q.; Liu, Q.
2014-12-01
Today's climate datasets are featured with large volume, a high degree of spatiotemporal complexity, and fast evolution over time. As visualizing large-volume distributed climate datasets is computationally intensive, traditional desktop-based visualization applications fail to handle the computational intensity. Recently, scientists have developed remote visualization techniques to address the computational issue. Remote visualization techniques usually leverage server-side parallel computing capabilities to perform visualization tasks and deliver visualization results to clients through the network. In this research, we aim to build a remote parallel visualization platform for visualizing and analyzing massive climate data. Our visualization platform was built based on ParaView, which is one of the most popular open-source remote visualization and analysis applications. To further enhance the scalability and stability of the platform, we have employed cloud computing techniques to support the deployment of the platform. In this platform, all climate datasets are regular grid data which are stored in NetCDF format. Three types of data access methods are supported in the platform: accessing remote datasets provided by OpenDAP servers, accessing datasets hosted on the web visualization server, and accessing local datasets. Despite different data access methods, all visualization tasks are completed at the server side to reduce the workload of clients. As a proof of concept, we have implemented a set of scientific visualization methods to show the feasibility of the platform. Preliminary results indicate that the framework can address the computation limitation of desktop-based visualization applications.
Explicit integration with GPU acceleration for large kinetic networks
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; ...
2015-09-15
In this study, we demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve on the order of 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. In addition, this orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
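The data-parallel layout such a GPU implementation exploits can be illustrated by batching many independent networks along a leading axis; the toy update below (an exponential, QSS-like relaxation with frozen rates) is an assumption for illustration, not the authors' integration scheme.

```python
# Batch many independent kinetic networks over a leading "network" axis,
# the same layout a GPU kernel would use; one array op advances them all.
import numpy as np

BATCH, SPECIES, STEPS, DT = 100, 150, 1000, 1e-6
rng = np.random.default_rng(0)
y = rng.random((BATCH, SPECIES)) + 0.1
k_loss = rng.random((BATCH, SPECIES)) * 1e4 + 1.0   # per-species loss rates
k_gain = rng.random((BATCH, SPECIES)) * 1e3         # per-species sources

for _ in range(STEPS):
    # Exponential update for y' = k_gain - k_loss*y, exact for frozen rates;
    # stable at step sizes where forward Euler would not be.
    y_eq = k_gain / k_loss
    y = y_eq + (y - y_eq) * np.exp(-k_loss * DT)
```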
On the modelling of linear-assisted DC-DC voltage regulators for photovoltaic solar energy systems
NASA Astrophysics Data System (ADS)
Martínez-García, Herminio; García-Vílchez, Encarna
2017-11-01
This paper shows the modelling of linear-assisted or hybrid (linear and switching) DC/DC voltage regulators. In this kind of regulator, an auxiliary linear regulator is used, whose objective is to cancel the ripple at the output voltage and provide fast responses to load variations. On the other hand, a switching DC/DC converter, connected in parallel with the linear regulator, supplies almost the whole output current demanded by the load. The objective of this topology is to take advantage of the good regulation characteristics of series linear voltage regulators while almost achieving the high efficiency that switching DC/DC converters provide. Linear-assisted DC/DC regulators are feedback systems with potential instability. Therefore, their modelling is mandatory in order to obtain design guidelines and assure the stability of the implemented power supply system.
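A behavioural sketch of the control idea follows: the switching stage is steered by comparing the linear stage's output current against a threshold, so the linear stage carries only ripple and fast transients. Component values and the comparator law are invented for illustration, not taken from the paper.

```python
# Behavioural time-domain model of a linear-assisted regulator.
import numpy as np

DT, STEPS = 1e-6, 5000
L, VIN, VOUT = 47e-6, 12.0, 5.0
I_TH = 0.1                      # comparator threshold on the linear current
i_L = 0.0                       # switching-stage inductor current

i_load = np.where(np.arange(STEPS) < 2500, 1.0, 2.0)   # load step halfway
log = np.zeros((STEPS, 2))
for n in range(STEPS):
    i_lin = i_load[n] - i_L     # linear stage supplies whatever is missing
    sw = i_lin > I_TH           # comparator drives the switch
    v_L = (VIN - VOUT) if sw else -VOUT    # inductor voltage in each phase
    i_L = max(i_L + v_L / L * DT, 0.0)     # diode clamp at zero current
    log[n] = i_L, i_lin
# After each transient, i_L carries ~the load and i_lin hovers near I_TH,
# so the linear stage handles only ripple and the load-step transient.
```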
The geospatial data quality REST API for primary biodiversity data.
Otegui, Javier; Guralnick, Robert P
2016-06-01
We present a REST web service to assess the geospatial quality of primary biodiversity data. It enables access to basic and advanced functions to detect completeness and consistency issues, as well as general errors, in the provided record or set of records. The API uses JSON for data interchange and efficient parallelization techniques for fast assessments of large datasets. The Geospatial Data Quality API is part of the VertNet set of APIs. It can be accessed at http://api-geospatial.vertnet-portal.appspot.com/geospatial and is already implemented in the VertNet data portal for quality reporting. Source code is freely available under GPL license from http://www.github.com/vertnet/api-geospatial. Contact: javier.otegui@gmail.com or rguralnick@flmnh.ufl.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
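A call against the stated endpoint might look like the sketch below; the query-parameter names here are assumptions for illustration, so the actual field names should be taken from the VertNet API documentation.

```python
# Sketch of querying the Geospatial Data Quality API; the endpoint URL is
# from the text, but the parameter names below (decimalLatitude, etc.) are
# assumptions; consult the VertNet API docs for the actual contract.
import requests

BASE = "http://api-geospatial.vertnet-portal.appspot.com/geospatial"
record = {
    "decimalLatitude": "42.331",      # assumed parameter names
    "decimalLongitude": "-83.046",
    "countryCode": "US",
}
resp = requests.get(BASE, params=record, timeout=30)
resp.raise_for_status()
report = resp.json()                  # JSON interchange, as described above
print(report)
```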
Recent Experience with a Hybrid SCADA/PMU On-Line State Estimator
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rizy, D Tom
2009-01-01
PMU devices are expected to grow in number from a few to several hundred in the next five years. Some relays are already global-positioning-system-capable and could provide the same type of data as any PMU. This introduces a new paradigm of very fast, accurate synchrophasor measurements from across the grid in real time that augment and parallel existing slower SCADA measurements. Control center applications will benefit from this PMU data; for example, use of PMU data in state estimation is expected to improve accuracy and robustness, which in turn will result in more timely and accurate N-1 security analysis, resulting in an overall improvement of grid system reliability and security. This paper describes results from a recent implementation of this technology, the benefits, and future work.
Distributed Fast Self-Organized Maps for Massive Spectrophotometric Data Analysis.
Dafonte, Carlos; Garabato, Daniel; Álvarez, Marco A; Manteiga, Minia
2018-05-03
Analyzing huge amounts of data becomes essential in the era of Big Data, where databases are populated with hundreds of Gigabytes that must be processed to extract knowledge. Hence, classical algorithms must be adapted towards distributed computing methodologies that leverage the underlying computational power of these platforms. Here, a parallel, scalable, and optimized design for self-organized maps (SOM) is proposed in order to analyze massive data gathered by the spectrophotometric sensor of the European Space Agency (ESA) Gaia spacecraft, although it could be extrapolated to other domains. The performance comparison between the sequential implementation and the distributed ones based on Apache Hadoop and Apache Spark is an important part of the work, as well as the detailed analysis of the proposed optimizations. Finally, a domain-specific visualization tool to explore astronomical SOMs is presented.
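The parallelisable kernel in such designs is the batch SOM update, in which each data chunk contributes mergeable per-unit sums; a minimal single-node version is sketched below with an invented 1D map and random data, not the authors' Hadoop/Spark code.

```python
# Batch SOM update: per-unit numerator/denominator sums that distributed
# implementations compute per chunk and then merge by addition.
import numpy as np

def batch_som_step(data, weights, sigma):
    n_units = weights.shape[0]
    bmus = np.argmin(((data[:, None, :] - weights) ** 2).sum(-1), axis=1)
    grid = np.arange(n_units, dtype=float)
    # Neighbourhood of each sample's BMU over all units (1D map for brevity).
    h = np.exp(-((grid[None, :] - bmus[:, None]) ** 2) / (2 * sigma**2))
    num = h.T @ data                   # per-unit weighted sums (mergeable)
    den = h.sum(axis=0)[:, None]       # per-unit total weights (mergeable)
    return num, den

rng = np.random.default_rng(0)
data = rng.random((10_000, 8))
weights = rng.random((64, 8))
for sigma in np.linspace(8, 1, 10):    # shrinking neighbourhood schedule
    # In a distributed run: map chunks -> (num, den), reduce by summing.
    num, den = batch_som_step(data, weights, sigma)
    weights = num / np.maximum(den, 1e-12)   # guard against starved units
```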
The factorization of large composite numbers on the MPP
NASA Technical Reports Server (NTRS)
Mckurdy, Kathy J.; Wunderlich, Marvin C.
1987-01-01
The continued fraction method for factoring large integers (CFRAC) was an ideal algorithm to be implemented on a massively parallel computer such as the Massively Parallel Processor (MPP). After much effort, the first 60-digit number was factored on the MPP using about 6 1/2 hours of array time. Although this result added about 10 digits to the size of number that could be factored using CFRAC on a serial machine, it was already badly beaten by the implementation of Davis and Holdridge on the CRAY-1 using the quadratic sieve, an algorithm which is clearly superior to CFRAC for large numbers. An algorithm is illustrated which is ideally suited to the single instruction multiple data (SIMD) massively parallel architecture, and some of the modifications which were needed in order to make the parallel implementation effective and efficient are described.
A parallel implementation of a multisensor feature-based range-estimation method
NASA Technical Reports Server (NTRS)
Suorsa, Raymond E.; Sridhar, Banavar
1993-01-01
There are many proposed vision-based methods to perform obstacle detection and avoidance for autonomous or semi-autonomous vehicles. All methods, however, require very high processing rates to achieve real-time performance. A system capable of supporting autonomous helicopter navigation will need to extract obstacle information from imagery at rates varying from ten frames per second to thirty or more frames per second, depending on the vehicle speed. Such a system will need to sustain billions of operations per second. To reach such high processing rates using current technology, a parallel implementation of the obstacle detection/ranging method is required. This paper describes an efficient and flexible parallel implementation of a multisensor feature-based range-estimation algorithm, targeted for helicopter flight, realized on both a distributed-memory and shared-memory parallel computer.
Development of a GPU Compatible Version of the Fast Radiation Code RRTMG
NASA Astrophysics Data System (ADS)
Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.
2012-12-01
The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement through GPU technology. This large number of independent cases will allow us to take full advantage of the computational power of the latest GPUs, ensuring that all thread cores in the GPU remain active, a key criterion for obtaining significant speedup. The CUDA (Compute Unified Device Architecture) Fortran compiler developed by PGI and Nvidia will allow us to construct this parallel implementation on the GPU while remaining in the Fortran language. This implementation will scale very well across various CUDA-supported GPUs such as the recently released Fermi Nvidia cards. We will present the computational speed improvements of the GPU-compatible code relative to the standard CPU-based RRTMG with respect to a very large and diverse suite of atmospheric profiles. This suite will also be utilized to demonstrate the minimal impact of the code restructuring on the accuracy of radiation calculations. The GPU-compatible version of RRTMG will be directly applicable to future versions of GEOS-5, but it is also likely to provide significant associated benefits for other GCMs that employ RRTMG.
Architecture studies and system demonstrations for optical parallel processor for AI and NI
NASA Astrophysics Data System (ADS)
Lee, Sing H.
1988-03-01
In solving deterministic AI problems, the data search for matching the arguments of a PROLOG expression causes a serious bottleneck when implemented sequentially on electronic systems. To overcome this bottleneck we have developed the concepts for an optical expert system based on a matrix-algebraic formulation, which will be suitable for parallel optical implementation. The optical AI system based on the matrix-algebraic formulation will offer distinct advantages for parallel search, adult learning, etc.
Crosetto, D.B.
1996-12-31
The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.
Crosetto, Dario B.
1996-01-01
The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.
Carrera, Mónica; Gallardo, José M; Pascual, Santiago; González, Ángel F; Medina, Isabel
2016-06-16
Anisakids are fish-borne parasites that are responsible for a large number of human infections and allergic reactions around the world. World health organizations and food safety authorities aim to control and prevent this emerging health problem. In the present work, a new method for the fast monitoring of these parasites is described. The strategy is divided into three steps: (i) purification of thermostable proteins from fish-borne parasites (Anisakids), (ii) in-solution HIFU trypsin digestion and (iii) monitoring of several peptide markers by parallel reaction monitoring (PRM) mass spectrometry. This methodology allows the fast detection of Anisakids in less than 2 h. An affordable assay utilizing this methodology will facilitate testing for regulatory and safety applications. The work describes, for the first time, protein biomarker discovery and fast monitoring for the identification and detection of Anisakids in fishery products. The strategy is based on the purification of thermostable proteins, the use of accelerated in-solution trypsin digestions under an ultrasonic field provided by high-intensity focused ultrasound (HIFU) and the monitoring of several peptide biomarkers by parallel reaction monitoring (PRM) mass spectrometry in a linear ion trap mass spectrometer. The workflow allows the unequivocal detection of Anisakids in less than 2 h. The present strategy constitutes the fastest method for Anisakids detection; its application in the food quality control area could provide the authorities with an effective and rapid method to guarantee safety to consumers. Copyright © 2016 Elsevier B.V. All rights reserved.
Xyce Parallel Electronic Simulator Users' Guide Version 6.8
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors), including support for most popular parallel and serial computers; a differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms and allows one to develop new types of analysis without requiring the implementation of analysis-specific device models; device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase, a message-passing parallel implementation, which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Genetic Parallel Programming: design and implementation.
Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong
2006-01-01
This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.
Capabilities of Fully Parallelized MHD Stability Code MARS
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2016-10-01
Results of full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. A parallel version of MARS, named PMARS, has recently been developed at FAR-TECH. Parallelized MARS is an efficient tool for simulation of MHD instabilities with low, intermediate, and high toroidal mode numbers within both the fluid and kinetic plasma models implemented in MARS. Parallelization of the code included parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse vector iteration algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is achieved by repeating the steps of the MARS algorithm using parallel libraries and procedures. Parallelized MARS is capable of calculating eigenmodes with significantly increased spatial resolution: up to 5,000 adapted radial grid points with up to 500 poloidal harmonics. Such resolution is sufficient for simulation of kink, tearing, and peeling-ballooning instabilities with physically relevant parameters. Work is supported by the U.S. DOE SBIR program.
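For context, the serial kernel that such a parallelization distributes is shift-and-invert inverse iteration. A minimal NumPy sketch follows (not MARS code; the dense inverse stands in for the distributed factorization/solve, and the test matrix is invented).

```python
# Sketch of shift-and-invert inverse iteration (the eigenvalue kernel that a
# code like PMARS parallelizes with distributed solves); not MARS code.
# Finds the eigenpair of A closest to the shift sigma.
import numpy as np

def inverse_iteration(A, sigma, tol=1e-10, max_iter=200):
    n = A.shape[0]
    # Factor (A - sigma I) once; each iteration is then one cheap solve.
    M = np.linalg.inv(A - sigma * np.eye(n))   # stand-in for an LU factorization
    v = np.random.default_rng(0).standard_normal(n)
    v /= np.linalg.norm(v)
    lam = sigma
    for _ in range(max_iter):
        w = M @ v                              # the step distributed in parallel
        v_new = w / np.linalg.norm(w)
        lam_new = v_new @ A @ v_new            # Rayleigh quotient estimate
        if abs(lam_new - lam) < tol:
            return lam_new, v_new
        lam, v = lam_new, v_new
    return lam, v

A = np.diag([1.0, 3.0, 7.0]) + 0.01            # small symmetric test matrix
print(inverse_iteration(A, sigma=2.5)[0])      # converges to eigenvalue near 3
```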
Fully Parallel MHD Stability Analysis Tool
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2015-11-01
Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate, and high toroidal mode numbers within both the fluid and kinetic plasma models already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iteration algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is achieved by repeating the steps of the present MARS algorithm using parallel libraries and procedures. Results of the MARS parallelization and of the development of a new fixed-boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.
A GPU-paralleled implementation of an enhanced face recognition algorithm
NASA Astrophysics Data System (ADS)
Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo
2013-03-01
Face recognition based on compressed sensing and sparse representation has attracted considerable attention in recent years. This scheme improves the recognition rate as well as robustness to noise, but its computational cost is high and has become a main restricting factor for real-world applications. In this paper, we introduce a GPU-accelerated hybrid variant of the algorithm, named parallel face recognition algorithm (pFRA), and describe how to carry out the parallel optimization design to take full advantage of the many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample sizes. Our pFRA, implemented on an NVIDIA GPU with the Compute Unified Device Architecture (CUDA) programming model, achieves a significant speedup over traditional CPU implementations.
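For orientation, a simplified sketch of the residual-based classification step in sparse-representation face recognition is shown below (not the pFRA code; a ridge least-squares solve stands in for the l1 sparse solver, and the data are synthetic). The per-class residual computations are the embarrassingly parallel part that maps naturally onto a GPU.

```python
# Simplified residual-based classifier in the spirit of sparse-representation
# face recognition (not pFRA): code the test vector over the training
# dictionary, then assign the class whose own columns best reconstruct it.
# NOTE: ridge least squares is used here as a stand-in for the l1 solver.
import numpy as np

def classify(D, labels, y, lam=1e-3):
    # D: (d, n) training faces as columns; labels: length-n class ids; y: (d,)
    n = D.shape[1]
    x = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)  # coding vector
    residuals = {}
    for c in set(labels):                       # independent per class: the
        mask = np.array([l == c for l in labels])  # GPU-parallel part
        residuals[c] = np.linalg.norm(y - D[:, mask] @ x[mask])
    return min(residuals, key=residuals.get)

rng = np.random.default_rng(1)
D = rng.standard_normal((50, 20))
labels = [i // 5 for i in range(20)]            # 4 classes, 5 samples each
y = D[:, 3] + 0.05 * rng.standard_normal(50)    # noisy copy of a class-0 face
print(classify(D, labels, y))                   # expected output: 0
```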
Fast semivariogram computation using FPGA architectures
NASA Astrophysics Data System (ADS)
Lagadapati, Yamuna; Shirvaikar, Mukul; Dong, Xuanliang
2015-02-01
The semivariogram is a statistical measure of the spatial distribution of data and is based on Markov random fields (MRFs). Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing; recently, applications in medical imaging have been investigated, creating the need for efficient real-time implementations. The semivariogram is a plot of semivariances for different lag distances between pixels. A semivariance, γ(h), is defined as half the expected squared difference of pixel values between any two data locations separated by a lag distance h. Because each pair of pixels in the image or sub-image being processed must be examined, the base algorithm complexity for an image window with n pixels is O(n²). Field-programmable gate arrays (FPGAs) are an attractive solution for such demanding applications due to their parallel processing capability. FPGAs operate at relatively modest clock rates, measured in a few hundreds of megahertz, but they can perform tens of thousands of calculations per clock cycle while operating at low power. This paper presents a technique for fast computation of the semivariogram using two custom FPGA architectures. The design consists of several modules dedicated to the constituent computational tasks; a modular architecture is chosen to allow replication of processing units, which gives high throughput through concurrent processing of pixel pairs. The current implementation is restricted to isotropic semivariogram computation; anisotropic computation is anticipated as an extension of the current architecture, based on refinements to the current modules. The algorithm is implemented in VHDL and benchmarked on a Xilinx XUPV5-LX110T Development Kit, which uses the Virtex-5 FPGA. Medical image data from MRI scans are used for the experiments, and computational speedup is measured with respect to a MATLAB implementation on a personal computer with an Intel i7 multi-core processor. Preliminary simulation results indicate that a significant speed advantage can be attained by the architectures, making the algorithm viable for implementation in medical devices.
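A reference sketch of the O(n²) computation the FPGA pipelines accelerate may help make the cost concrete (a software stand-in, not the VHDL design; lag binning by rounding the Euclidean pixel distance is an assumption of this sketch).

```python
# Reference isotropic semivariogram: gamma(h) = mean((z_i - z_j)^2) / 2 over
# all pixel pairs whose separation rounds to lag h. O(n^2) in window pixels --
# exactly the pair enumeration the FPGA replicates across processing units.
import numpy as np

def semivariogram(img, max_lag):
    rows, cols = img.shape
    ys, xs = np.mgrid[0:rows, 0:cols]
    pts = np.column_stack([ys.ravel(), xs.ravel()])
    z = img.ravel().astype(float)
    sums = np.zeros(max_lag + 1)
    counts = np.zeros(max_lag + 1)
    for i in range(len(z)):                      # each pixel pair once: O(n^2)
        d = np.rint(np.hypot(*(pts[i + 1:] - pts[i]).T)).astype(int)
        sq = (z[i + 1:] - z[i]) ** 2
        m = d <= max_lag
        np.add.at(sums, d[m], sq[m])
        np.add.at(counts, d[m], 1)
    with np.errstate(invalid="ignore"):          # gamma(0) has no pairs -> nan
        return sums / (2 * counts)               # gamma(h), h = 0..max_lag

img = np.random.default_rng(0).integers(0, 256, (32, 32))
print(semivariogram(img, max_lag=10))
```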
A Generic and Efficient E-field Parallel Imaging Correlator for Next-Generation Radio Telescopes
NASA Astrophysics Data System (ADS)
Thyagarajan, Nithyanandan; Beardsley, Adam P.; Bowman, Judd D.; Morales, Miguel F.
2017-05-01
Modern radio telescopes are favouring densely packed array layouts with large numbers of antennas (NA ≳ 1000). Since the complexity of traditional correlators scales as O(N_A^2), there will be a steep cost for realizing the full imaging potential of these powerful instruments. Through our generic and efficient E-field Parallel Imaging Correlator (epic), we present the first software demonstration of a generalized direct imaging algorithm, namely the Modular Optimal Frequency Fourier imager. Not only does it bring down the cost for dense layouts to O(N_A log _2N_A) but can also image from irregular layouts and heterogeneous arrays of antennas. epic is highly modular, parallelizable, implemented in object-oriented python, and publicly available. We have verified the images produced to be equivalent to those from traditional techniques to within a precision set by gridding coarseness. We have also validated our implementation on data observed with the Long Wavelength Array (LWA1). We provide a detailed framework for imaging with heterogeneous arrays and show that epic robustly estimates the input sky model for such arrays. Antenna layouts with dense filling factors consisting of a large number of antennas such as LWA, the Square Kilometre Array, Hydrogen Epoch of Reionization Array, and Canadian Hydrogen Intensity Mapping Experiment will gain significant computational advantage by deploying an optimized version of epic. The algorithm is a strong candidate for instruments targeting transient searches of fast radio bursts as well as planetary and exoplanetary phenomena due to the availability of high-speed calibrated time-domain images and low output bandwidth relative to visibility-based systems.
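The complexity advantage comes from imaging the E-field directly instead of correlating every antenna pair. The toy below sketches that idea in one dimension (this is an illustration of the principle, not the EPIC package; the array geometry, source position, and noise level are all invented).

```python
# Toy 1-D direct-imaging sketch (not the EPIC code): grid the antenna E-field
# samples, FFT once per time step, accumulate |image|^2. That is O(N_A log N_A)
# per step versus O(N_A^2) for correlating all antenna pairs.
import numpy as np

rng = np.random.default_rng(0)
n_ant, n_grid, n_steps = 64, 256, 500
positions = rng.integers(0, n_grid, n_ant)     # antennas on a regular 1-D grid

def sky_signal():
    # one point source in direction bin k0, plus per-antenna receiver noise
    k0 = 40
    phase = np.exp(2j * np.pi * k0 * positions / n_grid)
    return phase * np.exp(2j * np.pi * rng.random()) + 0.3 * (
        rng.standard_normal(n_ant) + 1j * rng.standard_normal(n_ant))

image = np.zeros(n_grid)
for _ in range(n_steps):
    grid = np.zeros(n_grid, complex)
    np.add.at(grid, positions, sky_signal())   # gridding (nearest cell here)
    image += np.abs(np.fft.fft(grid)) ** 2     # FFT the E-field, then square
image /= n_steps
print("brightest direction bin:", int(image.argmax()))  # expect 40
```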
NASA Astrophysics Data System (ADS)
Rodriguez, M.; Brualla, L.
2018-04-01
Monte Carlo simulation of radiation transport is computationally demanding when reasonably low statistical uncertainties of the estimated quantities are required, and it can therefore benefit to a large extent from high-performance computing. This work assesses the performance of the first generation of the many-integrated-core (MIC) architecture Xeon Phi coprocessor with respect to that of a CPU consisting of a double 12-core Xeon processor in Monte Carlo simulation of coupled electron-photon showers. The comparison was made twofold: first, through a suite of basic tests, including parallel versions of the random number generators Mersenne Twister and a modified implementation of RANECU, intended to establish a baseline comparison between the two devices; second, through the pDPM code developed in this work. pDPM is a parallel version of the Dose Planning Method (DPM) program for fast Monte Carlo simulation of radiation transport in voxelized geometries. A variety of techniques aimed at obtaining large scalability on the Xeon Phi were implemented in pDPM. Maximum scalabilities of 84.2× and 107.5× were obtained on the Xeon Phi for simulations of electron and photon beams, respectively. Nevertheless, in none of the tests involving radiation transport did the Xeon Phi perform better than the CPU. The disadvantage of the Xeon Phi with respect to the CPU is due to the low performance of its single core: a single core of the Xeon Phi was more than 10 times less efficient than a single core of the CPU for all radiation transport simulations.
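The parallel random-number-generator tests mentioned above hinge on giving each core an independent, non-overlapping stream. A minimal NumPy stand-in for that pattern is sketched here (not the authors' MIC code; the "transport" is reduced to free-path sampling for brevity).

```python
# Sketch of per-worker independent RNG streams, the property such baseline
# tests exercise (NumPy stand-in, not the Xeon Phi code): SeedSequence.spawn
# yields statistically independent child streams, so workers never share
# generator state.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def track_batch(seed_seq, n):
    rng = np.random.Generator(np.random.MT19937(seed_seq))  # Mersenne Twister
    # toy "transport": sample n exponential free paths, mean should be ~1
    return float(np.mean(-np.log1p(-rng.random(n))))

if __name__ == "__main__":
    root = np.random.SeedSequence(12345)
    children = root.spawn(8)                   # one independent stream per worker
    with ProcessPoolExecutor(max_workers=8) as ex:
        means = list(ex.map(track_batch, children, [100_000] * 8))
    print(np.round(means, 4))                  # eight values, each near 1.0
```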
Numerical Modelling of Foundation Slabs with use of Schur Complement Method
NASA Astrophysics Data System (ADS)
Koktan, Jiří; Brožovský, Jiří
2017-10-01
The paper discusses numerical modelling of foundation slabs with advanced numerical approaches suitable for parallel processing. The solution is based on the finite element method with slab-type elements. The subsoil is modelled with a Winkler-type contact model (as an alternative, a multi-parameter model can be used). The proposed approach uses the Schur complement method to speed up the computations: the analyzed model is divided into several substructures. This adds some complexity to the numerical procedures, especially when subsoil models are used inside the finite element solution; on the other hand, it makes fast solution of large models possible. The main aim of this paper is therefore to verify that the method can be successfully used for this type of problem. The most suitable finite elements are discussed, along with the finite element mesh and the limitations on its construction for such problems. The core approaches of the implementation of the Schur complement method for this type of problem are also presented. The proposed approach was implemented in the form of a computer program, which is briefly introduced, together with results of example computations. These results demonstrate an important speed-up of the solution even in the case of non-parallel processing, as well as the ability to bypass size limitations of numerical models with the discussed approach.
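A minimal sketch of the Schur complement (static condensation) step the paper builds on is given below (not the authors' program; the 1-D Laplacian stands in for a slab stiffness matrix). Eliminating the interior DOFs of each substructure is the independent, parallelizable work; only the smaller interface system is solved globally.

```python
# Schur complement / static condensation sketch: eliminate interior DOFs,
# solve the interface system S u_b = g, then back-substitute the interiors.
import numpy as np

def schur_solve(K, f, interior, boundary):
    Kii = K[np.ix_(interior, interior)]
    Kib = K[np.ix_(interior, boundary)]
    Kbi = K[np.ix_(boundary, interior)]
    Kbb = K[np.ix_(boundary, boundary)]
    Kii_inv_Kib = np.linalg.solve(Kii, Kib)        # per-substructure (parallel)
    Kii_inv_fi = np.linalg.solve(Kii, f[interior])
    S = Kbb - Kbi @ Kii_inv_Kib                    # Schur complement
    g = f[boundary] - Kbi @ Kii_inv_fi
    ub = np.linalg.solve(S, g)                     # small interface solve
    ui = Kii_inv_fi - Kii_inv_Kib @ ub             # back-substitute interiors
    u = np.empty(len(f))
    u[interior], u[boundary] = ui, ub
    return u

# 1-D Laplacian stand-in for a slab stiffness matrix; node 2 is the interface,
# so the two interior blocks {0,1} and {3,4} decouple like two substructures.
K = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
f = np.ones(5)
u = schur_solve(K, f, interior=[0, 1, 3, 4], boundary=[2])
print(np.allclose(K @ u, f))                       # True: matches direct solve
```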
Petascale turbulence simulation using a highly parallel fast multipole method on GPUs
NASA Astrophysics Data System (ADS)
Yokota, Rio; Barba, L. A.; Narumi, Tetsu; Yasuoka, Kenji
2013-03-01
This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on GPU hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096³ computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as the numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the FMM-based vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per MPI process, 3 GPUs per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of MPI processes (using only CPU cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 s for the vortex method and 154 s for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex-method calculations to date.
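For reference, the quantity the FMM accelerates is the all-pairs induced-velocity sum. The sketch below shows the direct O(N²) evaluation for 2-D point vortices (an illustration, not the paper's 3-D FMM code; the smoothing radius delta is an assumption of the sketch).

```python
# Direct O(N^2) Biot-Savart evaluation for 2-D vortex particles -- the
# all-pairs sum that a fast multipole method reduces to roughly O(N).
import numpy as np

def induced_velocity(pos, gamma, delta=1e-2):
    # pos: (n, 2) particle positions; gamma: (n,) circulations
    dx = pos[:, None, :] - pos[None, :, :]          # pairwise separations
    r2 = (dx ** 2).sum(-1) + delta ** 2             # smoothed distance^2
    k = gamma[None, :] / (2 * np.pi * r2)
    np.fill_diagonal(k, 0.0)                        # no self-induction
    u = (k * -dx[:, :, 1]).sum(1)                   # u = -G * dy / (2 pi r^2)
    v = (k * dx[:, :, 0]).sum(1)                    # v =  G * dx / (2 pi r^2)
    return np.column_stack([u, v])

rng = np.random.default_rng(0)
pos = rng.random((2000, 2))
gamma = rng.standard_normal(2000) / 2000
print(induced_velocity(pos, gamma)[:3])             # velocities of 3 particles
```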
Implementation of the NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Frumkin, Michael A.; Schultz, Matthew; Jin, Haoqiang; Yan, Jerry; Biegel, Bryan (Technical Monitor)
2002-01-01
Several features make Java an attractive choice for High Performance Computing (HPC). In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS (NASA Advanced Supercomputing) Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would position Java closer to Fortran in the competition for CFD applications.
performance on a low cost, low size, weight, and power (SWAP) computer: a Raspberry Pi Model B. For a comparison of performance, a baseline implementation ... improvement factor of 2-3 compared to filtered backprojection. Execution on a single Raspberry Pi is too slow for real-time imaging. However, factorized ... backprojection is easily parallelized, and we include a discussion of parallel implementation across multiple Pis.
Parallel Computing Using Web Servers and "Servlets".
ERIC Educational Resources Information Center
Lo, Alfred; Bloor, Chris; Choi, Y. K.
2000-01-01
Describes parallel computing and presents inexpensive ways to implement a virtual parallel computer with multiple Web servers. Highlights include performance measurement of parallel systems; models for using Java and intranet technology including single server, multiple clients and multiple servers, single client; and a comparison of CGI (common…
OPAL: An Open-Source MPI-IO Library over Cray XT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yu, Weikuan; Vetter, Jeffrey S; Canon, Richard Shane
Parallel IO over Cray XT is supported by a vendor-supplied MPI-IO package. This package contains a proprietary ADIO implementation built on top of the sysio library. While it is reasonable to maintain a stable code base for application scientists' convenience, it is also very important for system developers and researchers to be able to analyze and assess the effectiveness of parallel IO software and, accordingly, tune and optimize the MPI-IO implementation. A proprietary parallel IO code base precludes such flexibility. On the other hand, a generic UFS-based MPI-IO implementation is typically used on many Linux-based platforms. We have developed an open-source MPI-IO package over Lustre, referred to as OPAL (OPportunistic and Adaptive MPI-IO Library over Lustre). OPAL provides a single source-code base for MPI-IO over Lustre on Cray XT and Linux platforms. Compared to the Cray implementation, OPAL provides a number of useful features, including arbitrary specification of striping patterns and Lustre-stripe-aligned file domain partitioning. This paper presents performance comparisons between OPAL and Cray's proprietary implementation. Our evaluation demonstrates that OPAL achieves performance comparable to the Cray implementation. We also exemplify the benefits of an open-source package in revealing the underpinnings of parallel IO performance.
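The kind of striping control described above can be sketched against standard MPI-IO via mpi4py (this is not OPAL's API; the file name and hint values are invented): Lustre striping is requested through MPI Info hints, and each rank writes its block with a collective call.

```python
# Standard MPI-IO sketch with Lustre striping hints (mpi4py, not OPAL itself).
# Run under mpiexec, e.g.: mpiexec -n 4 python this_script.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

info = MPI.Info.Create()
info.Set("striping_factor", "4")      # stripe over 4 OSTs (hint; FS may ignore)
info.Set("striping_unit", "1048576")  # 1 MiB stripe size

buf = np.full(1 << 20, rank, dtype=np.uint8)            # 1 MiB per rank
fh = MPI.File.Open(comm, "out.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY, info)
fh.Write_at_all(rank * buf.nbytes, buf)                 # collective write; file
fh.Close()                                              # domains stripe-aligned
if rank == 0:                                           # by construction here
    print(f"wrote {size} MiB to out.dat")
```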
Fast disk array for image storage
NASA Astrophysics Data System (ADS)
Feng, Dan; Zhu, Zhichun; Jin, Hai; Zhang, Jiangling
1997-01-01
A fast disk array is designed for large-scale continuous image storage. It includes a high-speed data path architecture and the technology of data striping and organization on the disk array. The high-speed data path, constructed from two dual-port RAMs and some control circuitry, is configured to transfer data between a host system and a plurality of disk drives; its bandwidth can exceed 100 MB/s if the data path is based on PCI (Peripheral Component Interconnect). The organization of data stored on the disk array is similar to RAID 4: data are striped across a plurality of disks, with each striping unit equal to a track, and I/O instructions are performed in parallel on the disk drives. An independent disk is used to store the parity information. By placing the parity generation circuit directly on the SCSI (or SCSI-2) bus, the parity information can be generated on the fly, with little effect on the parallel data writes to the other disks. The fast disk array architecture designed in this paper can meet the demands of image storage.
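The RAID-4-style parity scheme described above reduces to XOR over the stripe units; the toy below shows both parity generation and single-disk recovery (a software stand-in for the parity circuit on the SCSI bus, with invented sizes).

```python
# RAID-4-style parity in software: the parity unit is the XOR of the data
# stripe units, and any single lost unit is recoverable by XOR of the rest.
import numpy as np

def xor_units(units):
    parity = np.zeros_like(units[0])
    for unit in units:                 # the hardware computes this "on the fly"
        parity ^= unit                 # as the data stream past on the bus
    return parity

rng = np.random.default_rng(0)
units = [rng.integers(0, 256, 4096, dtype=np.uint8) for _ in range(4)]
parity = xor_units(units)              # written to the dedicated parity disk

lost = 2                               # pretend disk 2 fails
survivors = [u for i, u in enumerate(units) if i != lost] + [parity]
rebuilt = xor_units(survivors)         # XOR of survivors restores the unit
print(np.array_equal(rebuilt, units[lost]))   # True
```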
Fast MPEG-CDVS Encoder With GPU-CPU Hybrid Computing
NASA Astrophysics Data System (ADS)
Duan, Ling-Yu; Sun, Wei; Zhang, Xinfeng; Wang, Shiqi; Chen, Jie; Yin, Jianxiong; See, Simon; Huang, Tiejun; Kot, Alex C.; Gao, Wen
2018-05-01
The Compact Descriptors for Visual Search (CDVS) standard from the ISO/IEC Moving Picture Experts Group (MPEG) has succeeded in enabling interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of the CDVS encoder unfortunately hinders its wide deployment in industry for large-scale visual search. In this paper, we revisit the merits of the low-complexity design of the CDVS core techniques and present a very fast CDVS encoder by leveraging the massively parallel execution resources of the GPU. We shift the computation-intensive and parallel-friendly modules to state-of-the-art GPU platforms, on which thread block allocation and memory access are jointly optimized to eliminate performance loss, while operations with heavy data dependence are allocated to the CPU to spare the GPU unnecessary computational burden. Furthermore, we demonstrate that the proposed fast CDVS encoder works well with convolutional neural network approaches, which likewise leverage the advantages of GPU platforms, yielding significant performance improvements. Comprehensive experimental results over benchmarks show that the fast CDVS encoder using GPU-CPU hybrid computing is promising for scalable visual search.