parallel fft algorithm: Topics by Science.gov

Sample records for parallel fft algorithm

Efficient implementation of parallel three-dimensional FFT on clusters of PCs

NASA Astrophysics Data System (ADS)

Takahashi, Daisuke

2003-05-01

In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of PCs. The three-dimensional FFT algorithm can be altered into a block three-dimensional FFT algorithm to reduce the number of cache misses. We show that the block three-dimensional FFT algorithm improves performance by utilizing the cache memory effectively. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT algorithm. We succeeded in obtaining performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz PC SMP cluster.
A high performance parallel algorithm for 1-D FFT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Agarwal, R.C.; Gustavson, F.G.; Zubair, M.

1994-12-31

In this paper the authors propose a parallel high performance FFT algorithm based on a multi-dimensional formulation. They use this to solve a commonly encountered FFT based kernel on a distributed memory parallel machine, the IBM scalable parallel system, SP1. The kernel requires a forward FFT computation of an input sequence, multiplication of the transformed data by a coefficient array, and finally an inverse FFT computation of the resultant data. They show that the multi-dimensional formulation helps in reducing the communication costs and also improves the single node performance by effectively utilizing the memory system of the node. They implemented this kernel on the IBM SP1 and observed a performance of 1.25 GFLOPS on a 64-node machine.
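
The kernel described in this abstract (forward FFT, multiplication by a coefficient array, inverse FFT) can be sketched on a single node as follows. This is a minimal illustration under stated assumptions, not the authors' distributed implementation: the names `fft`, `fft_kernel`, and `circular_convolution` and the sample data are hypothetical, and a small recursive radix-2 FFT stands in for a tuned library routine.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_kernel(x, coeff):
    """Forward FFT, pointwise multiply by coeff, inverse FFT (scaled by 1/n)."""
    n = len(x)
    spectrum = fft(x)
    product = [s * c for s, c in zip(spectrum, coeff)]
    return [v / n for v in fft(product, inverse=True)]

def circular_convolution(x, h):
    """Direct O(n^2) circular convolution, used only as a reference."""
    n = len(x)
    return [sum(x[j] * h[(k - j) % n] for j in range(n)) for k in range(n)]

# Assumed sample data: if coeff is the FFT of a filter h, the kernel
# computes the circular convolution of x with h.
x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
h = [1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y = fft_kernel(x, fft(h))
print([round(v.real, 6) for v in y])
```

When the coefficient array is the transform of a filter, the kernel is a circular convolution, which the direct sum in `circular_convolution` can be used to confirm.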
Fast Fourier Transform algorithm design and tradeoffs

NASA Technical Reports Server (NTRS)

Kamin, Ray A., III; Adams, George B., III

1988-01-01

The Fast Fourier Transform (FFT) is a mainstay of certain numerical techniques for solving fluid dynamics problems. The Connection Machine CM-2 is the target for an investigation into the design of multidimensional Single Instruction Stream/Multiple Data (SIMD) parallel FFT algorithms for high performance. Critical algorithm design issues are discussed, necessary machine performance measurements are identified and made, and the performance of the developed FFT programs is measured. Fast Fourier Transform programs are compared to the currently best Cray-2 FFT program.
Parallel computing of a digital hologram and particle searching for microdigital-holographic particle-tracking velocimetry

DOE Office of Scientific and Technical Information (OSTI.GOV)

Satake, Shin-ichi; Kanamori, Hiroyuki; Kunugi, Tomoaki

2007-02-01

We have developed a parallel algorithm for microdigital-holographic particle-tracking velocimetry. The algorithm is used in (1) numerical reconstruction of a particle image computer using a digital hologram, and (2) searching for particles. The numerical reconstruction from the digital hologram makes use of the Fresnel diffraction equation and the FFT (fast Fourier transform), whereas the particle search algorithm looks for local maximum graduation in a reconstruction field represented by a 3D matrix. To achieve high performance computing for both calculations (reconstruction and particle search), two memory partitions are allocated to the 3D matrix. In this matrix, the reconstruction part consists of horizontally placed 2D memory partitions on the x-y plane for the FFT, whereas the particle search part consists of vertically placed 2D memory partitions set along the z axes. Consequently, the scalability can be obtained for the proportion of processor elements, where the benchmarks are carried out for parallel computation by a SGI Altix machine.
Petascale turbulence simulation using a highly parallel fast multipole method on GPUs

NASA Astrophysics Data System (ADS)

Yokota, Rio; Barba, L. A.; Narumi, Tetsu; Yasuoka, Kenji

2013-03-01

This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on GPU hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096^3 computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as the numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the FMM-based vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per MPI process, 3 GPUs per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of MPI processes (using only CPU cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 s for the vortex method and 154 s for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex-method calculations to date.
Ordered fast Fourier transforms on a massively parallel hypercube multiprocessor

NASA Technical Reports Server (NTRS)

Tong, Charles; Swarztrauber, Paul N.

1989-01-01

Design alternatives for ordered Fast Fourier Transformation (FFT) algorithms were examined on massively parallel hypercube multiprocessors such as the Connection Machine. Particular emphasis is placed on reducing communication which is known to dominate the overall computing time. To this end, the order and computational phases of the FFT were combined, and the sequence to processor maps that reduce communication were used. The class of ordered transforms is expanded to include any FFT in which the order of the transform is the same as that of the input sequence. Two such orderings are examined, namely, standard-order and A-order which can be implemented with equal ease on the Connection Machine where orderings are determined by geometries and priorities. If the sequence has N = 2^r elements and the hypercube has P = 2^d processors, then a standard-order FFT can be implemented with d + r/2 + 1 parallel transmissions. An A-order sequence can be transformed with 2d - r/2 parallel transmissions which is r - d + 1 fewer than the standard order. A parallel method for computing the trigonometric coefficients is presented that does not use trigonometric functions or interprocessor communication. A performance of 0.9 GFLOPS was obtained for an A-order transform on the Connection Machine.
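
The transmission counts quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only evaluates the two formulas from the abstract; the function names and the sample values r = 20, d = 16 are assumptions chosen for illustration and do not come from the report. It also assumes r is even so that r/2 is an integer.

```python
def standard_order_transmissions(r, d):
    """Parallel transmissions for a standard-order FFT: d + r/2 + 1 (r even)."""
    return d + r // 2 + 1

def a_order_transmissions(r, d):
    """Parallel transmissions for an A-order FFT: 2d - r/2 (r even)."""
    return 2 * d - r // 2

# Assumed example: 2**20 elements on 2**16 processors.
r, d = 20, 16
standard = standard_order_transmissions(r, d)
a_order = a_order_transmissions(r, d)
saving = standard - a_order
print(standard, a_order, saving)  # 27 22 5
```

For these values the saving is 5 transmissions, which agrees with the r - d + 1 stated in the abstract.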
Ordered fast Fourier transforms on a massively parallel hypercube multiprocessor

NASA Technical Reports Server (NTRS)

Tong, Charles; Swarztrauber, Paul N.

1991-01-01

The present evaluation of alternative, massively parallel hypercube processor-applicable designs for ordered radix-2 decimation-in-frequency FFT algorithms gives attention to the reduction of computation time-dominating communication. A combination of the order and computational phases of the FFT is accordingly employed, in conjunction with sequence-to-processor maps which reduce communication. Two orderings, 'standard' and 'cyclic', in which the order of the transform is the same as that of the input sequence, can be implemented with ease on the Connection Machine (where orderings are determined by geometries and priorities). A parallel method for trigonometric coefficient computation is presented which does not employ trigonometric functions or interprocessor communication.
Real-time processing of radar return on a parallel computer

NASA Technical Reports Server (NTRS)

Aalfs, David D.

1992-01-01

NASA is working with the FAA to demonstrate the feasibility of pulse Doppler radar as a candidate airborne sensor to detect low altitude windshears. The need to provide the pilot with timely information about possible hazards has motivated a demand for real-time processing of a radar return. Investigated here is parallel processing as a means of accommodating the high data rates required. A PC based parallel computer, called the transputer, is used to investigate issues in real time concurrent processing of radar signals. A transputer network is made up of an array of single instruction stream processors that can be networked in a variety of ways. They are easily reconfigured and software development is largely independent of the particular network topology. The performance of the transputer is evaluated in light of the computational requirements. A number of algorithms have been implemented on the transputers in OCCAM, a language specially designed for parallel processing. These include signal processing algorithms such as the Fast Fourier Transform (FFT), pulse-pair, and autoregressive modelling, as well as routing software to support concurrency. The most computationally intensive task is estimating the spectrum. Two approaches have been taken on this problem, the first and most conventional of which is to use the FFT. By using table look-ups for the basis function and other optimizing techniques, an algorithm has been developed that is sufficient for real time. The other approach is to model the signal as an autoregressive process and estimate the spectrum based on the model coefficients. This technique is attractive because it does not suffer from the spectral leakage problem inherent in the FFT. Benchmark tests indicate that autoregressive modeling is feasible in real time.
Proceedings: Sisal '93

DOE Office of Scientific and Technical Information (OSTI.GOV)

Feo, J.T.

1993-10-01

This report contains papers on: Programmability and performance issues; The case of an iterative partial differential equation solver; Implementing the kernel of the Australian Region Weather Prediction Model in Sisal; Even and quarter-even prime length symmetric FFTs and their Sisal Implementations; Top-down thread generation for Sisal; Overlapping communications and computations on NUMA architectures; Compiling technique based on dataflow analysis for functional programming language Valid; Copy elimination for true multidimensional arrays in Sisal 2.0; Increasing parallelism for an optimization that reduces copying in IF2 graphs; Caching in on Sisal; Cache performance of Sisal Vs. FORTRAN; FFT algorithms on a shared-memory multiprocessor; A parallel implementation of nonnumeric search problems in Sisal; Computer vision algorithms in Sisal; Compilation of Sisal for a high-performance data driven vector processor; Sisal on distributed memory machines; A virtual shared addressing system for distributed memory Sisal; Developing a high-performance FFT algorithm in Sisal for a vector supercomputer; Implementation issues for IF2 on a static data-flow architecture; and Systematic control of parallelism in array-based data-flow computation. Selected papers have been indexed separately for inclusion in the Energy Science and Technology Database.
Efficient FFT Algorithm for Psychoacoustic Model of the MPEG-4 AAC

NASA Astrophysics Data System (ADS)

Lee, Jae-Seong; Lee, Chang-Joon; Park, Young-Cheol; Youn, Dae-Hee

This paper proposes an efficient FFT algorithm for the Psycho-Acoustic Model (PAM) of MPEG-4 AAC. The proposed algorithm synthesizes FFT coefficients using MDCT and MDST coefficients through circular convolution. The complexity of the MDCT and MDST coefficients is approximately half of the original FFT. We also design a new PAM based on the proposed FFT algorithm, which has 15% lower computational complexity than the original PAM without degradation of sound quality. Subjective as well as objective test results are presented to confirm the efficiency of the proposed FFT computation algorithm and the PAM.
A general purpose subroutine for fast Fourier transform on a distributed memory parallel machine

NASA Technical Reports Server (NTRS)

Dubey, A.; Zubair, M.; Grosch, C. E.

1992-01-01

One issue which is central in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation on a distributed memory parallel machine Intel iPSC/860 is evaluated.
Improved argument-FFT frequency offset estimation for QPSK coherent optical Systems

NASA Astrophysics Data System (ADS)

Han, Jilong; Li, Wei; Yuan, Zhilin; Li, Haitao; Huang, Liyan; Hu, Qianggao

2016-02-01

A frequency offset estimation (FOE) algorithm based on the fast Fourier transform (FFT) of the signal's argument is investigated, which does not require removing the modulated data phase. In this paper, we analyze the flaw of the argument-FFT algorithm and propose a combined FOE algorithm, in which the absolute value of the frequency offset (FO) is accurately calculated by the argument-FFT algorithm with a relatively large number of samples and the sign of the FO is determined by an FFT-based interpolation discrete Fourier transformation (DFT) algorithm with a relatively small number of samples. Compared with the previous algorithms based on argument-FFT, the proposed one has low complexity and can still work effectively with a relatively small number of samples.
Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators

NASA Astrophysics Data System (ADS)

Hunger, L.; Cosenza, B.; Kimeswenger, S.; Fahringer, T.

2015-11-01

A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, computer graphics, and many others. Current approaches commonly make use of 3D fast Fourier transform (FFT), which does not scale well for RF bigger than the available memory; they are also limited to regular rectilinear meshes. We introduce random field generation with the turning band method (RAFT), an RF generation algorithm based on the turning band method that is optimized for massively parallel hardware such as GPUs and accelerators. Our algorithm replaces the 3D FFT with a lower-order, one-dimensional FFT followed by a projection step and is further optimized with loop unrolling and blocking. RAFT can easily generate RF on non-regular (non-uniform) meshes and efficiently produce fields with mesh sizes bigger than the available device memory by using a streaming, out-of-core approach. Our algorithm generates RF with the correct statistical behavior and is tested on a variety of modern hardware, such as NVIDIA Tesla, AMD FirePro and Intel Phi. RAFT is faster than the traditional methods on regular meshes and has been successfully applied to two real case scenarios: planetary nebulae and cosmological simulations.
FFTs in external or hierarchical memory

NASA Technical Reports Server (NTRS)

Bailey, David H.

1989-01-01

A description is given of advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) use strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the Cray-2, the Cray X-MP, and the Cray Y-MP systems. Using all eight processors on the Cray Y-MP, this main memory routine runs at nearly 2 Gflops.
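
One well-known way to organize an FFT so that the data set is swept in a small number of passes is the "four-step" factorization, which splits a length-N transform with N = n1 * n2 into short transforms, a twiddle multiplication, and a final reordering. The sketch below illustrates that general idea only; whether it matches the exact variants in the report is an assumption, the names `dft` and `four_step_fft` are hypothetical, and a naive O(n^2) DFT stands in for the optimized short FFTs a real code would use.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for an optimized short FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Length n1*n2 DFT built from short transforms (four-step sketch)."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: n1 transforms of length n2 over the strided subsequences
    # x[j1], x[j1 + n1], x[j1 + 2*n1], ...
    rows = [dft(x[j1::n1]) for j1 in range(n1)]
    # Step 2: multiply entry (j1, k2) by the twiddle factor exp(-2*pi*i*j1*k2/n).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)
    # Step 3: n2 transforms of length n1 down the columns.
    cols = [dft([rows[j1][k2] for j1 in range(n1)]) for k2 in range(n2)]
    # Step 4: reorder so that output index k = k2 + n2*k1 is in natural order.
    out = [0j] * n
    for k2 in range(n2):
        for k1 in range(n1):
            out[k2 + n2 * k1] = cols[k2][k1]
    return out

# Assumed sample input of length 32, split as 4 x 8.
signal = [complex(j % 5, (3 * j) % 7) for j in range(32)]
result = four_step_fft(signal, 4, 8)
reference = dft(signal)
print(max(abs(a - b) for a, b in zip(result, reference)))
```

Steps 1 and 3 each sweep the whole data set once, which is the intuition behind transforms that need only a couple of passes over externally stored data.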
FFT Computation with Systolic Arrays, A New Architecture

NASA Technical Reports Server (NTRS)

Boriakoff, Valentin

1994-01-01

The use of the Cooley-Tukey algorithm for computing the 1-D FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.
Fast algorithm for computing complex number-theoretic transforms

NASA Technical Reports Server (NTRS)

Reed, I. S.; Liu, K. Y.; Truong, T. K.

1977-01-01

A high-radix FFT algorithm for computing transforms over GF(q^2), where q is a Mersenne prime, is developed to implement fast circular convolutions. This new algorithm requires substantially fewer multiplications than the conventional FFT.
Real-time digital holographic microscopy using the graphic processing unit.

PubMed

Shimobaba, Tomoyoshi; Sato, Yoshikuni; Miura, Junya; Takenouchi, Mai; Ito, Tomoyoshi

2008-08-04

Digital holographic microscopy (DHM) is a well-known powerful method allowing both the amplitude and phase of a specimen to be simultaneously observed. In order to obtain a reconstructed image from a hologram, numerous calculations for the Fresnel diffraction are required. The Fresnel diffraction can be accelerated by the FFT (Fast Fourier Transform) algorithm. However, real-time reconstruction from a hologram is difficult even if we use a recent central processing unit (CPU) to calculate the Fresnel diffraction by the FFT algorithm. In this paper, we describe a real-time DHM system using a graphic processing unit (GPU) with many stream processors, which allows use as a highly parallel processor. The computational speed of the Fresnel diffraction using the GPU is faster than that of recent CPUs. The real-time DHM system can obtain reconstructed images from holograms whose size is 512 x 512 grids at 24 frames per second.
Parallel Signal Processing and System Simulation using aCe

NASA Technical Reports Server (NTRS)

Dorband, John E.; Aburdene, Maurice F.

2003-01-01

Recently, networked and cluster computation have become very popular for both signal processing and system simulation. A new language is ideally suited for parallel signal processing applications and system simulation since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, the new C based parallel language (ace C) for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures by providing them with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we will focus on some fundamental features of ace C and present a signal processing application (FFT).
Research on fast Fourier transforms algorithm of huge remote sensing image technology with GPU and partitioning technology.

PubMed

Yang, Xue; Li, Xue-You; Li, Jia-Guo; Ma, Jun; Zhang, Li; Yang, Jan; Du, Quan-Ye

2014-02-01

The fast Fourier transform (FFT) is a basic approach to remote sensing image processing. As remote sensing image capture improves in spectral, spatial, and temporal resolution, how to use FFT technology to process huge remote sensing images efficiently has become a critical step and a research hot spot in current image processing technology. The FFT algorithm, one of the basic algorithms of image processing, can be used for stripe noise removal, image compression, image registration, etc. in processing remote sensing images. The CUFFT function library is an FFT algorithm library based on the GPU and modeled on FFTW. FFTW is an FFT library developed for the CPU on the PC platform, and is currently the fastest CPU-based FFT function library. However, the two share a common problem: once the available memory is less than the capacity of the image, an out-of-memory error or memory overflow occurs when either method is used to compute the image FFT. To address this problem, a GPU and partitioning technology based Huge Remote Fast Fourier Transform (HRFFT) algorithm is proposed in this paper. By improving the FFT algorithm in the CUFFT function library, the out-of-memory and memory-overflow problem is solved. Moreover, the method is validated by experiment on a CCD image from the HJ-1A satellite. When applied to practical image processing, it improves the processing result and speeds up the computation.
An efficient three-dimensional Poisson solver for SIMD high-performance-computing architectures

NASA Technical Reports Server (NTRS)

Cohl, H.

1994-01-01

We present an algorithm that solves the three-dimensional Poisson equation on a cylindrical grid. The technique uses a finite-difference scheme with operator splitting. This splitting maps the banded structure of the operator matrix into a two-dimensional set of tridiagonal matrices, which are then solved in parallel. Our algorithm couples FFT techniques with the well-known ADI (Alternating Direction Implicit) method for solving elliptic PDEs, and the implementation is extremely well suited for a massively parallel environment like the SIMD architecture of the MasPar MP-1. Due to the highly recursive nature of our problem, we believe that our method is highly efficient, as it avoids excessive interprocessor communication.

Massively parallel implementation of 3D-RISM calculation with volumetric 3D-FFT.

PubMed

Maruyama, Yutaka; Yoshida, Norio; Tadano, Hiroto; Takahashi, Daisuke; Sato, Mitsuhisa; Hirata, Fumio

2014-07-05

A new three-dimensional reference interaction site model (3D-RISM) program for massively parallel machines combined with the volumetric 3D fast Fourier transform (3D-FFT) was developed, and tested on the RIKEN K supercomputer. The ordinary parallel 3D-RISM program has a limitation on the number of parallelizations because of the limitations of the slab-type 3D-FFT. The volumetric 3D-FFT relieves this limitation drastically. We tested the 3D-RISM calculation on the large and fine calculation cell (2048^3 grid points) on 16,384 nodes, each having eight CPU cores. The new 3D-RISM program achieved excellent scalability to the parallelization, running on the RIKEN K supercomputer. As a benchmark application, we employed the program, combined with molecular dynamics simulation, to analyze the oligomerization process of chymotrypsin inhibitor 2 mutant. The results demonstrate that the massively parallel 3D-RISM program is effective for analyzing the hydration properties of large biomolecular systems.
Algorithms and programming tools for image processing on the MPP:3

NASA Technical Reports Server (NTRS)

Reeves, Anthony P.

1987-01-01

This is the third and final report on the work done for NASA Grant 5-403 on Algorithms and Programming Tools for Image Processing on the MPP:3. All the work done for this grant is summarized in the introduction. Work done since August 1986 is reported in detail. Research for this grant falls under the following headings: (1) fundamental algorithms for the MPP; (2) programming utilities for the MPP; (3) the Parallel Pascal Development System; and (4) performance analysis. In this report, the results of two efforts are reported: region growing, and performance analysis of important characteristic algorithms. In each case, timing results from MPP implementations are included. A paper is included in which parallel algorithms for region growing on the MPP are discussed. These algorithms permit different sized regions to be merged in parallel. Details on the implementation and performance of several important MPP algorithms are given. These include a number of standard permutations, the FFT, convolution, arbitrary data mappings, image warping, and pyramid operations, all of which have been implemented on the MPP. The permutation and image warping functions have been included in the standard development system library.
FFT applications to plane-polar near-field antenna measurements

NASA Technical Reports Server (NTRS)

Gatti, Mark S.; Rahmat-Samii, Yahya

1988-01-01

The four-point bivariate Lagrange interpolation algorithm was applied to near-field antenna data measured in a plane-polar facility. The results were sufficiently accurate to permit the use of the FFT (fast Fourier transform) algorithm to calculate the far-field patterns of the antenna. Good agreement was obtained between the far-field patterns as calculated by the Jacobi-Bessel and the FFT algorithms. The significant advantage in using the FFT is in the calculation of the principal plane cuts, which may be made very quickly. Also, the application of the FFT algorithm directly to the near-field data was used to perform surface holographic diagnosis of a reflector antenna. The effects due to the focusing of the emergent beam from the reflector, as well as the effects of the information in the wide-angle regions, are shown. The use of the plane-polar near-field antenna test range has therefore been expanded to include these useful FFT applications.
A Comparison of Direction Finding Results From an FFT Peak Identification Technique With Those From the Music Algorithm

DTIC Science & Technology

1991-07-01

A Comparison of Direction Finding Results From an FFT Peak Identification Technique With Those From the MUSIC Algorithm (U), by L.E. Montbriand. CRC Report No. 1438, Ottawa, July 1991. Government of Canada / Gouvernement du Canada, Department of...
Effects of Computer Architecture on FFT (Fast Fourier Transform) Algorithm Performance.

DTIC Science & Technology

1983-12-01

Criteria for Efficient Implementation of FFT Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-30, pp. 107-109, Feb...1982. Burrus, C. S. and P. W. Eschenbacher. "An In-Place, In-Order Prime Factor FFT Algorithm," IEEE Transactions on Acoustics, Speech, and Signal... Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-30, pp. 217-226, Apr. 1982. Control Data Corporation. CDC Cyber 170 Computer Systems
Model-based spectral estimation of Doppler signals using parallel genetic algorithms.

PubMed

Solano González, J; Rodríguez Vázquez, K; García Nocetti, D F

2000-05-01

Conventional spectral analysis methods use a fast Fourier transform (FFT) on consecutive or overlapping windowed data segments. For Doppler ultrasound signals, this approach suffers from an inadequate frequency resolution due to the time segment duration and the non-stationarity characteristics of the signals. Parametric or model-based estimators can give significant improvements in the time-frequency resolution at the expense of a higher computational complexity. This work describes an approach which implements in real-time a parametric spectral estimator method using genetic algorithms (GAs) in order to find the optimum set of parameters for the adaptive filter that minimises the error function. The aim is to reduce the computational complexity of the conventional algorithm by using the simplicity associated with GAs and exploiting their parallel characteristics. This will allow the implementation of higher order filters, increasing the spectrum resolution, and opening a greater scope for using more complex methods.
Multitasking domain decomposition fast Poisson solvers on the Cray Y-MP

NASA Technical Reports Server (NTRS)

Chan, Tony F.; Fatoohi, Rod A.

1990-01-01

The results of multitasking implementation of a domain decomposition fast Poisson solver on eight processors of the Cray Y-MP are presented. The object of this research is to study the performance of domain decomposition methods on a Cray supercomputer and to analyze the performance of different multitasking techniques using highly parallel algorithms. Two implementations of multitasking are considered: macrotasking (parallelism at the subroutine level) and microtasking (parallelism at the do-loop level). A conventional FFT-based fast Poisson solver is also multitasked. The results of different implementations are compared and analyzed. A speedup of over 7.4 on the Cray Y-MP running in a dedicated environment is achieved for all cases.
A new fast algorithm for computing complex number-theoretic transforms

NASA Technical Reports Server (NTRS)

Reed, I. S.; Liu, K. Y.; Truong, T. K.

1977-01-01

A high-radix fast Fourier transformation (FFT) algorithm for computing transforms over GF(q^2), where q is a Mersenne prime, is developed to implement fast circular convolutions. This new algorithm requires substantially fewer multiplications than the conventional FFT.
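
To illustrate why a transform over the field of q^2 elements with q a Mersenne prime can compute circular convolutions exactly, the sketch below works with "complex integers" a + bi modulo q = 127 = 2^7 - 1. It is a toy under stated assumptions, not the high-radix algorithm of the report: the transform is evaluated naively in O(n^2), the root of unity is found by brute force, and all names (`mul`, `power`, `find_root_of_unity`, `transform`, `circular_convolution`) are hypothetical.

```python
Q = 127  # Mersenne prime 2**7 - 1; -1 is not a square mod Q, so i*i = -1 extends the field

def mul(a, b):
    """(a0 + a1*i) * (b0 + b1*i) with i*i = -1 and coefficients mod Q."""
    return ((a[0] * b[0] - a[1] * b[1]) % Q, (a[0] * b[1] + a[1] * b[0]) % Q)

def power(a, e):
    """a**e by repeated multiplication (e is small here)."""
    result = (1, 0)
    for _ in range(e):
        result = mul(result, a)
    return result

def find_root_of_unity(n):
    """Brute-force search for an element of order n (n a power of two)."""
    for re in range(Q):
        for im in range(Q):
            # w**(n/2) == -1 implies the order of w is exactly n.
            if power((re, im), n // 2) == (Q - 1, 0):
                return (re, im)
    raise ValueError("no element of the requested order")

def transform(x, w):
    """Naive length-n transform X[k] = sum_j x[j] * w**(j*k) over the field."""
    n = len(x)
    out = []
    for k in range(n):
        acc = (0, 0)
        for j in range(n):
            term = mul(x[j], power(w, (j * k) % n))
            acc = ((acc[0] + term[0]) % Q, (acc[1] + term[1]) % Q)
        out.append(acc)
    return out

def circular_convolution(a, b):
    """Exact circular convolution of integer sequences (results must be < Q)."""
    n = len(a)
    w = find_root_of_unity(n)
    fa = transform([(v % Q, 0) for v in a], w)
    fb = transform([(v % Q, 0) for v in b], w)
    product = [mul(p, q) for p, q in zip(fa, fb)]
    w_inv = power(w, n - 1)      # inverse of w, since w**n == 1
    n_inv = pow(n, Q - 2, Q)     # inverse of n mod Q by Fermat's little theorem
    return [(v[0] * n_inv) % Q for v in transform(product, w_inv)]

conv = circular_convolution([1, 2, 3, 4, 0, 0, 0, 0], [5, 6, 7, 0, 0, 0, 0, 0])
print(conv)  # [5, 16, 34, 52, 45, 28, 0, 0]
```

Because all arithmetic is modular, there is no rounding error; the result is exact as long as every convolution value stays below q.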
Efficient Modeling of Gravity Fields Caused by Sources with Arbitrary Geometry and Arbitrary Density Distribution

NASA Astrophysics Data System (ADS)

Wu, Leyuan

2018-01-01

We present a brief review of gravity forward algorithms in Cartesian coordinate system, including both space-domain and Fourier-domain approaches, after which we introduce a truly general and efficient algorithm, namely the convolution-type Gauss fast Fourier transform (Conv-Gauss-FFT) algorithm, for 2D and 3D modeling of gravity potential and its derivatives due to sources with arbitrary geometry and arbitrary density distribution which are defined either by discrete or by continuous functions. The Conv-Gauss-FFT algorithm is based on the combined use of a hybrid rectangle-Gaussian grid and the fast Fourier transform (FFT) algorithm. Since the gravity forward problem in Cartesian coordinate system can be expressed as continuous convolution-type integrals, we first approximate the continuous convolution by a weighted sum of a series of shifted discrete convolutions, and then each shifted discrete convolution, which is essentially a Toeplitz system, is calculated efficiently and accurately by combining circulant embedding with the FFT algorithm. Synthetic and real model tests show that the Conv-Gauss-FFT algorithm can obtain high-precision forward results very efficiently for almost any practical model, and it works especially well for complex 3D models when gravity fields on large 3D regular grids are needed.
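
The step in which a Toeplitz system is evaluated by combining circulant embedding with the FFT can be sketched as a matrix-vector product. The code below is a generic illustration of that technique, not the Conv-Gauss-FFT algorithm itself: the function names and the small 4 x 4 example are assumptions, and a recursive radix-2 FFT stands in for a library routine, so 2n must be a power of two.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def toeplitz_matvec(first_col, first_row, x):
    """Multiply an n x n Toeplitz matrix by x using a 2n x 2n circulant embedding."""
    n = len(x)
    # First column of the circulant: the Toeplitz first column, one padding
    # zero, then the Toeplitz first row reversed (without its first entry).
    v = list(first_col) + [0.0] + list(first_row[:0:-1])
    padded = list(x) + [0.0] * n
    spectrum = [a * b for a, b in zip(fft(v), fft(padded))]
    y = fft(spectrum, inverse=True)
    # The first n entries of the circulant product are the Toeplitz product.
    return [val.real / (2 * n) for val in y[:n]]

def toeplitz_matvec_direct(first_col, first_row, x):
    """Direct O(n^2) product, used only as a reference."""
    n = len(x)
    return [sum((first_col[i - j] if i >= j else first_row[j - i]) * x[j]
                for j in range(n)) for i in range(n)]

# Assumed example: T has first column [4, 1, 2, 3] and first row [4, 5, 6, 7].
col, row, vec = [4.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0]
fast = toeplitz_matvec(col, row, vec)
print([round(v, 6) for v in fast])  # [60.0, 48.0, 36.0, 26.0]
```

The FFT route costs O(n log n) per product instead of O(n^2), which is why the embedding pays off on large regular grids.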
Efficient implementation of multidimensional fast Fourier transform on a distributed-memory parallel multi-node computer

DOEpatents

Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

2012-01-10

The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
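
The sequence of steps in this abstract (one-dimensional FFTs along a first dimension, redistribution, one-dimensional FFTs along a second dimension) can be mimicked on a single machine for a two-dimensional array. In the sketch below a plain transpose stands in for the "all-to-all" redistribution; the randomized ordering of network traffic that the patent emphasizes is not modeled, and all function names and the sample array are assumptions made for illustration.

```python
import cmath

def fft(x):
    """Recursive radix-2 forward FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def transpose(a):
    """Swap the two array dimensions; stands in for the all-to-all exchange."""
    return [list(col) for col in zip(*a)]

def fft2_via_redistribution(a):
    """2-D FFT as row FFTs, a redistribution, and a second set of row FFTs."""
    first = [fft(row) for row in a]        # each row could live on one node
    moved = transpose(first)               # redistribute along the second dimension
    second = [fft(row) for row in moved]   # second batch of 1-D FFTs
    return transpose(second)               # restore the original layout

def dft2_direct(a):
    """Direct 2-D DFT from the definition, used only as a reference."""
    n1, n2 = len(a), len(a[0])
    return [[sum(a[j1][j2] * cmath.exp(-2j * cmath.pi * (j1 * k1 / n1 + j2 * k2 / n2))
                 for j1 in range(n1) for j2 in range(n2))
             for k2 in range(n2)] for k1 in range(n1)]

# Assumed sample data: a 4 x 8 array.
grid = [[complex((3 * i + j) % 5, (i * j) % 3) for j in range(8)] for i in range(4)]
fast2 = fft2_via_redistribution(grid)
ref2 = dft2_direct(grid)
print(max(abs(fast2[i][j] - ref2[i][j]) for i in range(4) for j in range(8)))
```

On a real distributed-memory machine the transpose is the communication-heavy step, which is why the ordering of the all-to-all exchange matters for network utilization.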
Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

DOEpatents

Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

2008-01-01

The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
Advanced digital SAR processing study

NASA Technical Reports Server (NTRS)

Martinson, L. W.; Gaffney, B. P.; Liu, B.; Perry, R. P.; Ruvin, A.

1982-01-01

A highly programmable, land based, real time synthetic aperture radar (SAR) processor requiring a processed pixel rate of 2.75 MHz or more in a four look system was designed. Variations in range and azimuth compression, number of looks, range swath, range migration and SR mode were specified. Alternative range and azimuth processing algorithms were examined in conjunction with projected integrated circuit, digital architecture, and software technologies. The advanced digital SAR processor (ADSP) employs an FFT convolver algorithm for both range and azimuth processing in a parallel architecture configuration. Algorithm performance comparisons, system design, implementation tradeoffs and the results of a supporting survey of integrated circuit and digital architecture technologies are reported. Cost tradeoffs and projections with alternate implementation plans are presented.
The Block V Receiver fast acquisition algorithm for the Galileo S-band mission

NASA Technical Reports Server (NTRS)

Aung, M.; Hurd, W. J.; Buu, C. M.; Berner, J. B.; Stephens, S. A.; Gevargiz, J. M.

1994-01-01

A fast acquisition algorithm for the Galileo suppressed carrier, subcarrier, and data symbol signals under low data rate, low signal-to-noise ratio (SNR), and high carrier phase-noise conditions has been developed. The algorithm employs a two-arm fast Fourier transform (FFT) method utilizing both the in-phase and quadrature-phase channels of the carrier. The use of both channels results in an improved SNR in the FFT acquisition, enabling the use of a shorter FFT period over which the carrier instability is expected to be less significant. The use of a two-arm FFT also enables subcarrier and symbol acquisition before carrier acquisition. With the subcarrier and symbol loops locked first, the carrier can be acquired from an even shorter FFT period. Two-arm tracking loops are employed to lock the subcarrier and symbol loops, with parameter modification to achieve the final (high) loop SNR in the shortest time possible. The fast acquisition algorithm is implemented in the Block V Receiver (BVR). This article describes the complete algorithm design, the extensive computer simulation work done for verification of the design and the analysis, implementation issues in the BVR, and the acquisition times of the algorithm. In the expected case of the Galileo spacecraft at Jupiter orbit insertion PD/No equals 14.6 dB-Hz, R(sym) equals 16 symbols per sec, and the predicted acquisition time of the algorithm (to attain a 0.2-dB degradation from each loop to the output symbol SNR) is 38 sec.
An improved conscan algorithm based on a Kalman filter

NASA Technical Reports Server (NTRS)

Eldred, D. B.

1994-01-01

Conscan is commonly used by DSN antennas to allow adaptive tracking of a target whose position is not precisely known. This article describes an algorithm that is based on a Kalman filter and is proposed to replace the existing fast Fourier transform based (FFT-based) algorithm for conscan. Advantages of this algorithm include better pointing accuracy, continuous update information, and accommodation of missing data. Additionally, a strategy for adaptive selection of the conscan radius is proposed. The performance of the algorithm is illustrated through computer simulations and compared to the FFT algorithm. The results show that the Kalman filter algorithm is consistently superior.
GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing

NASA Astrophysics Data System (ADS)

Johl, John T.; Baker, Nick C.

1988-10-01

The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.
High-Throughput, Adaptive FFT Architecture for FPGA-Based Spaceborne Data Processors

NASA Technical Reports Server (NTRS)

NguyenKobayashi, Kayla; Zheng, Jason X.; He, Yutao; Shah, Biren N.

2011-01-01

Exponential growth in microelectronics technology such as field-programmable gate arrays (FPGAs) has enabled high-performance spaceborne instruments with increasing onboard data processing capabilities. As a commonly used digital signal processing (DSP) building block, the fast Fourier transform (FFT) has been of great interest in onboard data processing applications, which need to strike a reasonable balance between high performance (throughput, block size, etc.) and low resource usage (power, silicon footprint, etc.). It is also desirable that a single design can be reused and adapted into instruments with different requirements. The Multi-Pass Wide Kernel FFT (MPWK-FFT) architecture was developed, in which the high-throughput benefits of the parallel FFT structure and the low resource usage of Singleton's single butterfly method are exploited. The result is a wide-kernel, multipass, adaptive FFT architecture. The 32K-point MPWK-FFT architecture includes 32 radix-2 butterflies, 64 FIFOs to store the real inputs, 64 FIFOs to store the imaginary inputs, complex twiddle factor storage, and FIFO logic to route the outputs to the correct FIFO. The inputs are stored in sequential fashion into the FIFOs, and the outputs of each butterfly are sequentially written first into the even FIFO, then the odd FIFO. Because of the order of the outputs written into the FIFOs, the depth of the even FIFOs, which is 768 each, is 1.5 times larger than that of the odd FIFOs, which is 512 each. The total memory needed for data storage, assuming that each sample is 36 bits, is 2.95 Mbits. The twiddle factors are stored in internal ROM inside the FPGA for fast access time. The total memory size to store the twiddle factors is 589.9 Kbits. This FFT structure combines the benefits of high throughput from the parallel FFT kernels and low resource usage from the multi-pass FFT kernels with desired adaptability.
Space instrument missions that need onboard FFT capabilities such as the proposed DESDynI, SWOT (Surface Water Ocean Topography), and Europa sounding radar missions would greatly benefit from this technology with significant reductions in non-recurring cost and risk.
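The memory figures quoted in the abstract above can be cross-checked with a short calculation. The breakdown below is an inference, not something the abstract states: it assumes that half of the 128 FIFOs are "even" (depth 768) and half are "odd" (depth 512), and that the twiddle ROM holds N/2 entries of 36 bits each for N = 32K. Under those assumptions the data storage matches the quoted 2.95 Mbits, while the twiddle storage comes out at 589,824 bits, which is close to but not exactly the quoted 589.9 Kbits, so the twiddle breakdown in particular should be read as a plausible guess.

```python
# Back-of-the-envelope check of the MPWK-FFT memory figures.
# All groupings below are assumptions made for illustration.
SAMPLE_BITS = 36
N_POINTS = 32 * 1024

EVEN_FIFOS, EVEN_DEPTH = 64, 768   # assumed: half of the 128 FIFOs are "even"
ODD_FIFOS, ODD_DEPTH = 64, 512     # assumed: the other half are "odd"

data_words = EVEN_FIFOS * EVEN_DEPTH + ODD_FIFOS * ODD_DEPTH
data_bits = data_words * SAMPLE_BITS

# Assumed: one 36-bit ROM word per twiddle factor, N/2 twiddle factors.
twiddle_bits = (N_POINTS // 2) * SAMPLE_BITS

print(f"data storage:    {data_bits} bits ({data_bits / 1e6:.2f} Mbit)")
print(f"twiddle storage: {twiddle_bits} bits ({twiddle_bits / 1e3:.1f} Kbit)")
print(f"even/odd depth ratio: {EVEN_DEPTH / ODD_DEPTH}")
```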
Further optimization of SeDDaRA blind image deconvolution algorithm and its DSP implementation

NASA Astrophysics Data System (ADS)

Wen, Bo; Zhang, Qiheng; Zhang, Jianlin

2011-11-01

An efficient algorithm for blind image deconvolution, together with a high-speed implementation, is of great practical value. SeDDaRA is further optimized here, from the algorithm structure down to the numerical calculation methods. The main optimizations are modularization of the structure for ease of implementation, reduction of the data computation and dependency of the 2D-FFT/IFFT, and acceleration of the power operation by a segmented look-up table. The resulting Fast SeDDaRA is specialized for low complexity. As the final implementation, a hardware image restoration system is built using multi-DSP parallel processing. Experimental results show that the processing time and memory demand of Fast SeDDaRA decrease by at least 50%, and the data throughput of the image restoration system exceeds 7.8 Msps. The optimization is shown to be efficient and feasible, and Fast SeDDaRA is able to support real-time applications.
An algorithm for the basis of the finite Fourier transform

NASA Technical Reports Server (NTRS)

Santhanam, Thalanayar S.

1995-01-01

The Finite Fourier Transformation matrix (F.F.T.) plays a central role in the formulation of quantum mechanics in a finite dimensional space studied by the author over the past couple of decades. An outstanding problem which still remains open is to find a complete basis for F.F.T. In this paper we suggest a simple algorithm to find the eigenvectors of F.F.T.
Spread Spectrum Signal Characteristic Estimation Using Exponential Averaging and an AD-HOC Chip rate Estimator

DTIC Science & Technology

2007-03-01

Quadrature QPSK Quadrature Phase-Shift Keying RV Random Variable SHAC Single-Hop-Observation Auto-Correlation SINR Signal-to-Interference...The fast Fourier transform (FFT) accumulation method and the strip spectral correlation algorithm subdivide the support region in the bi-frequency...diamond shapes, while the strip spectral correlation algorithm subdivides the region into strips. Each strip covers a number of the FFT accumulation
Joint compensation scheme of polarization crosstalk, intersymbol interference, frequency offset, and phase noise based on cascaded Kalman filter

NASA Astrophysics Data System (ADS)

Zhang, Qun; Yang, Yanfu; Xiang, Qian; Zhou, Zhongqing; Yao, Yong

2018-02-01

A joint compensation scheme based on a cascaded Kalman filter is proposed, which can implement polarization tracking, channel equalization, frequency offset compensation, and phase noise compensation simultaneously. The experimental results show that the proposed algorithm can not only compensate multiple channel impairments simultaneously but also improve the polarization tracking capacity and accelerate the convergence speed. The scheme converges up to eight times faster than radius-directed equalizer (RDE) + Max-FFT (maximum fast Fourier transform) + BPS (blind phase search) and can track polarization rotation 60 times and 15 times faster than RDE + Max-FFT + BPS and CMMA (cascaded multimodulus algorithm) + Max-FFT + BPS, respectively.

A combined finite element-boundary integral formulation for solution of two-dimensional scattering problems via CGFFT. [Conjugate Gradient Fast Fourier Transformation]

NASA Technical Reports Server (NTRS)

Collins, Jeffery D.; Volakis, John L.; Jin, Jian-Ming

1990-01-01

A new technique is presented for computing the scattering by 2-D structures of arbitrary composition. The proposed solution approach combines the usual finite element method with the boundary-integral equation to formulate a discrete system. This is subsequently solved via the conjugate gradient (CG) algorithm. A particular characteristic of the method is the use of rectangular boundaries to enclose the scatterer. Several of the resulting boundary integrals are therefore convolutions and may be evaluated via the fast Fourier transform (FFT) in the implementation of the CG algorithm. The solution approach offers the principal advantage of having O(N) memory demand and employs a 1-D FFT versus a 2-D FFT as required with a traditional implementation of the CGFFT algorithm. The speed of the proposed solution method is compared with that of the traditional CGFFT algorithm, and results for rectangular bodies are given and shown to be in excellent agreement with the moment method.
A combined finite element and boundary integral formulation for solution via CGFFT of 2-dimensional scattering problems

NASA Technical Reports Server (NTRS)

Collins, Jeffery D.; Volakis, John L.

1989-01-01

A new technique is presented for computing the scattering by 2-D structures of arbitrary composition. The proposed solution approach combines the usual finite element method with the boundary integral equation to formulate a discrete system. This is subsequently solved via the conjugate gradient (CG) algorithm. A particular characteristic of the method is the use of rectangular boundaries to enclose the scatterer. Several of the resulting boundary integrals are therefore convolutions and may be evaluated via the fast Fourier transform (FFT) in the implementation of the CG algorithm. The solution approach offers the principal advantage of having O(N) memory demand and employs a 1-D FFT versus a 2-D FFT as required with a traditional implementation of the CGFFT algorithm. The speed of the proposed solution method is compared with that of the traditional CGFFT algorithm, and results for rectangular bodies are given and shown to be in excellent agreement with the moment method.
A Spaceborne Synthetic Aperture Radar Partial Fixed-Point Imaging System Using a Field-Programmable Gate Array-Application-Specific Integrated Circuit Hybrid Heterogeneous Parallel Acceleration Technique

PubMed Central

Li, Bingyi; Chen, Liang; Wei, Chunpeng; Xie, Yizhuang; Chen, He; Yu, Wenyue

2017-01-01

With the development of satellite load technology and very large scale integrated (VLSI) circuit technology, onboard real-time synthetic aperture radar (SAR) imaging systems have become a solution for allowing rapid response to disasters. A key goal of the onboard SAR imaging system design is to achieve high real-time processing performance with severe size, weight, and power consumption constraints. In this paper, we analyse the computational burden of the commonly used chirp scaling (CS) SAR imaging algorithm. To reduce the system hardware cost, we propose a partial fixed-point processing scheme. The fast Fourier transform (FFT), which is the most computation-sensitive operation in the CS algorithm, is processed with fixed-point, while other operations are processed with single precision floating-point. With the proposed fixed-point processing error propagation model, the fixed-point processing word length is determined. The fidelity and accuracy relative to conventional ground-based software processors are verified by evaluating both the point target imaging quality and the actual scene imaging quality. As a proof of concept, a field-programmable gate array-application-specific integrated circuit (FPGA-ASIC) hybrid heterogeneous parallel accelerating architecture is designed and realized. The customized fixed-point FFT is implemented using the 130 nm complementary metal oxide semiconductor (CMOS) technology as a co-processor of the Xilinx xc6vlx760t FPGA. A single processing board requires 12 s and consumes 21 W to focus a 50-km swath width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384. PMID:28672813
A Spaceborne Synthetic Aperture Radar Partial Fixed-Point Imaging System Using a Field-Programmable Gate Array-Application-Specific Integrated Circuit Hybrid Heterogeneous Parallel Acceleration Technique.

PubMed

Yang, Chen; Li, Bingyi; Chen, Liang; Wei, Chunpeng; Xie, Yizhuang; Chen, He; Yu, Wenyue

2017-06-24

With the development of satellite load technology and very large scale integrated (VLSI) circuit technology, onboard real-time synthetic aperture radar (SAR) imaging systems have become a solution for allowing rapid response to disasters. A key goal of the onboard SAR imaging system design is to achieve high real-time processing performance with severe size, weight, and power consumption constraints. In this paper, we analyse the computational burden of the commonly used chirp scaling (CS) SAR imaging algorithm. To reduce the system hardware cost, we propose a partial fixed-point processing scheme. The fast Fourier transform (FFT), which is the most computation-sensitive operation in the CS algorithm, is processed with fixed-point, while other operations are processed with single precision floating-point. With the proposed fixed-point processing error propagation model, the fixed-point processing word length is determined. The fidelity and accuracy relative to conventional ground-based software processors are verified by evaluating both the point target imaging quality and the actual scene imaging quality. As a proof of concept, a field-programmable gate array-application-specific integrated circuit (FPGA-ASIC) hybrid heterogeneous parallel accelerating architecture is designed and realized. The customized fixed-point FFT is implemented using the 130 nm complementary metal oxide semiconductor (CMOS) technology as a co-processor of the Xilinx xc6vlx760t FPGA. A single processing board requires 12 s and consumes 21 W to focus a 50-km swath width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384.
Acoustic 3D modeling by the method of integral equations

NASA Astrophysics Data System (ADS)

Malovichko, M.; Khokhlov, N.; Yavich, N.; Zhdanov, M.

2018-02-01

This paper presents a parallel algorithm for frequency-domain acoustic modeling by the method of integral equations (IE). The algorithm is applied to seismic simulation. The IE method reduces the size of the problem but leads to a dense system matrix. A tolerable memory consumption and numerical complexity were achieved by applying an iterative solver, accompanied by an effective matrix-vector multiplication operation, based on the fast Fourier transform (FFT). We demonstrate that the IE system matrix is better conditioned than that of the finite-difference (FD) method, and discuss its relation to a specially preconditioned FD matrix. We considered several methods of matrix-vector multiplication for the free-space and layered host models. The developed algorithm and computer code were benchmarked against the FD time-domain solution. It was demonstrated that the method could accurately calculate the seismic field for the models with sharp material boundaries and a point source and receiver located close to the free surface. We used OpenMP to speed up the matrix-vector multiplication, while MPI was used to speed up the solution of the system equations, and also for parallelizing across multiple sources. Practical examples and efficiency tests are presented as well.
DFT Performance Prediction in FFTW

NASA Astrophysics Data System (ADS)

Gu, Liang; Li, Xiaoming

Fastest Fourier Transform in the West (FFTW) is an adaptive FFT library that generates highly efficient Discrete Fourier Transform (DFT) implementations. It is one of the fastest FFT libraries available and it outperforms many adaptive or hand-tuned DFT libraries. Its success largely relies on the huge search space spanned by several FFT algorithms and a set of compiler generated C code (called codelets) for small size DFTs. FFTW empirically finds the best algorithm by measuring the performance of different algorithm combinations. Although the empirical search works very well for FFTW, the search process does not explain why the best plan found performs best, and the search overhead grows polynomially as the DFT size increases. The opposite of empirical search is model-driven optimization. However, it is widely believed that model-driven optimization is inferior to empirical search and is particularly powerless to solve problems as complex as the optimization of DFT.
A High-Order Direct Solver for Helmholtz Equations with Neumann Boundary Conditions

NASA Technical Reports Server (NTRS)

Sun, Xian-He; Zhuang, Yu

1997-01-01

In this study, a compact finite-difference discretization is first developed for Helmholtz equations on rectangular domains. Special treatments are then introduced for Neumann and Neumann-Dirichlet boundary conditions to achieve accuracy and separability. Finally, a Fast Fourier Transform (FFT) based technique is used to yield a fast direct solver. Analytical and experimental results show that this newly proposed solver is comparable to the conventional second-order elliptic solver when accuracy is not a primary concern, and is significantly faster than the conventional solver if a highly accurate solution is required. In addition, this newly proposed fourth-order Helmholtz solver is parallel in nature. It is readily available for parallel and distributed computers. The compact scheme introduced in this study is likely extendable to sixth-order accurate algorithms and to more general elliptic equations.
High performance Python for direct numerical simulations of turbulent flows

NASA Astrophysics Data System (ADS)

Mortensen, Mikael; Langtangen, Hans Petter

2016-06-01

Direct Numerical Simulation (DNS) of the Navier-Stokes equations is an invaluable research tool in fluid dynamics. Still, there are few publicly available research codes and, due to the heavy number crunching implied, available codes are usually written in low-level languages such as C/C++ or Fortran. In this paper we describe a pure scientific Python pseudo-spectral DNS code that nearly matches the performance of C++ for thousands of processors and billions of unknowns. We also describe a version optimized through Cython that is found to match the speed of C++. The solvers are written from scratch in Python, including the mesh, the MPI domain decomposition, and the temporal integrators. The solvers have been verified and benchmarked on the Shaheen supercomputer at the KAUST supercomputing laboratory, and we are able to show very good scaling up to several thousand cores. A very important part of the implementation is the mesh decomposition (we implement both slab and pencil decompositions) and 3D parallel Fast Fourier Transforms (FFT). The mesh decomposition and FFT routines have been implemented in Python using serial FFT routines (either NumPy, pyFFTW or any other serial FFT module), NumPy array manipulations and with MPI communications handled by MPI for Python (mpi4py). We show how we are able to execute a 3D parallel FFT in Python for a slab mesh decomposition using 4 lines of compact Python code, for which the parallel performance on Shaheen is found to be slightly better than similar routines provided through the FFTW library. For a pencil mesh decomposition 7 lines of code are required to execute a transform.
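The slab decomposition described in the abstract above can be sketched in a few lines of NumPy. This is not the authors' code and it does not use mpi4py: the MPI ranks are simulated by list entries in a single process, and the name `slab_fft3` is hypothetical. The sketch only shows the data movement, namely local 2D FFTs on each slab, a global transpose (the role an MPI all-to-all would play), and a final 1D FFT along the axis that was originally distributed.

```python
import numpy as np

def slab_fft3(u, num_ranks):
    """3D FFT with a simulated slab decomposition along axis 0.

    Each "rank" is a list entry; a real code would exchange the blocks
    with an MPI all-to-all instead of slicing a shared array.
    """
    n0_total, n1_total, _ = u.shape
    assert n0_total % num_ranks == 0 and n1_total % num_ranks == 0
    n0, n1 = n0_total // num_ranks, n1_total // num_ranks

    # Each rank owns a slab u[r*n0:(r+1)*n0, :, :] and transforms the two
    # axes that are fully local to it.
    local = [np.fft.fft2(u[r * n0:(r + 1) * n0], axes=(1, 2))
             for r in range(num_ranks)]

    # Global transpose: rank r collects, from every sender s, the block that
    # covers its share of axis 1, so that axis 0 becomes fully local.
    redistributed = [
        np.concatenate([local[s][:, r * n1:(r + 1) * n1, :]
                        for s in range(num_ranks)], axis=0)
        for r in range(num_ranks)
    ]

    # Final 1D transform along the axis that was distributed at the start.
    out = [np.fft.fft(block, axis=0) for block in redistributed]
    return np.concatenate(out, axis=1)

rng = np.random.default_rng(0)
u = rng.standard_normal((8, 8, 6))
print(np.allclose(slab_fft3(u, 4), np.fft.fftn(u)))
```

A slab decomposition limits the number of ranks to the smallest distributed axis length, which is one reason the abstract also mentions pencil decompositions.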
A novel power harmonic analysis method based on Nuttall-Kaiser combination window double spectrum interpolated FFT algorithm

NASA Astrophysics Data System (ADS)

Jin, Tao; Chen, Yiyang; Flesch, Rodolfo C. C.

2017-11-01

Harmonics pose a great threat to the safe and economical operation of power grids. Therefore, it is critical to detect harmonic parameters accurately in order to design harmonic compensation equipment. The fast Fourier transform (FFT) is widely used for electrical power harmonic analysis. However, the barrier effect produced by the algorithm itself and the spectrum leakage caused by asynchronous sampling often affect the accuracy of harmonic analysis. This paper examines a new approach for harmonic analysis based on deriving correction formulas for frequency, phase angle, and amplitude, utilizing the Nuttall-Kaiser window double spectrum line interpolation method, which overcomes the shortcomings of traditional FFT harmonic calculations. The proposed approach is verified numerically and experimentally to be accurate and reliable.
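The idea of double spectrum line interpolation can be illustrated with a minimal sketch. Note the substitution: the abstract uses a Nuttall-Kaiser combination window, whereas this example uses a plain Hann window, for which the interpolation relation has a simple closed form. The function name `interpolated_fft_frequency` is hypothetical, and only the frequency correction is shown (not phase or amplitude). For a Hann window and the two largest adjacent spectral lines y1 and y2, the fractional bin offset is approximately 0.5 + 1.5 (y2 - y1) / (y2 + y1).

```python
import numpy as np

def interpolated_fft_frequency(signal, fs):
    """Estimate the dominant frequency from the two largest adjacent FFT lines.

    Uses a Hann window (a stand-in for the Nuttall-Kaiser window in the paper).
    """
    n = len(signal)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)  # periodic Hann
    mag = np.abs(np.fft.rfft(signal * window))
    peak = int(np.argmax(mag[1:-1])) + 1      # skip the DC and Nyquist bins
    # The true spectral line lies between the peak bin and its larger neighbour.
    k1 = peak if mag[peak + 1] >= mag[peak - 1] else peak - 1
    y1, y2 = mag[k1], mag[k1 + 1]
    beta = (y2 - y1) / (y2 + y1)
    delta = 0.5 + 1.5 * beta   # Hann-specific relation; other windows differ
    return (k1 + delta) * fs / n

# Asynchronous sampling: 50.3 Hz does not fall on an FFT bin.
fs, n = 1000.0, 1024
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 50.3 * t + 0.4) + 0.1 * np.sin(2 * np.pi * 150.9 * t)
print(interpolated_fft_frequency(x, fs))
```

Without interpolation the estimate is limited to the bin spacing fs/n (about 0.98 Hz here), which is the accuracy loss that interpolated-FFT methods are designed to recover.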
Fast quantum nD Fourier and radon transforms

NASA Astrophysics Data System (ADS)

Labunets, Valeri G.; Labunets-Rundblad, Ekaterina V.; Astola, Jaakko T.

2001-07-01

Fast classical and quantum algorithms are introduced for a wide class of non-separable nD discrete unitary K-transforms (DKT) KNn. They require a number of 1D DKT Kn smaller than in the Cooley-Tukey radix-p FFT-type approach. The method utilizes a decomposition of the nD K-transform into a product of an original nD discrete Radon transform and a family of parallel/independent 1D K-transforms. If the nD K-transform has a separable kernel, our approach again leads to a decrease of multiplicative complexity by a factor of n compared to the row/column separable Cooley-Tukey radix-p approach.
Numerical evaluation of the radiation from unbaffled, finite plates using the FFT

NASA Technical Reports Server (NTRS)

Williams, E. G.

1983-01-01

An iteration technique is described which numerically evaluates the acoustic pressure and velocity on and near unbaffled, finite, thin plates vibrating in air. The technique is based on Rayleigh's integral formula and its inverse. These formulas are written in their angular spectrum form so that the fast Fourier transform (FFT) algorithm may be used to evaluate them. As an example of the technique the pressure on the surface of a vibrating, unbaffled disk is computed and shown to be in excellent agreement with the exact solution using oblate spheroidal functions. Furthermore, the computed velocity field outside the disk shows the well-known singularity at the rim of the disk. The radiated fields from unbaffled flat sources of any geometry with prescribed surface velocity may be evaluated using this technique. The use of the FFT to perform the integrations in Rayleigh's formulas provides a great savings in computation time compared with standard integration algorithms, especially when an array processor can be used to implement the FFT.
High-accuracy 3D Fourier forward modeling of gravity field based on the Gauss-FFT technique

NASA Astrophysics Data System (ADS)

Zhao, Guangdong; Chen, Bo; Chen, Longwei; Liu, Jianxin; Ren, Zhengyong

2018-03-01

The 3D Fourier forward modeling of 3D density sources is capable of providing 3D gravity anomalies coinciding with the meshed density distribution within the whole source region. This paper first derives a set of analytical expressions, by employing 3D Fourier transforms, for calculating the gravity anomalies of a 3D density source approximated by right rectangular prisms. To reduce the errors due to aliasing and imposed periodicity as well as edge effects in the Fourier domain modeling, we apply the 3D Gauss-FFT technique to 3D gravity anomaly forward modeling. The capability and adaptability of this scheme are tested by simple synthetic models. The results show that the accuracy of the Fourier forward methods using the Gauss-FFT with 4 Gaussian nodes (or more) is comparable to that of the spatial modeling. In addition, the "ghost" source effects in the 3D Fourier forward gravity field due to the imposed periodicity of the standard FFT algorithm are remarkably suppressed by the application of the 3D Gauss-FFT algorithm. More importantly, the execution times of the 4-node Gauss-FFT modeling are reduced by two orders of magnitude compared with the spatial forward method. This demonstrates that the improved Fourier method is an efficient and accurate forward modeling tool for the gravity field.
Accelerated Adaptive MGS Phase Retrieval

NASA Technical Reports Server (NTRS)

Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang

2011-01-01

The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm to allow parallel processing of certain applications that apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited for this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time. The exploit involves performing matrix calculations in nVidia graphic cards. The graphical processor unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model is used, called CUDA, to transparently scale multicore parallelism in hardware. This technology gives computationally intensive applications access to the processing power of the nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies, to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphic cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.
LAMMPS strong scaling performance optimization on Blue Gene/Q

DOE Office of Scientific and Technical Information (OSTI.GOV)

Coffman, Paul; Jiang, Wei; Romero, Nichols A.

2014-11-12

LAMMPS "Large-scale Atomic/Molecular Massively Parallel Simulator" is an open-source molecular dynamics package from Sandia National Laboratories. Significant performance improvements in strong-scaling and time-to-solution for this application on IBM's Blue Gene/Q have been achieved through computational optimizations of the OpenMP versions of the short-range Lennard-Jones term of the CHARMM force field and the long-range Coulombic interaction implemented with the PPPM (particle-particle-particle mesh) algorithm, enhanced by runtime parameter settings controlling thread utilization. Additionally, MPI communication performance improvements were made to the PPPM calculation by re-engineering the parallel 3D FFT to use MPICH collectives instead of point-to-point. Performance testing was done using an 8.4-million atom simulation scaling up to 16 racks on the Mira system at Argonne Leadership Computing Facility (ALCF). Speedups resulting from this effort were in some cases over 2x.
Phase-unwrapping algorithm by a rounding-least-squares approach

NASA Astrophysics Data System (ADS)

Juarez-Salazar, Rigoberto; Robledo-Sanchez, Carlos; Guerrero-Sanchez, Fermin

2014-02-01

A simple and efficient phase-unwrapping algorithm based on a rounding procedure and a global least-squares minimization is proposed. Instead of processing the gradient of the wrapped phase, this algorithm operates over the gradient of the phase jumps by a robust and noniterative scheme. Thus, the residue-spreading and over-smoothing effects are reduced. The algorithm's performance is compared with four well-known phase-unwrapping methods: minimum cost network flow (MCNF), fast Fourier transform (FFT), quality-guided, and branch-cut. A computer simulation and experimental results show that the proposed algorithm reaches a higher accuracy level than the MCNF method with a low computing time similar to that of the FFT phase-unwrapping method. Moreover, since the proposed algorithm is simple, fast, and user-free, it could be used in metrological interferometric and fringe-projection automatic real-time applications.
Superfast algorithms of multidimensional discrete k-wave transforms and Volterra filtering based on superfast radon transform

NASA Astrophysics Data System (ADS)

Labunets, Valeri G.; Labunets-Rundblad, Ekaterina V.; Astola, Jaakko T.

2001-12-01

Fast algorithms for a wide class of non-separable n-dimensional (nD) discrete unitary K-transforms (DKT) are introduced. They need fewer 1D DKTs than the classical radix-2 FFT-type approach. The method utilizes a decomposition of the nD K-transform into the product of a new nD discrete Radon transform and a set of parallel/independent 1D K-transforms. If the nD K-transform has a separable kernel (e.g., the case of the discrete Fourier transform), our approach decreases the multiplicative complexity by a factor of n compared to the classical row/column separable approach. It is well known that an n-th order Volterra filter of a one-dimensional signal can be evaluated by an appropriate nD linear convolution. This work describes a new superfast algorithm for Volterra filtering. The new approach is based on the superfast discrete Radon and Nussbaumer polynomial transforms.
A FFT-based formulation for efficient mechanical fields computation in isotropic and anisotropic periodic discrete dislocation dynamics

NASA Astrophysics Data System (ADS)

Bertin, N.; Upadhyay, M. V.; Pradalier, C.; Capolungo, L.

2015-09-01

In this paper, we propose a novel full-field approach based on the fast Fourier transform (FFT) technique to compute mechanical fields in periodic discrete dislocation dynamics (DDD) simulations for anisotropic materials: the DDD-FFT approach. By coupling the FFT-based approach to the discrete continuous model, the present approach benefits from the high computational efficiency of the FFT algorithm, while allowing for a discrete representation of dislocation lines. It is demonstrated that the computational time associated with the new DDD-FFT approach is significantly lower than that of current DDD approaches when a large number of dislocation segments are involved, for both isotropic and anisotropic elasticity. Furthermore, for fine Fourier grids, the treatment of anisotropic elasticity comes at a similar computational cost to that of isotropic simulation. Thus, the proposed approach paves the way towards achieving scale transition from DDD to mesoscale plasticity, especially due to the method’s ability to incorporate inhomogeneous elasticity.
Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank

NASA Astrophysics Data System (ADS)

Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.

2014-05-01

Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes including the well-known true time-delay and the phased array beamformers have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of a high computational complexity and frequency-dependent far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency independent RF beams at an order of magnitude lower multiplier complexity compared to FFT or FIR filter based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth, and fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize high-precision recursive filter structures necessary for real-time beamforming, at RF radio bandwidths, are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There exists native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold (B = N Fclk/2) bandwidth compared to the maximum clock frequency Fclk Hz of the digital VLSI platform under consideration.
Such increase in bandwidth is achieved without use of polyphase signal processing or time-interleaved ADC methods. That is, all digital processors operate at the same Fclk clock frequency without phasing, while wideband operation is achieved by sub-sampling of narrower sub-bands at the RF channelizer outputs.
Improved FFT-based numerical inversion of Laplace transforms via fast Hartley transform algorithm

NASA Technical Reports Server (NTRS)

Hwang, Chyi; Lu, Ming-Jeng; Shieh, Leang S.

1991-01-01

The disadvantages of numerical inversion of the Laplace transform via the conventional fast Fourier transform (FFT) are identified and an improved method is presented to remedy them. The improved method is based on introducing a new integration step length Delta(omega) = pi/mT for trapezoidal-rule approximation of the Bromwich integral, in which a new parameter, m, is introduced for controlling the accuracy of the numerical integration. Naturally, this method leads to multiple sets of complex FFT computations. A new inversion formula is derived such that N equally spaced samples of the inverse Laplace transform function can be obtained by (m/2) + 1 sets of N-point complex FFT computations or by m sets of real fast Hartley transform (FHT) computations.
A 640-MHz 32-megachannel real-time polyphase-FFT spectrum analyzer

NASA Technical Reports Server (NTRS)

Zimmerman, G. A.; Garyantes, M. F.; Grimm, M. J.; Charny, B.

1991-01-01

A polyphase fast Fourier transform (FFT) spectrum analyzer being designed for NASA's Search for Extraterrestrial Intelligence (SETI) Sky Survey at the Jet Propulsion Laboratory is described. By replacing the time domain multiplicative window preprocessing with polyphase filter processing, much of the processing loss of windowed FFTs can be eliminated. Polyphase coefficient memory costs are minimized by effective use of run length compression. Finite word length effects are analyzed, producing a balanced system with 8 bit inputs, 16 bit fixed point polyphase arithmetic, and 24 bit fixed point FFT arithmetic. Fixed point renormalization midway through the computation is seen to be naturally accommodated by the matrix FFT algorithm proposed. Simulation results validate the finite word length arithmetic analysis and the renormalization technique.
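The idea of replacing a multiplicative window with polyphase filtering ahead of the FFT can be sketched in a few lines. This is only an illustrative floating-point sketch, not the fixed-point design described in the record: the function name `polyphase_fft`, the windowed-sinc prototype filter, and the channel and tap counts are all assumptions chosen for the example.

```python
import numpy as np

def polyphase_fft(x, n_chan, n_taps):
    """Return one n_chan-point spectrum from n_chan * n_taps input samples.

    Hypothetical helper: the prototype filter below (Hamming-windowed sinc)
    is an assumed choice, not the filter used in the analyzer described above.
    """
    m = n_chan * n_taps
    # Prototype low-pass filter whose passband is about one channel wide.
    t = (np.arange(m) - (m - 1) / 2) / n_chan
    prototype = np.sinc(t) * np.hamming(m)
    # Weight the long input block with the prototype filter.
    weighted = np.asarray(x[:m]) * prototype
    # Fold the n_taps segments onto n_chan branches and sum them,
    # then take a single n_chan-point FFT of the folded block.
    folded = weighted.reshape(n_taps, n_chan).sum(axis=0)
    return np.fft.fft(folded)

# Usage: a complex tone centered on channel 5 of a 32-channel analyzer.
n_chan, n_taps = 32, 4
n = np.arange(n_chan * n_taps)
tone = np.exp(2j * np.pi * 5 * n / n_chan)
spectrum = polyphase_fft(tone, n_chan, n_taps)
```

Compared with a plain windowed FFT, the filter spans several FFT lengths, which is what allows a flatter channel response for the same FFT size.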

Pre-Hardware Optimization and Implementation Of Fast Optics Closed Control Loop Algorithms

NASA Technical Reports Server (NTRS)

Kizhner, Semion; Lyon, Richard G.; Herman, Jay R.; Abuhassan, Nader

2004-01-01

One of the main heritage tools used in scientific and engineering data spectrum analysis is the Fourier Integral Transform and its high performance digital equivalent - the Fast Fourier Transform (FFT). The FFT is particularly useful in two-dimensional (2-D) image processing (FFT2) within optical systems control. However, timing constraints of a fast optics closed control loop would require a supercomputer to run the software implementation of the FFT2 and its inverse, as well as other representative image processing algorithms, such as numerical image folding and fringe feature extraction. A laboratory supercomputer is not always available even for ground operations and is not feasible for a flight project. However, the computationally intensive algorithms still warrant alternative implementation using reconfigurable computing technologies (RC) such as Digital Signal Processors (DSP) and Field Programmable Gate Arrays (FPGA), which provide low cost compact super-computing capabilities. We present a new RC hardware implementation and utilization architecture that significantly reduces the computational complexity of a few basic image-processing algorithms, such as FFT2, image folding and phase diversity for the NASA Solar Viewing Interferometer Prototype (SVIP) using a cluster of DSPs and FPGAs. The DSP cluster utilization architecture also assures avoidance of a single point of failure, while using commercially available hardware. This, combined with the control algorithms pre-hardware optimization, for the first time allows construction of image-based 800 Hertz (Hz) optics closed control loops on-board a spacecraft, based on the SVIP ground instrument. That spacecraft is the proposed Earth Atmosphere Solar Occultation Imager (EASI) to study greenhouse gases CO2, C2H, H2O, O3, O2, N2O from Lagrange-2 point in space.
This paper provides an advanced insight into a new type of science capabilities for future space exploration missions based on on-board image processing for control and for robotics missions using vision sensors. It presents a top-level description of technologies required for the design and construction of SVIP and EASI and to advance the spatial-spectral imaging and large-scale space interferometry science and engineering.
A Parameterized Pattern-Error Objective for Large-Scale Phase-Only Array Pattern Design

DTIC Science & Technology

2016-03-21

12 4.4 Example 3: Sector Beam w/ Nonuniform Amplitude...fixed uniform amplitude illumination, phase-only optimization can also find application to arrays with fixed but nonuniform tapers. Such fixed tapers...arbitrary element locations nonuniform FFT algorithms exist [43–45] that have the same asymptotic complexity as the conventional FFT, although the
Computing the Power-Density Spectrum for an Engineering Model

NASA Technical Reports Server (NTRS)

Dunn, H. J.

1982-01-01

Computer program for calculating power-density spectrum (PDS) from data base generated by Advanced Continuous Simulation Language (ACSL) uses algorithm that employs fast Fourier transform (FFT) to calculate PDS of variable. Accomplished by first estimating autocovariance function of variable and then taking FFT of smoothed autocovariance function to obtain PDS. Fast-Fourier-transform technique conserves computer resources.
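The two-step procedure in this record (estimate the autocovariance, smooth it, then take its FFT) can be sketched as follows. The function name, the biased autocovariance estimator, and the Hann lag window are assumptions made for illustration; the record does not say which estimator or smoothing the ACSL-based program uses.

```python
import numpy as np

def power_density_spectrum(x, dt, max_lag):
    """Sketch of a PDS estimate from a smoothed autocovariance (assumed details)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Step 1: biased autocovariance estimate for lags 0..max_lag.
    acov = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])
    # Step 2: smooth with a Hann lag window that tapers to zero at max_lag.
    lag_window = 0.5 * (1.0 + np.cos(np.pi * np.arange(max_lag + 1) / max_lag))
    smoothed = acov * lag_window
    # Step 3: extend to an even (symmetric) sequence of length 2 * max_lag
    # so that its FFT is real, then transform.
    symmetric = np.concatenate([smoothed, smoothed[-2:0:-1]])
    pds = np.fft.rfft(symmetric).real * dt
    freqs = np.fft.rfftfreq(len(symmetric), d=dt)
    return freqs, pds

# Usage: a 50 Hz sine sampled at 1 kHz.
dt = 0.001
t = np.arange(4096) * dt
freqs, pds = power_density_spectrum(np.sin(2 * np.pi * 50.0 * t), dt, max_lag=200)
peak_frequency = freqs[np.argmax(pds)]
```

Limiting the transform to `max_lag` lags is what keeps the FFT short relative to the data record, which is one plausible reading of the remark that the technique conserves computer resources.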
Imaging the eye fundus with real-time en-face spectral domain optical coherence tomography

PubMed Central

Bradu, Adrian; Podoleanu, Adrian Gh.

2014-01-01

Real-time display of processed en-face spectral domain optical coherence tomography (SD-OCT) images is important for diagnosis. However, due to many steps of data processing requirements, such as Fast Fourier transformation (FFT), data re-sampling, spectral shaping, apodization, zero padding, followed by software cut of the 3D volume acquired to produce an en-face slice, conventional high-speed SD-OCT cannot render an en-face OCT image in real time. Recently we demonstrated a Master/Slave (MS)-OCT method that is highly parallelizable, as it provides reflectivity values of points at depth within an A-scan in parallel. This allows direct production of en-face images. In addition, the MS-OCT method does not require data linearization, which further simplifies the processing. The computation in our previous paper was however time consuming. In this paper we present an optimized algorithm that can be used to provide en-face MS-OCT images much quicker. Using such an algorithm we demonstrate around 10 times faster production of sets of en-face OCT images than previously obtained as well as simultaneous real-time display of up to 4 en-face OCT images of 200 × 200 pixels from the fovea and the optic nerve of a volunteer. We also demonstrate 3D and B-scan OCT images obtained from sets of MS-OCT C-scans, i.e. with no FFT and no intermediate step of generation of A-scans. PMID:24761303
Solving Coupled Gross-Pitaevskii Equations on a Cluster of PlayStation 3 Computers

NASA Astrophysics Data System (ADS)

Edwards, Mark; Heward, Jeffrey; Clark, C. W.

2009-05-01

At Georgia Southern University we have constructed an 8+1-node cluster of Sony PlayStation 3 (PS3) computers with the intention of using this computing resource to solve problems related to the behavior of ultra-cold atoms in general with a particular emphasis on studying Bose-Bose and Bose-Fermi mixtures confined in optical lattices. As a first project that uses this computing resource, we have implemented a parallel solver of the coupled time-dependent, one-dimensional Gross-Pitaevskii (TDGP) equations. These equations govern the behavior of dual-species bosonic mixtures. We chose the split-operator/FFT method to solve the coupled 1D TDGP equations. The fast Fourier transform component of this solver can be readily parallelized on the PS3 CPU known as the Cell Broadband Engine (CellBE). Each CellBE chip contains a single 64-bit PowerPC Processor Element known as the PPE and eight "Synergistic Processor Elements" identified as the SPEs. We report on this algorithm and compare its performance to a non-parallel solver as applied to modeling evaporative cooling in dual-species bosonic mixtures.
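For readers unfamiliar with the split-operator/FFT method, a minimal serial sketch is given below. It is deliberately simplified and is not the authors' solver: it treats a single-component (not coupled) 1D Gross-Pitaevskii equation in dimensionless units, i dψ/dt = [-(1/2) d²/dx² + V(x) + g|ψ|²] ψ, and the function name, grid, and parameters are assumptions for illustration.

```python
import numpy as np

def split_step(psi, x, dt, g, potential, n_steps):
    """Advance psi by n_steps of Strang splitting (assumed dimensionless units)."""
    dx = x[1] - x[0]
    # Angular wavenumbers matching the FFT ordering on a periodic grid.
    k = 2.0 * np.pi * np.fft.fftfreq(len(x), d=dx)
    kinetic_phase = np.exp(-0.5j * k**2 * dt)
    for _ in range(n_steps):
        # Half step with the potential and nonlinear terms (diagonal in x).
        psi = psi * np.exp(-0.5j * dt * (potential + g * np.abs(psi)**2))
        # Full kinetic step (diagonal in k), applied via FFT and inverse FFT.
        psi = np.fft.ifft(kinetic_phase * np.fft.fft(psi))
        # Second half step with the updated density.
        psi = psi * np.exp(-0.5j * dt * (potential + g * np.abs(psi)**2))
    return psi

# Usage: a Gaussian wave packet in a harmonic trap.
x = np.linspace(-10.0, 10.0, 256, endpoint=False)
dx = x[1] - x[0]
psi0 = np.pi**-0.25 * np.exp(-0.5 * x**2).astype(complex)
psi1 = split_step(psi0, x, dt=0.001, g=1.0, potential=0.5 * x**2, n_steps=100)
norm0 = np.sum(np.abs(psi0)**2) * dx
norm1 = np.sum(np.abs(psi1)**2) * dx
```

Every step is either a pointwise phase multiplication or an FFT pair, so the norm is preserved to rounding error; the FFT pair is the part the record describes as readily parallelized.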
A GaAs vector processor based on parallel RISC microprocessors

NASA Astrophysics Data System (ADS)

Misko, Tim A.; Rasset, Terry L.

A vector processor architecture based on the development of a 32-bit microprocessor using gallium arsenide (GaAs) technology has been developed. The McDonnell Douglas vector processor (MVP) will be fabricated completely from GaAs digital integrated circuits. The MVP architecture includes a vector memory of 1 megabyte, a parallel bus architecture with eight processing elements connected in parallel, and a control processor. The processing elements consist of a reduced instruction set CPU (RISC) with four floating-point coprocessor units and necessary memory interface functions. This architecture has been simulated for several benchmark programs including complex fast Fourier transform (FFT), complex inner product, trigonometric functions, and sort-merge routine. The results of this study indicate that the MVP can process a 1024-point complex FFT at a speed of 112 microsec (389 megaflops) while consuming approximately 618 W of power in a volume of approximately 0.1 ft-cubed.
Alternative techniques for high-resolution spectral estimation of spectrally encoded endoscopy

NASA Astrophysics Data System (ADS)

Mousavi, Mahta; Duan, Lian; Javidi, Tara; Ellerbee, Audrey K.

2015-09-01

Spectrally encoded endoscopy (SEE) is a minimally invasive optical imaging modality capable of fast confocal imaging of internal tissue structures. Modern SEE systems use coherent sources to image deep within the tissue and data are processed similarly to optical coherence tomography (OCT); however, standard processing of SEE data via the Fast Fourier Transform (FFT) leads to degradation of the axial resolution as the bandwidth of the source shrinks, resulting in a well-known trade-off between speed and axial resolution. Recognizing the limitation of FFT as a general spectral estimation algorithm to only take into account samples collected by the detector, in this work we investigate alternative high-resolution spectral estimation algorithms that exploit information such as sparsity and the general region position of the bulk sample to improve the axial resolution of processed SEE data. We validate the performance of these algorithms using both MATLAB simulations and analysis of experimental results generated from a home-built OCT system to simulate an SEE system with variable scan rates. Our results open a new door towards using non-FFT algorithms to generate higher quality (i.e., higher resolution) SEE images at correspondingly fast scan rates, resulting in systems that are more accurate and more comfortable for patients due to the reduced image time.
pyDockWEB: a web server for rigid-body protein-protein docking using electrostatics and desolvation scoring.

PubMed

Jiménez-García, Brian; Pons, Carles; Fernández-Recio, Juan

2013-07-01

pyDockWEB is a web server for the rigid-body docking prediction of protein-protein complex structures using a new version of the pyDock scoring algorithm. We use here a new custom parallel FTDock implementation, with adjusted grid size for optimal FFT calculations, and a new version of pyDock, which dramatically speeds up calculations while keeping the same predictive accuracy. Given the 3D coordinates of two interacting proteins, pyDockWEB returns the best docking orientations as scored mainly by electrostatics and desolvation energy. The server does not require registration by the user and is freely accessible for academics at http://life.bsc.es/servlet/pydock. Supplementary data are available at Bioinformatics online.
An investigation of pulsar searching techniques with the fast folding algorithm

NASA Astrophysics Data System (ADS)

Cameron, A. D.; Barr, E. D.; Champion, D. J.; Kramer, M.; Zhu, W. W.

2017-06-01

Here, we present an in-depth study of the behaviour of the fast folding algorithm (FFA), an alternative pulsar searching technique to the fast Fourier transform (FFT). Weaknesses in the FFT, including a susceptibility to red noise, leave it insensitive to pulsars with long rotational periods (P > 1 s). This sensitivity gap has the potential to bias our understanding of the period distribution of the pulsar population. The FFA, a time-domain based pulsar searching technique, has the potential to overcome some of these biases. Modern distributed-computing frameworks now allow for the application of this algorithm to all-sky blind pulsar surveys for the first time. However, many aspects of the behaviour of this search technique remain poorly understood, including its responsiveness to variations in pulse shape and the presence of red noise. Using a custom CPU-based implementation of the FFA, ffancy, we have conducted an in-depth study into the behaviour of the FFA in both an ideal, white noise regime as well as a trial on observational data from the High Time Resolution Universe South Low Latitude pulsar survey, including a comparison to the behaviour of the FFT. We are able to both confirm and expand upon earlier studies that demonstrate the ability of the FFA to outperform the FFT under ideal white noise conditions, and demonstrate a significant improvement in sensitivity to long-period pulsars in real observational data through the use of the FFA.
The fast decoding of Reed-Solomon codes using number theoretic transforms

NASA Technical Reports Server (NTRS)

Reed, I. S.; Welch, L. R.; Truong, T. K.

1976-01-01

It is shown that Reed-Solomon (RS) codes can be encoded and decoded by using a fast Fourier transform (FFT) algorithm over finite fields. The arithmetic utilized to perform these transforms requires only integer additions, circular shifts and a minimum number of integer multiplications. The computing time of this transform encoder-decoder for RS codes is less than the time of the standard method for RS codes. More generally, the field GF(q) is also considered, where q is a prime of the form K × 2^n + 1 and K and n are integers. GF(q) can be used to decode very long RS codes by an efficient FFT algorithm with an improvement in the number of symbols. It is shown that a radix-8 FFT algorithm over GF(q^2) can be utilized to encode and decode very long RS codes with a large number of symbols. For eight symbols in GF(q^2), this transform over GF(q^2) can be made simpler than any other known number theoretic transform with a similar capability. Of special interest is the decoding of a 16-tuple RS code with four errors.
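An FFT over a finite field GF(q), with q a prime of the form K × 2^n + 1, can be illustrated with a small number theoretic transform. The sketch below is an assumption-laden toy, not the radix-8 GF(q^2) algorithm of the record: it uses a radix-2 recursion over GF(257) (K = 1, n = 8), the primitive root 3, and hypothetical function names.

```python
Q = 257          # prime of the form K * 2**n + 1, here with K = 1 and n = 8
GENERATOR = 3    # a primitive root modulo 257

def ntt(values, root):
    """Radix-2 transform over GF(Q); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    # Transform the even- and odd-indexed halves with the squared root.
    even = ntt(values[0::2], root * root % Q)
    odd = ntt(values[1::2], root * root % Q)
    out = [0] * n
    twiddle = 1
    for i in range(n // 2):
        t = twiddle * odd[i] % Q
        out[i] = (even[i] + t) % Q
        # root**(n/2) equals -1 in GF(Q), hence the subtraction.
        out[i + n // 2] = (even[i] - t) % Q
        twiddle = twiddle * root % Q
    return out

def forward(values):
    """Forward transform using a primitive len(values)-th root of unity."""
    root = pow(GENERATOR, (Q - 1) // len(values), Q)
    return ntt(values, root)

def inverse(spectrum):
    """Inverse transform: use the inverse root and scale by 1/n modulo Q."""
    n = len(spectrum)
    root = pow(GENERATOR, (Q - 1) // n, Q)
    inv_root = pow(root, Q - 2, Q)
    inv_n = pow(n, Q - 2, Q)
    return [v * inv_n % Q for v in ntt(spectrum, inv_root)]

# Usage: cyclic convolution of two short sequences via pointwise products.
a = [1, 2, 3, 0, 0, 0, 0, 0]
b = [4, 5, 0, 0, 0, 0, 0, 0]
product = inverse([u * v % Q for u, v in zip(forward(a), forward(b))])
```

All arithmetic is exact integer arithmetic modulo Q, so there is no rounding error, which is the property that makes such transforms attractive for coding applications.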
Application of multiple signal classification algorithm to frequency estimation in coherent dual-frequency lidar

NASA Astrophysics Data System (ADS)

Li, Ruixiao; Li, Kun; Zhao, Changming

2018-01-01

Coherent dual-frequency Lidar (CDFL) is a new development of Lidar which dramatically enhances the ability to decrease the influence of atmospheric interference by using a dual-frequency laser to measure range and velocity with high precision. Based on the nature of CDFL signals, we propose to apply the multiple signal classification (MUSIC) algorithm in place of the fast Fourier transform (FFT) to estimate the phase differences in dual-frequency Lidar. In the presence of Gaussian white noise, the simulation results show that the signal peaks are more evident when using the MUSIC algorithm instead of the FFT under low signal-to-noise ratio (SNR) conditions, which helps to improve the precision of range and velocity detection, especially for long-distance measurement systems.
Error and Complexity Analysis for a Collocation-Grid-Projection Plus Precorrected-FFT Algorithm for Solving Potential Integral Equations with Laplace or Helmholtz Kernels

NASA Technical Reports Server (NTRS)

Phillips, J. R.

1996-01-01

In this paper we derive error bounds for a collocation-grid-projection scheme tuned for use in multilevel methods for solving boundary-element discretizations of potential integral equations. The grid-projection scheme is then combined with a precorrected FFT style multilevel method for solving potential integral equations with 1/r and e^(ikr)/r kernels. A complexity analysis of this combined method is given to show that for homogeneous problems, the method is order n log n, nearly independent of the kernel. In addition, it is shown analytically and experimentally that for an inhomogeneity generated by a very finely discretized surface, the combined method slows to order n^(4/3). Finally, examples are given to show that the collocation-based grid-projection plus precorrected-FFT scheme is competitive with fast-multipole algorithms when considering realistic problems and 1/r kernels, but can be used over a range of spatial frequencies with only a small performance penalty.
Delineation of First-Order Elastic Property Closures for Hexagonal Metals Using Fast Fourier Transforms

PubMed Central

Landry, Nicholas W.; Knezevic, Marko

2015-01-01

Property closures are envelopes representing the complete set of theoretically feasible macroscopic property combinations for a given material system. In this paper, we present a computational procedure based on fast Fourier transforms (FFTs) for delineation of elastic property closures for hexagonal close packed (HCP) metals. The procedure consists of building a database of non-zero Fourier transforms for each component of the elastic stiffness tensor, calculating the Fourier transforms of orientation distribution functions (ODFs), and calculating the ODF-to-elastic property bounds in the Fourier space. In earlier studies, HCP closures were computed using the generalized spherical harmonics (GSH) representation and an assumption of orthotropic sample symmetry; here, the FFT approach allowed us to successfully calculate the closures for a range of HCP metals without invoking any sample symmetry assumption. The methodology presented here makes possible, for the first time, the computation of property closures involving normal-shear coupling stiffness coefficients. We found that the representation of these property linkages using FFTs needs more terms than GSH representations. However, the use of FFT representations reduces the computational time involved in producing the property closures due to the use of fast FFT algorithms. Moreover, FFT algorithms are readily available as opposed to GSH codes. PMID:28793566
Proceedings of the second SISAL users' conference

DOE Office of Scientific and Technical Information (OSTI.GOV)

Feo, J T; Frerking, C; Miller, P J

1992-12-01

This report contains papers on the following topics: A sisal code for computing the fourier transform on S_N; five ways to fill your knapsack; simulating material dislocation motion in sisal; candis as an interface for sisal; parallelisation and performance of the burg algorithm on a shared-memory multiprocessor; use of genetic algorithm in sisal to solve the file design problem; implementing FFTs in sisal; programming and evaluating the performance of signal processing applications in the sisal programming environment; sisal and Von Neumann-based languages: translation and intercommunication; an IF2 code generator for ADAM architecture; program partitioning for NUMA multiprocessor computer systems; mapping functional parallelism on distributed memory machines; implicit array copying: prevention is better than cure; mathematical syntax for sisal; an approach for optimizing recursive functions; implementing arrays in sisal 2.0; Fol: an object oriented extension to the sisal language; twine: a portable, extensible sisal execution kernel; and investigating the memory performance of the optimizing sisal compiler.
SU-E-T-91: Accuracy of Dose Calculation Algorithms for Patients Undergoing Stereotactic Ablative Radiotherapy

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tajaldeen, A; Ramachandran, P; Geso, M

2015-06-15

Purpose: The purpose of this study was to investigate and quantify the variation in dose distributions in small field lung cancer radiotherapy using seven different dose calculation algorithms. Methods: The study was performed in 21 lung cancer patients who underwent Stereotactic Ablative Body Radiotherapy (SABR). Two different methods, (i) same dose coverage to the target volume (named as same dose method) and (ii) same monitor units in all algorithms (named as same monitor units), were used for studying the performance of seven different dose calculation algorithms in XiO and Eclipse treatment planning systems. The seven dose calculation algorithms include Superposition, Fast superposition, Fast Fourier Transform (FFT) Convolution, Clarkson, Anisotropic Analytic Algorithm (AAA), Acuros XB and pencil beam (PB) algorithms. Prior to this, a phantom study was performed to assess the accuracy of these algorithms. Superposition algorithm was used as a reference algorithm in this study. The treatment plans were compared using different dosimetric parameters including conformity, heterogeneity and dose fall off index. In addition to this, the dose to critical structures like lungs, heart, oesophagus and spinal cord were also studied. Statistical analysis was performed using Prism software. Results: The mean±stdev of the conformity index for Superposition, Fast superposition, Clarkson and FFT convolution algorithms were 1.29±0.13, 1.31±0.16, 2.2±0.7 and 2.17±0.59 respectively whereas for AAA, pencil beam and Acuros XB were 1.4±0.27, 1.66±0.27 and 1.35±0.24 respectively. Conclusion: Our study showed significant variations among the seven different algorithms. Superposition and Acuros XB algorithms showed similar values for most of the dosimetric parameters. Clarkson, FFT convolution and pencil beam algorithms showed large differences as compared to superposition algorithms.
Based on our study, we recommend Superposition and Acuros XB algorithms as the first choice of algorithms in lung cancer radiotherapy involving small fields. However, further investigation by Monte Carlo simulation is required to confirm our results.
Textural analyses of carbon fiber materials by 2D-FFT of complex images obtained by high frequency eddy current imaging (HF-ECI)

NASA Astrophysics Data System (ADS)

Schulze, Martin H.; Heuer, Henning

2012-04-01

Carbon fiber based materials are used in many lightweight applications in aeronautical, automotive, machine and civil engineering. With the increasing automation in the production process of CFRP laminates, a manual optical inspection of each resin transfer molding (RTM) layer is not practicable. Because they are limited to surface inspection, optical systems cannot observe the quality parameters of multilayer three-dimensional materials. Imaging Eddy-Current (EC) NDT is the only suitable inspection method for non-resin materials in the textile state that allows an inspection of surface and hidden layers in parallel. The HF-ECI method has the capability to measure layer displacements (misaligned angle orientations) and gap sizes in a multilayer carbon fiber structure. The EC technique uses the variation of the electrical conductivity of carbon based materials to obtain material properties. Besides the determination of textural parameters like layer orientation and gap sizes between rovings, the detection of foreign polymer particles and fuzzy balls or the visualization of undulations can be done by the method. For all of these typical parameters, an imaging classification process chain has been developed, based on a high-resolution directional EC imaging device named EddyCus® MPECS and a 2D-FFT with adapted preprocessing algorithms.
Using single buffers and data reorganization to implement a multi-megasample fast Fourier transform

NASA Technical Reports Server (NTRS)

Brown, R. D.

1992-01-01

Data ordering in large fast Fourier transforms (FFT's) is both conceptually and implementationally difficult. Described here is a method of visualizing data orderings as vectors of address bits, which enables the engineer to use more efficient data orderings and reduce double-buffer memory designs. Also detailed are the difficulties and algorithmic solutions involved in FFT lengths up to 4 megasamples (Msamples) and sample rates up to 80 MHz.
Performance of FFT methods in local gravity field modelling

NASA Technical Reports Server (NTRS)

Forsberg, Rene; Solheim, Dag

1989-01-01

Fast Fourier transform (FFT) methods provide a fast and efficient means of processing large amounts of gravity or geoid data in local gravity field modelling. The FFT methods, however, have a number of theoretical and practical limitations, especially the use of flat-earth approximation, and the requirements for gridded data. In spite of this the method often yields excellent results in practice when compared to other more rigorous (and computationally expensive) methods, such as least-squares collocation. The good performance of the FFT methods illustrates that the theoretical approximations are offset by the capability of taking into account more data in larger areas, especially important for geoid predictions. For best results good data gridding algorithms are essential. In practice truncated collocation approaches may be used. For large areas at high latitudes the gridding must be done using suitable map projections such as UTM, to avoid trivial errors caused by the meridian convergence. The FFT methods are compared to ground truth data in New Mexico (xi, eta from delta g), Scandinavia (N from delta g, the geoid fits to 15 cm over 2000 km), and areas of the Atlantic (delta g from satellite altimetry using Wiener filtering). In all cases the FFT methods yield results comparable or superior to other methods.
Canonic FFT flow graphs for real-valued even/odd symmetric inputs

NASA Astrophysics Data System (ADS)

Lao, Yingjie; Parhi, Keshab K.

2017-12-01

Canonic real-valued fast Fourier transform (RFFT) has been proposed to reduce the arithmetic complexity by eliminating redundancies. In a canonic N-point RFFT, the number of signal values at each stage is canonic with respect to the number of signal values, i.e., N. The major advantage of the canonic RFFTs is that these require the least number of butterfly operations and only real datapaths when mapped to architectures. In this paper, we consider the FFT computation whose inputs are not only real but also even/odd symmetric, which indeed leads to the well-known discrete cosine and sine transforms (DCTs and DSTs). Novel algorithms for generating the flow graphs of canonic RFFTs with even/odd symmetric inputs are proposed. It is shown that the proposed algorithms lead to canonic structures with N/2 + 1 signal values at each stage for an N-point real even symmetric FFT (REFFT) or N/2 - 1 signal values at each stage for an N-point real odd symmetric FFT (ROFFT). In order to remove butterfly operations, several twiddle factor transformations are proposed in this paper. We also discuss the design of canonic REFFT for any composite length. Performances of the canonic REFFT/ROFFT are also discussed. It is shown that the flow graph of canonic REFFT/ROFFT has fewer interconnections, fewer butterfly operations, and fewer twiddle factor operations, compared to prior works.
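The counts N/2 + 1 and N/2 - 1 can be checked numerically. The sketch below is not the flow-graph construction of the paper; it only assumes the usual symmetry definitions x[n] = x[N-n] (even) and x[n] = -x[N-n] (odd), builds an N-point input from its free values, and confirms that the ordinary FFT reduces to a cosine sum (purely real) or a sine sum (purely imaginary). The helper names `even_extend` and `odd_extend` are hypothetical.

```python
import numpy as np

def even_extend(h):
    """N-point even-symmetric sequence from its N/2+1 free values h[0..N/2]."""
    return np.concatenate([h, h[-2:0:-1]])

def odd_extend(g):
    """N-point odd-symmetric sequence from its N/2-1 free values g[1..N/2-1]."""
    return np.concatenate([[0.0], g, [0.0], -g[::-1]])

N, M = 16, 8                       # M = N/2
rng = np.random.default_rng(0)
h = rng.standard_normal(M + 1)     # N/2 + 1 independent values
g = rng.standard_normal(M - 1)     # N/2 - 1 independent values

X_even = np.fft.fft(even_extend(h))
X_odd = np.fft.fft(odd_extend(g))

k = np.arange(N)
n = np.arange(1, M)
# Cosine-sum (DCT-like) and sine-sum (DST-like) forms of the same spectra.
cos_sum = h[0] + (-1.0) ** k * h[M] + 2 * np.cos(2 * np.pi * np.outer(k, n) / N) @ h[1:M]
sin_sum = -2 * np.sin(2 * np.pi * np.outer(k, n) / N) @ g

print(np.allclose(X_even.imag, 0), np.allclose(X_even.real, cos_sum))
print(np.allclose(X_odd.real, 0), np.allclose(X_odd.imag, sin_sum))
```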
A study of GPS measurement errors due to noise and multipath interference for CGADS

NASA Technical Reports Server (NTRS)

Axelrad, Penina; MacDoran, Peter F.; Comp, Christopher J.

1996-01-01

This report describes a study performed by the Colorado Center for Astrodynamics Research (CCAR) on GPS measurement errors in the Codeless GPS Attitude Determination System (CGADS) due to noise and multipath interference. Preliminary simulation models of the CGADS receiver and orbital multipath are described. The standard FFT algorithm for processing the codeless data is described, and two alternative algorithms, an auto-regressive/least squares (AR-LS) method and a combined adaptive notch filter/least squares (ANF-ALS) method, are also presented. Effects of system noise, quantization, baseband frequency selection, and Doppler rates on the accuracy of phase estimates with each of the processing methods are shown. Typical electrical phase errors for the AR-LS method are 0.2 degrees, compared to 0.3 and 0.5 degrees for the FFT and ANF-ALS algorithms, respectively. Doppler rate was found to have the largest effect on the performance.

A finite element conjugate gradient FFT method for scattering

NASA Technical Reports Server (NTRS)

Collins, Jeffery D.; Zapp, John; Hsa, Chang-Yu; Volakis, John L.

1990-01-01

An extension of a two dimensional formulation is presented for a three dimensional body of revolution. With the introduction of a Fourier expansion of the vector electric and magnetic fields, a coupled two dimensional system is generated and solved via the finite element method. An exact boundary condition is employed to terminate the mesh and the fast Fourier transform (FFT) is used to evaluate the boundary integrals for low O(n) memory demand when an iterative solution algorithm is used. By virtue of the finite element method, the algorithm is applicable to structures of arbitrary material composition. Several improvements to the two dimensional algorithm are also described. These include: (1) modifications for terminating the mesh at circular boundaries without distorting the convolutionality of the boundary integrals; (2) the development of nonproprietary mesh generation routines for two dimensional applications; (3) the development of preprocessors for interfacing SDRC IDEAS with the main algorithm; and (4) the development of post-processing algorithms based on the public domain package GRAFIC to generate two and three dimensional gray level and color field maps.
Designing Waveform Sets with Good Correlation and Stopband Properties for MIMO Radar via the Gradient-Based Method

PubMed Central

Tang, Liang; Zhu, Yongfeng; Fu, Qiang

2017-01-01

Waveform sets with good correlation and/or stopband properties have received extensive attention and been widely used in multiple-input multiple-output (MIMO) radar. In this paper, we aim at designing unimodular waveform sets with good correlation and stopband properties. To formulate the problem, we construct two criteria to measure the correlation and stopband properties and then establish an unconstrained problem in the frequency domain. After deducing the phase gradient and the step size, an efficient gradient-based algorithm with monotonicity is proposed to minimize the objective function directly. For the design problem without considering the correlation weights, we develop a simplified algorithm, which only requires a few fast Fourier transform (FFT) operations and is more efficient. Because both of the algorithms can be implemented via the FFT operations and the Hadamard product, they are computationally efficient and can be used to design waveform sets with a large waveform number and waveform length. Numerical experiments show that the proposed algorithms can provide better performance than the state-of-the-art algorithms in terms of the computational complexity. PMID:28468308
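The remark that the correlation criteria can be evaluated with FFT operations and a Hadamard (element-wise) product can be illustrated as follows. This is only a sketch of computing the aperiodic autocorrelation of one unimodular sequence, not the gradient-based design algorithm of the paper; the function name `autocorr_fft` and the random test waveform are assumptions.

```python
import numpy as np

def autocorr_fft(x):
    """Aperiodic autocorrelation r[k] = sum_n x[n+k] * conj(x[n]), k = 0..N-1."""
    n = x.size
    X = np.fft.fft(x, n=2 * n)           # zero-pad so circular lags do not wrap
    r = np.fft.ifft(X * np.conj(X))      # Hadamard product in the frequency domain
    return r[:n]

rng = np.random.default_rng(1)
phase = rng.uniform(0, 2 * np.pi, 64)
x = np.exp(1j * phase)                   # unimodular: |x[n]| = 1 for all n

r_fft = autocorr_fft(x)
r_direct = np.correlate(x, x, mode="full")[x.size - 1:]   # O(N^2) reference
print(np.allclose(r_fft, r_direct), np.isclose(r_fft[0].real, x.size))
```

The FFT route costs O(N log N) per waveform instead of O(N^2), which is what makes long sequences and large sets practical.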
Designing Waveform Sets with Good Correlation and Stopband Properties for MIMO Radar via the Gradient-Based Method.

PubMed

Tang, Liang; Zhu, Yongfeng; Fu, Qiang

2017-05-01

Waveform sets with good correlation and/or stopband properties have received extensive attention and been widely used in multiple-input multiple-output (MIMO) radar. In this paper, we aim at designing unimodular waveform sets with good correlation and stopband properties. To formulate the problem, we construct two criteria to measure the correlation and stopband properties and then establish an unconstrained problem in the frequency domain. After deducing the phase gradient and the step size, an efficient gradient-based algorithm with monotonicity is proposed to minimize the objective function directly. For the design problem without considering the correlation weights, we develop a simplified algorithm, which only requires a few fast Fourier transform (FFT) operations and is more efficient. Because both of the algorithms can be implemented via the FFT operations and the Hadamard product, they are computationally efficient and can be used to design waveform sets with a large waveform number and waveform length. Numerical experiments show that the proposed algorithms can provide better performance than the state-of-the-art algorithms in terms of the computational complexity.
Distortion analysis of subband adaptive filtering methods for FMRI active noise control systems.

PubMed

Milani, Ali A; Panahi, Issa M; Briggs, Richard

2007-01-01

Delayless subband filtering structure, as a high performance frequency domain filtering technique, is used for canceling broadband fMRI noise (8 kHz bandwidth). In this method, adaptive filtering is done in subbands and the coefficients of the main canceling filter are computed by stacking the subband weights together. There are two types of stacking methods called FFT and FFT-2. In this paper, we analyze the distortion introduced by these two stacking methods. The effect of the stacking distortion on the performance of different adaptive filters in FXLMS algorithm with non-minimum phase secondary path is explored. The investigation is done for different adaptive algorithms (nLMS, APA and RLS), different weight stacking methods, and different number of subbands.
Propane spectral resolution enhancement by the maximum entropy method

NASA Technical Reports Server (NTRS)

Bonavito, N. L.; Stewart, K. P.; Hurley, E. J.; Yeh, K. C.; Inguva, R.

1990-01-01

The Burg algorithm for maximum entropy power spectral density estimation is applied to a time series of data obtained from a Michelson interferometer and compared with a standard FFT estimate for resolution capability. The propane transmittance spectrum was estimated by use of the FFT with a 2 to the 18th data sample interferogram, giving a maximum unapodized resolution of 0.06/cm. This estimate was then interpolated by zero filling an additional 2 to the 18th points, and the final resolution was taken to be 0.06/cm. Comparison of the maximum entropy method (MEM) estimate with the FFT was made over a 45/cm region of the spectrum for several increasing record lengths of interferogram data beginning at 2 to the 10th. It is found that over this region the MEM estimate with 2 to the 16th data samples is in close agreement with the FFT estimate using 2 to the 18th samples.
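Zero filling interpolates the spectrum onto a finer grid without adding information, which is why the resolution quoted above is unchanged after padding. A minimal sketch, using a small synthetic record rather than the 2 to the 18th point interferogram, and a hypothetical helper name `zero_fill_spectrum`:

```python
import numpy as np

def zero_fill_spectrum(samples, factor=2):
    """FFT of the record after appending zeros up to factor * len(samples) points."""
    return np.fft.fft(samples, n=factor * samples.size)

rng = np.random.default_rng(2)
record = rng.standard_normal(1024)       # stand-in for an interferogram
coarse = np.fft.fft(record)
fine = zero_fill_spectrum(record, factor=2)

# Every second point of the zero-filled spectrum reproduces the original
# spectrum exactly; the points in between are interpolated values.
print(fine.size == 2 * coarse.size, np.allclose(fine[::2], coarse))
```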
Communication Studies of DMP and SMP Machines

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Biswas, Rupak; Chancellor, Marisa K. (Technical Monitor)

1997-01-01

Understanding the interplay between machines and problems is key to obtaining high performance on parallel machines. This paper investigates the interplay between programming paradigms and communication capabilities of parallel machines. In particular, we explicate the communication capabilities of the IBM SP-2 distributed-memory multiprocessor and the SGI PowerCHALLENGEarray symmetric multiprocessor. Two benchmark problems of bitonic sorting and Fast Fourier Transform are selected for experiments. Communication-efficient algorithms are developed to exploit the overlapping capabilities of the machines. Programs are written in Message-Passing Interface for portability and identical codes are used for both machines. Various data sizes and message sizes are used to test the machines' communication capabilities. Experimental results indicate that the communication performance of the multiprocessors is consistent with the size of messages. The SP-2 is sensitive to message size but yields a much higher communication overlapping because of the communication co-processor. The PowerCHALLENGEarray is not highly sensitive to message size and yields a low communication overlapping. Bitonic sorting yields lower performance compared to FFT due to a smaller computation-to-communication ratio.
Wireless Intrusion Detection

DTIC Science & Technology

2007-03-01

32 4.4 Algorithm Pseudo-Code ... 34 4.5 WIND Interface With a...difference estimates of xc temporal derivatives, or by using a polynomial fit to the previous values of xc. 34 4.4 ALGORITHM PSEUDO-CODE Pseudo ...Phase Shift Keying DQPSK Differential Quadrature Phase Shift Keying EVM Error Vector Magnitude FFT Fast Fourier Transform FPGA Field Programmable
Microprocessor implementation of an FFT for ionospheric VLF observations

NASA Technical Reports Server (NTRS)

Elvidge, J.; Kintner, P.; Holzworth, R.

1984-01-01

A fast Fourier transform algorithm is implemented on a CMOS microprocessor for application to very low-frequency electric fields (less than 10 kHz) sensed on high-altitude scientific balloons. Two FFT's are calculated simultaneously by associating them with conjugate symmetric and conjugate antisymmetric results. One goal of the system was to detect spectral signatures associated with fast time variations present in natural signals such as whistlers and chorus. Although a full evaluation of the system was not possible for operational reasons, a measure of the system's success has been defined and evaluated.
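The trick of obtaining two FFTs from one transform can be sketched as below. This assumes two real-valued input channels packed into the real and imaginary parts of one complex sequence, which is the standard form of the technique; it is not the microprocessor code of the report, and the name `two_real_ffts` is hypothetical.

```python
import numpy as np

def two_real_ffts(a, b):
    """FFTs of two real sequences from a single complex FFT of a + 1j*b."""
    Z = np.fft.fft(a + 1j * b)
    Z_rev = np.conj(np.roll(Z[::-1], 1))   # conj(Z[(-k) mod N])
    A = 0.5 * (Z + Z_rev)                  # conjugate-symmetric part -> FFT of a
    B = -0.5j * (Z - Z_rev)                # conjugate-antisymmetric part -> FFT of b
    return A, B

rng = np.random.default_rng(3)
a = rng.standard_normal(32)
b = rng.standard_normal(32)
A, B = two_real_ffts(a, b)
print(np.allclose(A, np.fft.fft(a)), np.allclose(B, np.fft.fft(b)))
```

The separation step costs only O(N) additions, so the second spectrum comes almost for free.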
A fast finite-difference algorithm for topology optimization of permanent magnets

NASA Astrophysics Data System (ADS)

Abert, Claas; Huber, Christian; Bruckner, Florian; Vogler, Christoph; Wautischer, Gregor; Suess, Dieter

2017-09-01

We present a finite-difference method for the topology optimization of permanent magnets that is based on the fast-Fourier-transform (FFT) accelerated computation of the stray-field. The presented method employs the density approach for topology optimization and uses an adjoint method for the gradient computation. Comparison to various state-of-the-art finite-element implementations shows a superior performance and accuracy. Moreover, the presented method is very flexible and easy to implement due to various preexisting FFT stray-field implementations that can be used.
Compressive-sampling-based positioning in wireless body area networks.

PubMed

Banitalebi-Dehkordi, Mehdi; Abouei, Jamshid; Plataniotis, Konstantinos N

2014-01-01

Recent achievements in wireless technologies have opened up enormous opportunities for the implementation of ubiquitous health care systems in providing rich contextual information and warning mechanisms against abnormal conditions. This helps with the automatic and remote monitoring/tracking of patients in hospitals and with the supervision of fragile, elderly people in their own domestic environment through automatic systems to handle the remote drug delivery. This paper presents a new modeling and analysis framework for the multipatient positioning in a wireless body area network (WBAN) which exploits the spatial sparsity of patients and a sparse fast Fourier transform (FFT)-based feature extraction mechanism for monitoring of patients and for reporting the movement tracking to a central database server containing patient vital information. The main goal of this paper is to achieve a high degree of accuracy and resolution in the patient localization with less computational complexity in the implementation using the compressive sensing theory. We represent the patients' positions as a sparse vector obtained by the discrete segmentation of the patient movement space in a circular grid. To estimate this vector, a compressive-sampling-based two-level FFT (CS-2FFT) feature vector is synthesized for each received signal from the biosensors embedded on the patient's body at each grid point. This feature extraction process benefits from the combination of both short-time and long-time properties of the received signals. The robustness of the proposed CS-2FFT-based algorithm in terms of the average positioning error is numerically evaluated using the realistic parameters in the IEEE 802.15.6-WBAN standard in the presence of additive white Gaussian noise. Due to the circular grid pattern and the CS-2FFT feature extraction method, the proposed scheme represents a significant reduction in the computational complexity, while improving the level of the resolution and the localization accuracy when compared to some classical CS-based positioning algorithms.
Porting ONETEP to graphical processing unit-based coprocessors. 1. FFT box operations.

PubMed

Wilkinson, Karl; Skylaris, Chris-Kriton

2013-10-30

We present the first graphical processing unit (GPU) coprocessor-enabled version of the Order-N Electronic Total Energy Package (ONETEP) code for linear-scaling first principles quantum mechanical calculations on materials. This work focuses on porting to the GPU the parts of the code that involve atom-localized fast Fourier transform (FFT) operations. These are among the most computationally intensive parts of the code and are used in core algorithms such as the calculation of the charge density, the local potential integrals, the kinetic energy integrals, and the nonorthogonal generalized Wannier function gradient. We have found that direct porting of the isolated FFT operations did not provide any benefit. Instead, it was necessary to tailor the port to each of the aforementioned algorithms to optimize data transfer to and from the GPU. A detailed discussion of the methods used and tests of the resulting performance are presented, which show that individual steps in the relevant algorithms are accelerated by a significant amount. However, the transfer of data between the GPU and host machine is a significant bottleneck in the reported version of the code. In addition, an initial investigation into a dynamic precision scheme for the ONETEP energy calculation has been performed to take advantage of the enhanced single precision capabilities of GPUs. The methods used here result in no disruption to the existing code base. Furthermore, as the developments reported here concern the core algorithms, they will benefit the full range of ONETEP functionality. Our use of a directive-based programming model ensures portability to other forms of coprocessors and will allow this work to form the basis of future developments to the code designed to support emerging high-performance computing platforms. Copyright © 2013 Wiley Periodicals, Inc.
Precise and fast spatial-frequency analysis using the iterative local Fourier transform.

PubMed

Lee, Sukmock; Choi, Heejoo; Kim, Dae Wook

2016-09-19

The use of the discrete Fourier transform has decreased since the introduction of the fast Fourier transform (fFT), which is a numerically efficient computing process. This paper presents the iterative local Fourier transform (ilFT), a set of new processing algorithms that iteratively apply the discrete Fourier transform within a local and optimal frequency domain. The new technique achieves 2¹⁰ times higher frequency resolution than the fFT within a comparable computation time. The method's superb computing efficiency, high resolution, spectrum zoom-in capability, and overall performance are evaluated and compared to other advanced high-resolution Fourier transform techniques, such as the fFT combined with several fitting methods. The effectiveness of the ilFT is demonstrated through the data analysis of a set of Talbot self-images (1280 × 1024 pixels) obtained with an experimental setup using grating in a diverging beam produced by a coherent point source.
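The general idea of iterating a discrete Fourier transform over a shrinking local frequency window can be sketched as follows. This is an illustrative zoom-in scheme under simple assumptions (a single complex tone, a fixed 21-point local grid, a tenfold refinement per pass), not the authors' ilFT algorithm; `local_dft` and `refine_peak` are hypothetical names.

```python
import numpy as np

def local_dft(x, freqs):
    """Direct DFT of x evaluated only at the given frequencies (cycles/sample)."""
    n = np.arange(x.size)
    return np.exp(-2j * np.pi * np.outer(freqs, n)) @ x

def refine_peak(x, n_iter=4, points=21):
    """Locate a spectral peak: coarse FFT first, then repeated local DFT zooms."""
    N = x.size
    k = np.argmax(np.abs(np.fft.fft(x))[: N // 2])   # coarse estimate, spacing 1/N
    f0, half_width = k / N, 1.0 / N
    for _ in range(n_iter):
        freqs = np.linspace(f0 - half_width, f0 + half_width, points)
        f0 = freqs[np.argmax(np.abs(local_dft(x, freqs)))]
        half_width = freqs[1] - freqs[0]             # zoom in around the new peak
    return f0

N, f_true = 256, 0.1234
x = np.exp(2j * np.pi * f_true * np.arange(N))
f_est = refine_peak(x)
print(abs(f_est - f_true) < 1e-5)
```

Each pass evaluates only a few frequencies, so the cost stays at O(points × N) per iteration while the frequency grid becomes ten times finer.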
A parallel and modular deformable cell Car-Parrinello code

NASA Astrophysics Data System (ADS)

Cavazzoni, Carlo; Chiarotti, Guido L.

1999-12-01

We have developed a modular parallel code implementing the Car-Parrinello [Phys. Rev. Lett. 55 (1985) 2471] algorithm including the variable cell dynamics [Europhys. Lett. 36 (1994) 345; J. Phys. Chem. Solids 56 (1995) 510]. Our code is written in Fortran 90, and makes use of some new programming concepts like encapsulation, data abstraction and data hiding. The code has a multi-layer hierarchical structure with tree-like dependences among modules. The modules include not only the variables but also the methods acting on them, in an object oriented fashion. The modular structure allows easier code maintenance, development and debugging procedures, and is suitable for a developer team. The layer structure permits high portability. The code displays an almost linear speed-up in a wide range of number of processors independently of the architecture. Super-linear speed up is obtained with a "smart" Fast Fourier Transform (FFT) that uses the available memory on the single node (increasing for a fixed problem with the number of processing elements) as temporary buffer to store wave function transforms. This code has been used to simulate water and ammonia at giant planet conditions for systems as large as 64 molecules for ~50 ps.
Lorentz boosted frame simulation technique in Particle-in-cell methods

NASA Astrophysics Data System (ADS)

Yu, Peicheng

In this dissertation, we systematically explore the use of a simulation method for modeling laser wakefield acceleration (LWFA) using the particle-in-cell (PIC) method, called the Lorentz boosted frame technique. In the lab frame the plasma length is typically four orders of magnitude larger than the laser pulse length. Using this technique, simulations are performed in a Lorentz boosted frame in which the plasma length, which is Lorentz contracted, and the laser length, which is Lorentz expanded, are now comparable. This technique has the potential to reduce the computational needs of a LWFA simulation by more than four orders of magnitude, and is useful if there is no or negligible reflection of the laser in the lab frame. To realize the potential of Lorentz boosted frame simulations for LWFA, the first obstacle to overcome is a robust and violent numerical instability, called the Numerical Cerenkov Instability (NCI), that leads to unphysical energy exchange between relativistically drifting particles and their radiation. This leads to unphysical noise that dwarfs the real physical processes. In this dissertation, we first present a theoretical analysis of this instability, and show that the NCI comes from the unphysical coupling of the electromagnetic (EM) modes and Langmuir modes (both main and aliasing) of the relativistically drifting plasma. We then discuss the methods to eliminate them. However, the use of FFTs can lead to parallel scalability issues when there are many more cells along the drifting direction than in the transverse direction(s). We then describe an algorithm that has the potential to address this issue by using a higher order finite difference operator for the derivative in the plasma drifting direction, while using the standard second order operators in the transverse direction(s). 
The NCI for this algorithm is analyzed, and it is shown that the NCI can be eliminated using the same strategies that were used for the hybrid FFT/Finite Difference solver. This scheme also requires a current correction and filtering which require FFTs. However, we show that in this case the FFTs can be done locally on each parallel partition. We also describe how the use of the hybrid FFT/Finite Difference or the hybrid higher order finite difference/second order finite difference methods permits combining the Lorentz boosted frame simulation technique with another "speed up" technique, called the quasi-3D algorithm, to gain unprecedented speed up for the LWFA simulations. In the quasi-3D algorithm the fields and currents are defined on an r-z PIC grid and expanded in azimuthal harmonics. The expansion is truncated with only a few modes so it has computational needs similar to those of a 2D r-z PIC code. We show that NCI has similar properties in r-z as in z-x slab geometry and show that the same strategies for eliminating the NCI in Cartesian geometry can be effective for the quasi-3D algorithm leading to the possibility of unprecedented speed up. We also describe a new code called UPIC-EMMA that is based on a fully spectral (FFT) solver. The new code includes implementation of a moving antenna that can launch lasers in the boosted frame. We also describe how the new hybrid algorithms were implemented into OSIRIS. Examples of LWFA using the boosted frame using both UPIC-EMMA and OSIRIS are given, including the comparisons against the lab frame results. We also describe how to efficiently obtain the boosted frame simulations data that are needed to generate the transformed lab frame data, as well as how to use a moving window in the boosted frame. The NCI is also a major issue for modeling relativistic shocks with PIC algorithm. In relativistic shock simulations two counter-propagating plasmas drifting at relativistic speeds are colliding against each other. 
We show that the strategies for eliminating the NCI developed in this dissertation enable such simulations to be run for much longer simulation times, which should open a path for major advances in relativistic shock research. (Abstract shortened by ProQuest.)
Super-resolution Doppler beam sharpening method using fast iterative adaptive approach-based spectral estimation

NASA Astrophysics Data System (ADS)

Mao, Deqing; Zhang, Yin; Zhang, Yongchao; Huang, Yulin; Yang, Jianyu

2018-01-01

Doppler beam sharpening (DBS) is a critical technology for airborne radar ground mapping in the forward-squint region. In conventional DBS technology, the narrow-band Doppler filter groups formed by the fast Fourier transform (FFT) method suffer from low spectral resolution and high side lobe levels. The iterative adaptive approach (IAA), based on the weighted least squares (WLS), is applied to the DBS imaging applications, forming narrower Doppler filter groups than the FFT with lower side lobe levels. Regrettably, the IAA is iterative, and requires matrix multiplication and inverse operation when forming the covariance matrix, its inverse and traversing the WLS estimate for each sampling point, resulting in a notably high, cubic-time computational complexity. We propose a fast IAA (FIAA)-based super-resolution DBS imaging method, taking advantage of the rich matrix structures of the classical narrow-band filtering. First, we formulate the covariance matrix via the FFT instead of the conventional matrix multiplication operation, based on the typical Fourier structure of the steering matrix. Then, by exploiting the Gohberg-Semencul representation, the inverse of the Toeplitz covariance matrix is computed by the celebrated Levinson-Durbin (LD) and Toeplitz-vector algorithm. Finally, the FFT and fast Toeplitz-vector algorithm are further used to traverse the WLS estimates based on the data-dependent trigonometric polynomials. The method uses the Hermitian feature of the echo autocorrelation matrix R to achieve its fast solution and uses the Toeplitz structure of R to realize its fast inversion. The proposed method enjoys a lower computational complexity without performance loss compared with the conventional IAA-based super-resolution DBS imaging method. The results based on simulations and measured data verify the imaging performance and the operational efficiency.
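One building block mentioned above, the fast Toeplitz-vector product, can be sketched with the standard circulant-embedding technique. This is a generic illustration and not the FIAA implementation; `toeplitz_matvec_fft` and `toeplitz_dense` are hypothetical helper names, and the Hermitian test matrix is synthetic.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """y = T @ x for the Toeplitz matrix with first column c and first row r."""
    n = c.size
    # Embed T in a 2n x 2n circulant matrix whose first column is v.
    v = np.concatenate([c, [0.0], r[:0:-1]])
    y = np.fft.ifft(np.fft.fft(v) * np.fft.fft(x, n=2 * n))
    return y[:n]                      # the first n entries equal T @ x

def toeplitz_dense(c, r):
    """Dense reference matrix, T[i, j] = c[i-j] if i >= j else r[j-i]."""
    n = c.size
    i, j = np.indices((n, n))
    return np.where(i >= j, c[np.abs(i - j)], r[np.abs(i - j)])

rng = np.random.default_rng(4)
n = 50
c = rng.standard_normal(n) + 1j * rng.standard_normal(n)
c[0] = c[0].real                      # real diagonal, as for a covariance matrix
r = np.conj(c)                        # Hermitian Toeplitz: first row = conj(first column)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)

y_fast = toeplitz_matvec_fft(c, r, x)
y_ref = toeplitz_dense(c, r) @ x
print(np.allclose(y_fast, y_ref))
```

The product costs O(n log n) instead of O(n^2), and the dense matrix never has to be formed.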
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors

NASA Astrophysics Data System (ADS)

Linderman, R.; Spetka, S.; Fitzgerald, D.; Emeny, S.

The Physically-Constrained Iterative Deconvolution (PCID) image deblurring code is being ported to heterogeneous networks of multi-core systems, including Intel Xeons and IBM Cell Broadband Engines. This paper reports results from experiments using the JAWS supercomputer at MHPCC (60 TFLOPS of dual-dual Xeon nodes linked with Infiniband) and the Cell Cluster at AFRL in Rome, NY. The Cell Cluster has 52 TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes Infiniband, 10 Gigabit Ethernet and 1 Gigabit Ethernet to each of the 336 PS3s. The results compare approaches to parallelizing FFT executions across the Xeons and the Cell's Synergistic Processing Elements (SPEs) for frame-level image processing. The experiments included Intel's Performance Primitives and Math Kernel Library, FFTW3.2, and Carnegie Mellon's SPIRAL. Optimization of FFTs in the PCID code led to a decrease in relative processing time for FFTs. Profiling PCID version 6.2, about one year ago, showed the 13 functions that accounted for the highest percentage of processing were all FFT processing functions. They accounted for over 88% of processing time in one run on Xeons. FFT optimizations led to improvement in the current PCID version 8.0. A recent profile showed that only two of the 19 functions with the highest processing time were FFT processing functions. Timing measurements showed that FFT processing for PCID version 8.0 has been reduced to less than 19% of overall processing time. We are working toward a goal of scaling to 200-400 cores per job (1-2 imagery frames/core). Running a pair of cores on each set of frames reduces latency by implementing parallel FFT processing. Our current results show scaling well out to 100 pairs of cores. 
These results support the next higher level of parallelism in PCID, where groups of several hundred frames each producing one resolved image are sent to cliques of several hundred cores in a round robin fashion. Current efforts toward further performance enhancement for PCID are shifting toward using the Playstations in conjunction with the Xeons to take advantage of outstanding price/performance as well as the Flops/Watt cost advantage. We are fine-tuning the PCID parallelization strategy to balance processing over Xeons and Cell BEs to find an optimal partitioning of PCID over the heterogeneous processors. A high performance information management system that exploits native Infiniband multicast is used to improve latency among the head nodes. Using a publication/subscription oriented information management system to implement a unified communications platform makes runs on large HPCs with thousands of intercommunicating cores more flexible and more fault tolerant. It features a loose coupling of publishers to subscribers through intervening brokers. We are also working on enhancing performance for both Xeons and Cell BEs, by moving selected operations to single precision. Techniques for adapting the code to single precision and performance results are reported.
Systolic Algorithms for Imaging from Space

DTIC Science & Technology

1989-07-31

on a keystone or trapezoidal grid [Arikan & Munson, 1987]. The image reconstruction algorithm then simply applies an inverse 2-D FFT to the stored...rithm composed of groups of point targets, and we determined the effects of windowing and incorporation of a Jacobian weighting factor [Arikan ...the impulse response of the desired filter [Arikan & Munson, 1989]. The necessary filtering is then accomplished through the physical mechanism of the
AUTOMATIC GENERATION OF FFT FOR TRANSLATIONS OF MULTIPOLE EXPANSIONS IN SPHERICAL HARMONICS

PubMed Central

Mirkovic, Dragan; Pettitt, B. Montgomery; Johnsson, S. Lennart

2009-01-01

The fast multipole method (FMM) is an efficient algorithm for calculating electrostatic interactions in molecular simulations and a promising alternative to Ewald summation methods. Translation of multipole expansion in spherical harmonics is the most important operation of the fast multipole method and the fast Fourier transform (FFT) acceleration of this operation is among the fastest methods of improving its performance. The technique relies on highly optimized implementation of fast Fourier transform routines for the desired expansion sizes, which need to incorporate the knowledge of symmetries and zero elements in the input arrays. Here a method is presented for automatic generation of such, highly optimized, routines. PMID:19763233
Accelerating and focusing protein-protein docking correlations using multi-dimensional rotational FFT generating functions.

PubMed

Ritchie, David W; Kozakov, Dima; Vajda, Sandor

2008-09-01

Predicting how proteins interact at the molecular level is a computationally intensive task. Many protein docking algorithms begin by using fast Fourier transform (FFT) correlation techniques to find putative rigid body docking orientations. Most such approaches use 3D Cartesian grids and are therefore limited to computing three dimensional (3D) translational correlations. However, translational FFTs can speed up the calculation in only three of the six rigid body degrees of freedom, and they cannot easily incorporate prior knowledge about a complex to focus and hence further accelerate the calculation. Furthermore, several groups have developed multi-term interaction potentials and others use multi-copy approaches to simulate protein flexibility, which both add to the computational cost of FFT-based docking algorithms. Hence there is a need to develop more powerful and more versatile FFT docking techniques. This article presents a closed-form 6D spherical polar Fourier correlation expression from which arbitrary multi-dimensional multi-property multi-resolution FFT correlations may be generated. The approach is demonstrated by calculating 1D, 3D and 5D rotational correlations of 3D shape and electrostatic expansions up to polynomial order L=30 on a 2 GB personal computer. As expected, 3D correlations are found to be considerably faster than 1D correlations but, surprisingly, 5D correlations are often slower than 3D correlations. Nonetheless, we show that 5D correlations will be advantageous when calculating multi-term knowledge-based interaction potentials. When docking the 84 complexes of the Protein Docking Benchmark, blind 3D shape plus electrostatic correlations take around 30 minutes on a contemporary personal computer and find acceptable solutions within the top 20 in 16 cases. Applying a simple angular constraint to focus the calculation around the receptor binding site produces acceptable solutions within the top 20 in 28 cases. 
Further constraining the search to the ligand binding site gives up to 48 solutions within the top 20, with calculation times of just a few minutes per complex. Hence the approach described provides a practical and fast tool for rigid body protein-protein docking, especially when prior knowledge about one or both binding sites is available.
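As an illustration of the conventional Cartesian-grid baseline that the abstract contrasts itself with (not the spherical polar method of the paper), a translational FFT correlation might be sketched as follows. The function name, the random toy grids, and the cyclic boundary handling are assumptions made for the example only.

```python
import numpy as np

def translational_correlation(receptor, ligand):
    """Score every cyclic translation t of `ligand` against `receptor`.

    score[t] = sum over x of receptor[x] * ligand[x - t], indices wrapping
    around the grid. Both inputs are real arrays of the same shape.
    """
    R = np.fft.fftn(receptor)
    L = np.fft.fftn(ligand)
    # Correlation theorem: multiply by the complex conjugate, then invert.
    return np.fft.ifftn(R * np.conj(L)).real

# Toy check: the "receptor" is the "ligand" moved by a known offset,
# so the best-scoring translation should recover that offset.
rng = np.random.default_rng(0)
ligand = rng.random((8, 8, 8))
receptor = np.roll(ligand, (2, 3, 1), axis=(0, 1, 2))
scores = translational_correlation(receptor, ligand)
best = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
```

A single pair of forward transforms and one inverse transform scores all translations at once, which is why only the three translational degrees of freedom are accelerated; each rotation still needs its own pass in this scheme.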
Analysis of wave motion in one-dimensional structures through fast-Fourier-transform-based wavelet finite element method

NASA Astrophysics Data System (ADS)

Shen, Wei; Li, Dongsheng; Zhang, Shuaifang; Ou, Jinping

2017-07-01

This paper presents a hybrid method that combines the B-spline wavelet on the interval (BSWI) finite element method and spectral analysis based on fast Fourier transform (FFT) to study wave propagation in One-Dimensional (1D) structures. BSWI scaling functions are utilized to approximate the theoretical wave solution in the spatial domain and construct a high-accuracy dynamic stiffness matrix. Dynamic reduction on element level is applied to eliminate the interior degrees of freedom of BSWI elements and substantially reduce the size of the system matrix. The dynamic equations of the system are then transformed and solved in the frequency domain through FFT-based spectral analysis which is especially suitable for parallel computation. A comparative analysis of four different finite element methods is conducted to demonstrate the validity and efficiency of the proposed method when utilized in high-frequency wave problems. Other numerical examples are utilized to simulate the influence of crack and delamination on wave propagation in 1D rods and beams. Finally, the errors caused by FFT and their corresponding solutions are presented.

A VLSI architecture for simplified arithmetic Fourier transform algorithm

NASA Technical Reports Server (NTRS)

Reed, Irving S.; Shih, Ming-Tang; Truong, T. K.; Hendon, E.; Tufts, D. W.

1992-01-01

The arithmetic Fourier transform (AFT) is a number-theoretic approach to Fourier analysis which has been shown to perform competitively with the classical FFT in terms of accuracy, complexity, and speed. Theorems developed in a previous paper for the AFT algorithm are used here to derive the original AFT algorithm which Bruns found in 1903. This is shown to yield an algorithm of less complexity and of improved performance over certain recent AFT algorithms. A VLSI architecture is suggested for this simplified AFT algorithm. This architecture uses a butterfly structure which reduces the number of additions by 25 percent of that used in the direct method.
Low-Power Analog Processing for Sensing Applications: Low-Frequency Harmonic Signal Classification

PubMed Central

White, Daniel J.; William, Peter E.; Hoffman, Michael W.; Balkir, Sina

2013-01-01

A low-power analog sensor front-end is described that reduces the energy required to extract environmental sensing spectral features without using Fast Fourier Transform (FFT) or wavelet transforms. An Analog Harmonic Transform (AHT) allows selection of only the features needed by the back-end, in contrast to the FFT, where all coefficients must be calculated simultaneously. We also show that the FFT coefficients can be easily calculated from the AHT results by a simple back-substitution. The scheme is tailored for low-power, parallel analog implementation in an integrated circuit (IC). Two different applications are tested with an ideal front-end model and compared to existing studies with the same data sets. Results from the military vehicle classification and identification of machine-bearing fault applications show that the front-end suits a wide range of harmonic signal sources. Analog-related errors are modeled to evaluate the feasibility of and to set design parameters for an IC implementation to maintain good system-level performance. Design of a preliminary transistor-level integrator circuit in a 0.13 μm complementary metal-oxide-silicon (CMOS) integrated circuit process showed the ability to use online self-calibration to reduce fabrication errors to a sufficiently low level. Estimated power dissipation is about three orders of magnitude less than similar vehicle classification systems that use commercially available FFT spectral extraction. PMID:23892765
Big Data in Reciprocal Space: Sliding Fast Fourier Transforms for Determining Periodicity

DOE PAGES

Vasudevan, Rama K.; Belianinov, Alex; Gianfrancesco, Anthony G.; ...

2015-03-03

Significant advances in atomically resolved imaging of crystals and surfaces have occurred in the last decade allowing unprecedented insight into local crystal structures and periodicity. Yet, the analysis of the long-range periodicity from the local imaging data, critical to correlation of functional properties and chemistry to the local crystallography, remains a challenge. Here, we introduce a Sliding Fast Fourier Transform (FFT) filter to analyze atomically resolved images of in-situ grown La5/8Ca3/8MnO3 films. We demonstrate the ability of sliding FFT algorithm to differentiate two sub-lattices, resulting from a mixed-terminated surface. Principal Component Analysis (PCA) and Independent Component Analysis (ICA) of the Sliding FFT dataset reveal the distinct changes in crystallography, step edges and boundaries between the multiple sub-lattices. The method is universal for images with any periodicity, and is especially amenable to atomically resolved probe and electron-microscopy data for rapid identification of the sub-lattices present.
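A minimal sketch of the sliding-window idea, assuming a square window, a fixed step, and a Hann taper (none of which are specified in the abstract), could look like this. The function name and the synthetic two-period test image are hypothetical; the PCA/ICA stage is not shown.

```python
import numpy as np

def sliding_fft(image, window, step):
    """Return the centered 2D FFT magnitude of each window x window patch.

    Output shape: (n_row_positions, n_col_positions, window, window).
    """
    rows = range(0, image.shape[0] - window + 1, step)
    cols = range(0, image.shape[1] - window + 1, step)
    # Hann taper (an assumption here) reduces leakage from patch edges.
    taper = np.hanning(window)[:, None] * np.hanning(window)[None, :]
    out = np.empty((len(rows), len(cols), window, window))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            patch = image[r:r + window, c:c + window] * taper
            out[i, j] = np.abs(np.fft.fftshift(np.fft.fft2(patch)))
    return out

# Synthetic image: period 4 pixels on the left half, 8 pixels on the right.
x = np.arange(64)
line = np.where(x < 32, np.cos(2 * np.pi * x / 4), np.cos(2 * np.pi * x / 8))
image = np.tile(line, (32, 1))
spectra = sliding_fft(image, window=16, step=16)

def peak_offset(spec):
    """Distance (in frequency bins) of the strongest peak from zero frequency."""
    _, c = np.unravel_index(np.argmax(spec), spec.shape)
    return abs(int(c) - spec.shape[1] // 2)
```

Each window position yields its own local spectrum, so regions with different periodicity show peaks at different frequency bins; the stack of spectra is the kind of dataset that a decomposition such as PCA could then be applied to.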
Big Data in Reciprocal Space: Sliding Fast Fourier Transforms for Determining Periodicity

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vasudevan, Rama K.; Belianinov, Alex; Gianfrancesco, Anthony G.

Significant advances in atomically resolved imaging of crystals and surfaces have occurred in the last decade allowing unprecedented insight into local crystal structures and periodicity. Yet, the analysis of the long-range periodicity from the local imaging data, critical to correlation of functional properties and chemistry to the local crystallography, remains a challenge. Here, we introduce a Sliding Fast Fourier Transform (FFT) filter to analyze atomically resolved images of in-situ grown La5/8Ca3/8MnO3 films. We demonstrate the ability of sliding FFT algorithm to differentiate two sub-lattices, resulting from a mixed-terminated surface. Principal Component Analysis (PCA) and Independent Component Analysis (ICA) of the Sliding FFT dataset reveal the distinct changes in crystallography, step edges and boundaries between the multiple sub-lattices. The method is universal for images with any periodicity, and is especially amenable to atomically resolved probe and electron-microscopy data for rapid identification of the sub-lattices present.
Practical Sub-Nyquist Sampling via Array-Based Compressed Sensing Receiver Architecture

DTIC Science & Technology

2016-07-10

different array elements at different sub-Nyquist sampling rates. Signal processing inspired by the sparse fast Fourier transform allows for signal...reconstruction algorithms can be computationally demanding (REF). The related sparse Fourier transform algorithms aim to reduce the processing time necessary to...compute the DFT of frequency-sparse signals [7]. In particular, the sparse fast Fourier transform (sFFT) achieves processing time better than the
Fractal dimension to classify the heart sound recordings with KNN and fuzzy c-mean clustering methods

NASA Astrophysics Data System (ADS)

Juniati, D.; Khotimah, C.; Wardani, D. E. K.; Budayasa, K.

2018-01-01

Heart abnormalities can be detected from heart sounds. A heart sound can be heard directly with a stethoscope or indirectly with a phonocardiograph, a machine that records heart sounds. This paper presents the implementation of fractal dimension theory to classify phonocardiograms as a normal heart sound, a murmur, or an extrasystole. The main algorithm used to calculate the fractal dimension was Higuchi's algorithm. Classification of the phonocardiograms took two steps: feature extraction and classification. For feature extraction, we used the Discrete Wavelet Transform (DWT) to decompose the heart sound signal into several sub-bands depending on the selected level. After the decomposition process, the signal was processed using the Fast Fourier Transform (FFT) to determine the spectral frequency. The fractal dimension of the FFT output was calculated using Higuchi's algorithm. The fractal dimensions of all phonocardiograms were classified with KNN and fuzzy c-means clustering methods. The best accuracy obtained was 86.17%, using feature extraction by DWT decomposition at level 3 with kmax = 50, 5-fold cross validation, and 5 neighbors in the KNN algorithm. For fuzzy c-means clustering, the accuracy was 78.56%.
Fast Particle Methods for Multiscale Phenomena Simulations

NASA Technical Reports Server (NTRS)

Koumoutsakos, P.; Wray, A.; Shariff, K.; Pohorille, Andrew

2000-01-01

We are developing particle methods oriented at improving computational modeling capabilities of multiscale physical phenomena in: (i) high Reynolds number unsteady vortical flows, (ii) particle-laden and interfacial flows, (iii) molecular dynamics studies of nanoscale droplets and studies of the structure, functions, and evolution of the earliest living cell. The unifying computational approach involves particle methods implemented in parallel computer architectures. The inherent adaptivity, robustness and efficiency of particle methods make them a multidisciplinary computational tool capable of bridging the gap between micro-scale and continuum flow simulations. Using efficient tree data structures, multipole expansion algorithms, and improved particle-grid interpolation, particle methods allow for simulations using millions of computational elements, making possible the resolution of a wide range of length and time scales of these important physical phenomena. The current challenges in these simulations are in: [i] the proper formulation of particle methods in the molecular and continuous level for the discretization of the governing equations; [ii] the resolution of the wide range of time and length scales governing the phenomena under investigation; [iii] the minimization of numerical artifacts that may interfere with the physics of the systems under consideration; [iv] the parallelization of processes such as tree traversal and grid-particle interpolations. We are conducting simulations using vortex methods, molecular dynamics and smooth particle hydrodynamics, exploiting their unifying concepts such as: the solution of the N-body problem in parallel computers, highly accurate particle-particle and grid-particle interpolations, parallel FFTs and the formulation of processes such as diffusion in the context of particle methods. This approach enables us to transcend among seemingly unrelated areas of research.
Hybrid MPI-OpenMP Parallelism in the ONETEP Linear-Scaling Electronic Structure Code: Application to the Delamination of Cellulose Nanofibrils.

PubMed

Wilkinson, Karl A; Hine, Nicholas D M; Skylaris, Chris-Kriton

2014-11-11

We present a hybrid MPI-OpenMP implementation of Linear-Scaling Density Functional Theory within the ONETEP code. We illustrate its performance on a range of high performance computing (HPC) platforms comprising shared-memory nodes with fast interconnect. Our work has focused on applying OpenMP parallelism to the routines which dominate the computational load, attempting where possible to parallelize different loops from those already parallelized within MPI. This includes 3D FFT box operations, sparse matrix algebra operations, calculation of integrals, and Ewald summation. While the underlying numerical methods are unchanged, these developments represent significant changes to the algorithms used within ONETEP to distribute the workload across CPU cores. The new hybrid code exhibits much-improved strong scaling relative to the MPI-only code and permits calculations with a much higher ratio of cores to atoms. These developments result in a significantly shorter time to solution than was possible using MPI alone and facilitate the application of the ONETEP code to systems larger than previously feasible. We illustrate this with benchmark calculations from an amyloid fibril trimer containing 41,907 atoms. We use the code to study the mechanism of delamination of cellulose nanofibrils when undergoing sonification, a process which is controlled by a large number of interactions that collectively determine the structural properties of the fibrils. Many energy evaluations were needed for these simulations, and as these systems comprise up to 21,276 atoms this would not have been feasible without the developments described here.
The fast Padé transform in magnetic resonance spectroscopy for potential improvements in early cancer diagnostics

NASA Astrophysics Data System (ADS)

Belkic, Dzevad; Belkic, Karen

2005-09-01

The convergence rates of the fast Padé transform (FPT) and the fast Fourier transform (FFT) are compared. These two estimators are used to process a time-signal encoded at 4 T by means of one-dimensional magnetic resonance spectroscopy (MRS) for healthy human brain. It is found systematically that at any level of truncation of the full signal length, the clinically relevant resonances that determine concentrations of metabolites in the investigated tissue are significantly better resolved in the FPT than in the FFT. In particular, the FPT has a better resolution than the FFT for the same signal length. Moreover, the FPT can achieve the same resolution as the FFT by using twice shorter signals. Implications of these findings for two-dimensional magnetic resonance spectroscopy as well as for two- and three-dimensional magnetic resonance spectroscopic imaging are highlighted. Self-contained cross-validation of all the results from the FPT is secured by using two conceptually different, equivalent algorithms (inside and outside the unit-circle), that are both valid in the entire complex frequency plane. The difference between the results from these two variants of the FPT is indistinguishable from the background noise. This constitutes robust error analysis of proven validity. The FPT shows promise in applications of MRS for early cancer detection.
Moho Modeling Using FFT Technique

NASA Astrophysics Data System (ADS)

Chen, Wenjin; Tenzer, Robert

2017-04-01

To improve the numerical efficiency, the Fast Fourier Transform (FFT) technique has been used in Parker-Oldenburg's method for a regional gravimetric Moho recovery, which assumes the Earth's planar approximation. In this study, we extend this definition for global applications while assuming a spherical approximation of the Earth. In particular, we utilize the FFT technique for a global Moho recovery, which is practically realized in two numerical steps. The gravimetric forward modeling is first applied, based on methods for a spherical harmonic analysis and synthesis of the global gravity and lithospheric structure models, to compute the refined gravity field, which comprises mainly the gravitational signature of the Moho geometry. The gravimetric inverse problem is then solved iteratively in order to determine the Moho depth. The application of FFT technique to both numerical steps reduces the computation time to a fraction of that required without applying this fast algorithm. The developed numerical procedures are used to estimate the Moho depth globally, and the gravimetric result is validated using the global (CRUST1.0) and regional (ESC) seismic Moho models. The comparison reveals a relatively good agreement between the gravimetric and seismic models, with the RMS of differences (of 4-5 km) at the level of expected uncertainties of used input datasets, and without the presence of significant systematic bias.
FPGA-based real-time swept-source OCT systems for B-scan live-streaming or volumetric imaging

NASA Astrophysics Data System (ADS)

Bandi, Vinzenz; Goette, Josef; Jacomet, Marcel; von Niederhäusern, Tim; Bachmann, Adrian H.; Duelk, Marcus

2013-03-01

We have developed a Swept-Source Optical Coherence Tomography (Ss-OCT) system with high-speed, real-time signal processing on a commercially available Data-Acquisition (DAQ) board with a Field-Programmable Gate Array (FPGA). The Ss-OCT system simultaneously acquires OCT and k-clock reference signals at 500MS/s. From the k-clock signal of each A-scan we extract a remap vector for the k-space linearization of the OCT signal. The linear but oversampled interpolation is followed by a 2048-point FFT, additional auxiliary computations, and a data transfer to a host computer for real-time, live-streaming of B-scan or volumetric C-scan OCT visualization. We achieve a 100 kHz A-scan rate by parallelization of our hardware algorithms, which run on standard and affordable, commercially available DAQ boards. Our main development tool for signal analysis as well as for hardware synthesis is MATLAB® with add-on toolboxes and 3rd-party tools.
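The remap-then-FFT chain described above can be sketched in software as follows. This is only an illustration under stated assumptions: the wavenumber of each sample (`k_sampled`) is taken as already known, whereas the paper derives a remap vector from the k-clock signal; plain linear interpolation and a Hann window stand in for whatever the FPGA implementation uses; the function name and the simulated sweep are hypothetical.

```python
import numpy as np

def process_a_scan(fringe, k_sampled, n_fft=2048):
    """Resample one fringe onto a uniform k grid, then FFT it to a depth profile.

    `k_sampled` must be increasing; it gives the wavenumber of each sample.
    """
    k_uniform = np.linspace(k_sampled[0], k_sampled[-1], n_fft)
    linearized = np.interp(k_uniform, k_sampled, fringe)  # k-space remap
    # Remove the mean and taper before the transform (assumed preprocessing).
    linearized = (linearized - linearized.mean()) * np.hanning(n_fft)
    return np.abs(np.fft.rfft(linearized))

# Simulated nonlinear sweep: samples are uniform in time, not in k.
t = np.linspace(0.0, 1.0, 4096)
k = t + 0.3 * t**2                             # monotonic but nonlinear k(t)
fringe = np.cos(2 * np.pi * 100 * k / k[-1])   # single reflector, 100 cycles in k
profile = process_a_scan(fringe, k)
peak_bin = int(np.argmax(profile))
```

Without the remap step the same reflector would smear across many depth bins, because the fringe frequency would drift over the sweep.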
PCTDSE: A parallel Cartesian-grid-based TDSE solver for modeling laser-atom interactions

NASA Astrophysics Data System (ADS)

Fu, Yongsheng; Zeng, Jiaolong; Yuan, Jianmin

2017-01-01

We present a parallel Cartesian-grid-based time-dependent Schrödinger equation (TDSE) solver for modeling laser-atom interactions. It can simulate the single-electron dynamics of atoms in arbitrary time-dependent vector potentials. We use a split-operator method combined with fast Fourier transforms (FFT), on a three-dimensional (3D) Cartesian grid. Parallelization is realized using a 2D decomposition strategy based on the Message Passing Interface (MPI) library, which results in a good parallel scaling on modern supercomputers. We give simple applications for the hydrogen atom using the benchmark problems coming from the references and obtain repeatable results. The extensions to other laser-atom systems are straightforward with minimal modifications of the source code.
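To make the split-operator idea concrete, here is a minimal one-dimensional sketch in atomic units. It is not the PCTDSE code: it uses a static harmonic potential instead of a laser field, runs serially, and all names and grid parameters are assumptions chosen for the example.

```python
import numpy as np

def split_operator_step(psi, potential, k, dt):
    """Advance psi by dt with exp(-iV dt/2) exp(-iT dt) exp(-iV dt/2)."""
    half_kick = np.exp(-0.5j * potential * dt)   # potential half step
    drift = np.exp(-0.5j * k**2 * dt)            # kinetic term, T = k^2 / 2
    psi = half_kick * psi
    psi = np.fft.ifft(drift * np.fft.fft(psi))   # kinetic step in k-space
    return half_kick * psi

n, box = 256, 20.0
x = np.linspace(-box / 2, box / 2, n, endpoint=False)
dx = x[1] - x[0]
k = 2 * np.pi * np.fft.fftfreq(n, d=dx)
potential = 0.5 * x**2                           # harmonic well (assumed)

# Displaced Gaussian: its mean position should oscillate as cos(t).
psi = np.exp(-0.5 * (x - 1.0)**2).astype(complex)
psi /= np.sqrt(np.sum(np.abs(psi)**2) * dx)

for _ in range(100):                             # propagate to t = 1
    psi = split_operator_step(psi, potential, k, dt=0.01)

norm = np.sum(np.abs(psi)**2) * dx
mean_x = np.sum(x * np.abs(psi)**2) * dx
```

Each factor is a pure phase, so the step preserves the norm; in three dimensions the FFT pair becomes a 3D transform, which is the part a 2D domain decomposition would distribute across MPI ranks.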
Multiple Frequency Contrast Source Inversion Method for Vertical Electromagnetic Profiling: 2D Simulation Results and Analyses

NASA Astrophysics Data System (ADS)

Li, Jinghe; Song, Linping; Liu, Qing Huo

2016-02-01

A simultaneous multiple frequency contrast source inversion (CSI) method is applied to reconstructing hydrocarbon reservoir targets in a complex multilayered medium in two dimensions. It simulates the effects of a salt dome sedimentary formation in the context of reservoir monitoring. In this method, the stabilized biconjugate-gradient fast Fourier transform (BCGS-FFT) algorithm is applied as a fast solver for the 2D volume integral equation for the forward computation. The inversion technique with CSI combines the efficient FFT algorithm to speed up the matrix-vector multiplication and the stable convergence of the simultaneous multiple frequency CSI in the iteration process. As a result, this method is capable of making quantitative conductivity image reconstruction effectively for large-scale electromagnetic oil exploration problems, including the vertical electromagnetic profiling (VEP) survey investigated here. A number of numerical examples have been demonstrated to validate the effectiveness and capacity of the simultaneous multiple frequency CSI method for a limited array view in VEP.
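The FFT speed-up of the matrix-vector product rests on the convolutional structure of the integral-equation kernel. A one-dimensional analogue, assuming a Toeplitz system matrix and using circulant embedding, is sketched below; the function name and random test data are hypothetical, and the BiCGSTAB iteration that would call this product is not shown.

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, x):
    """Multiply the n x n Toeplitz matrix T by x in O(n log n).

    T[i, j] = first_col[i - j] for i >= j and first_row[j - i] for j > i.
    """
    n = len(x)
    # First column of a 2n x 2n circulant matrix that contains T
    # in its top-left n x n block.
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    padded = np.concatenate([x, np.zeros(n)])
    # A circulant product is a cyclic convolution, i.e. a product of FFTs.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(padded))
    return y[:n]

# Compare against the dense product on a small random example.
rng = np.random.default_rng(1)
n = 16
first_col = rng.standard_normal(n)
first_row = rng.standard_normal(n)
first_row[0] = first_col[0]                      # shared diagonal entry
x = rng.standard_normal(n)

idx = np.arange(n)
diff = idx[:, None] - idx[None, :]
dense = np.where(diff >= 0, first_col[np.abs(diff)], first_row[np.abs(diff)])
fast = toeplitz_matvec_fft(first_col, first_row, x)
```

The dense matrix is built here only for checking; in an iterative solver only the kernel's first row and column would be stored, which is also what keeps memory use low for large grids.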
Medical image reconstruction algorithm based on the geometric information between sensor detector and ROI

NASA Astrophysics Data System (ADS)

Ham, Woonchul; Song, Chulgyu; Lee, Kangsan; Roh, Seungkuk

2016-05-01

In this paper, we propose a new image reconstruction algorithm that considers the geometric information of the acoustic sources and sensor detector, and we review the two-step reconstruction algorithm previously proposed, which is based on the geometrical information of the ROI (region of interest) and considers the finite size of the acoustic sensor element. In the new image reconstruction algorithm, not only is the mathematical analysis very simple, but the software implementation is also very easy because the FFT is not needed. We verify the effectiveness of the proposed reconstruction algorithm with simulation results obtained using the Matlab k-wave toolkit.
Modeling the Gross-Pitaevskii Equation Using the Quantum Lattice Gas Method

NASA Astrophysics Data System (ADS)

Oganesov, Armen

We present an improved Quantum Lattice Gas (QLG) algorithm as a mesoscopic unitary perturbative representation of the mean field Gross-Pitaevskii (GP) equation for Bose-Einstein Condensates (BECs). The method employs an interleaved sequence of unitary collide and stream operators. QLG is applicable to many different scalar potentials in the weak interaction regime and has been used to model the Korteweg-de Vries (KdV), Burgers and GP equations. It can be implemented on both quantum and classical computers and is extremely scalable. We present results for 1D soliton solutions with positive and negative internal interactions, as well as vector solitons with inelastic scattering. In higher dimensions we look at the behavior of vortex ring reconnection. A further improvement is considered with a proper operator splitting technique via a Fourier transformation. This is well suited to quantum computers since the quantum FFT is exponentially faster than its classical counterpart, which involves non-local data on the entire lattice (the quantum FFT is the backbone of the Shor algorithm for quantum factorization). We also present an imaginary time method in which we transform the Schrödinger equation into a diffusion equation for recovering ground state initial conditions of a quantum system suitable for the QLG algorithm.
Exploitation of Microdoppler and Multiple Scattering Phenomena for Radar Target Recognition

DTIC Science & Technology

2006-08-24

is tested with measurement data. The resulting GPR images demonstrate the effectiveness of the proposed algorithm. INTRODUCTION Subsurface imaging to...utilizes the fast Fourier transform (FFT) to expedite the imaging GPR. Recently, we reported a fast and effective SAR-based subsurface imaging technique that can provide good resolutions in both the range and cross-range domains [11]. Our algorithm differs from Witten's [9] and Hansen's
THE PSTD ALGORITHM: A TIME-DOMAIN METHOD REQUIRING ONLY TWO CELLS PER WAVELENGTH. (R825225)

EPA Science Inventory

A pseudospectral time-domain (PSTD) method is developed for solutions of Maxwell's equations. It uses the fast Fourier transform (FFT), instead of the finite differences used in conventional finite-difference time-domain (FDTD) methods, to represent spatial derivatives. Because the Fourie...
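The core operation, an FFT-based spatial derivative, can be sketched for a periodic one-dimensional field as below. This shows only the derivative, not a Maxwell solver; the function name, grid size, and test function are assumptions for illustration, and a second-order central difference is included purely for comparison.

```python
import numpy as np

def spectral_derivative(f, length):
    """Derivative of a periodic sample f on [0, length) using the FFT."""
    n = len(f)
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)   # angular wavenumbers
    # Differentiation in x is multiplication by i*k in Fourier space.
    return np.fft.ifft(1j * k * np.fft.fft(f)).real

n = 16
length = 2 * np.pi
x = np.arange(n) * (length / n)
f = np.sin(3 * x)                 # about 5.3 grid cells per wavelength
exact = 3 * np.cos(3 * x)

df_spectral = spectral_derivative(f, length)

# Second-order central difference on the same coarse grid, for comparison.
dx = length / n
df_central = (np.roll(f, -1) - np.roll(f, 1)) / (2 * dx)

err_spectral = np.max(np.abs(df_spectral - exact))
err_central = np.max(np.abs(df_central - exact))
```

On this coarse grid the spectral derivative matches the exact one to rounding error while the central difference is visibly off, which is the property that lets a pseudospectral scheme work with far fewer cells per wavelength.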
Optimal Simulations by Butterfly Networks: Extended Abstract,

DTIC Science & Technology

1987-11-01

Typescript, Univ. of Massachusetts; submitted for publication...(1987): An optimal mapping of the FFT algorithm onto the hypercube architecture. Typescript, Univ. of Massachusetts; submitted for publication. (HR I
mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OPENMP

NASA Astrophysics Data System (ADS)

Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo

2012-02-01

We have revised a general purpose parallel molecular dynamics simulation program mm_par using object-oriented programming. We parallelized the revised version using a hierarchical scheme in order to utilize more processors for a given system size. The benchmark result will be presented here. New version program summary. Program title: mm_par2.0 Catalogue identifier: ADXP_v2_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADXP_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 2 390 858 No. of bytes in distributed program, including test data, etc.: 25 068 310 Distribution format: tar.gz Programming language: C++ Computer: Any system operated by Linux or Unix Operating system: Linux Classification: 7.7 External routines: We provide wrappers for FFTW [1], Intel MKL library [2] FFT routine, and Numerical recipes [3] FFT, random number generator, and eigenvalue solver routines, SPRNG [4] random number generator, Mersenne Twister [5] random number generator, space filling curve routine. Catalogue identifier of previous version: ADXP_v1_0 Journal reference of previous version: Comput. Phys. Comm. 174 (2006) 560 Does the new version supersede the previous version?: Yes Nature of problem: Structural, thermodynamic, and dynamical properties of fluids and solids from microscopic scales to mesoscopic scales. Solution method: Molecular dynamics simulation in NVE, NVT, and NPT ensemble, Langevin dynamics simulation, dissipative particle dynamics simulation. Reasons for new version: First, object-oriented programming has been used, which is known to be open for extension and closed for modification. It is also known to be better for maintenance. Second, version 1.0 was based on atom decomposition and domain decomposition scheme [6] for parallelization.
However, atom decomposition is not popular due to its poor scalability. On the other hand, domain decomposition scheme is better for scalability. It still has a limitation in utilizing a large number of cores on recent petascale computers due to the requirement that the domain size is larger than the potential cutoff distance. To go beyond such a limitation, a hierarchical parallelization scheme has been adopted in this new version and implemented using MPI [7] and OPENMP [8]. Summary of revisions: (1) Object-oriented programming has been used. (2) A hierarchical parallelization scheme has been adopted. (3) SPME routine has been fully parallelized with parallel 3D FFT using volumetric decomposition scheme [9]. K.J.O. thanks Mr. Seung Min Lee for useful discussion on programming and debugging. Running time: Running time depends on system size and methods used. For test system containing a protein (PDB id: 5DHFR) with CHARMM22 force field [10] and 7023 TIP3P [11] waters in simulation box having dimension 62.23 Å×62.23 Å×62.23 Å, the benchmark results are given in Fig. 1. Here the potential cutoff distance was set to 12 Å and the switching function was applied from 10 Å for the force calculation in real space. For the SPME [12] calculation, K1, K2, and K3 were set to 64 and the interpolation order was set to 4. To do the fast Fourier transform, we used Intel MKL library. All bonds including hydrogen atoms were constrained using SHAKE/RATTLE algorithms [13,14]. The code was compiled using Intel compiler version 11.1 and mvapich2 version 1.5. Fig. 2 shows performance gains from using CUDA-enabled version [15] of mm_par for 5DHFR simulation in water on Intel Core2Quad 2.83 GHz and GeForce GTX 580. Even though mm_par2.0 is not ported yet for GPU, its performance data would be useful to expect mm_par2.0 performance on GPU. Timing results for 1000 MD steps. 1, 2, 4, and 8 in the figure mean the number of OPENMP threads.
Timing results for 1000 MD steps from double precision simulation on CPU, single precision simulation on GPU, and double precision simulation on GPU.
Big data in reciprocal space: Sliding fast Fourier transforms for determining periodicity

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vasudevan, Rama K., E-mail: rvv@ornl.gov; Belianinov, Alex; Baddorf, Arthur P.

Significant advances in atomically resolved imaging of crystals and surfaces have occurred in the last decade allowing unprecedented insight into local crystal structures and periodicity. Yet, the analysis of the long-range periodicity from the local imaging data, critical to correlation of functional properties and chemistry to the local crystallography, remains a challenge. Here, we introduce a Sliding Fast Fourier Transform (FFT) filter to analyze atomically resolved images of in-situ grown La5/8Ca3/8MnO3 (LCMO) films. We demonstrate the ability of sliding FFT algorithm to differentiate two sub-lattices, resulting from a mixed-terminated surface. Principal Component Analysis and Independent Component Analysis of the Sliding FFT dataset reveal the distinct changes in crystallography, step edges, and boundaries between the multiple sub-lattices. The implications for the LCMO system are discussed. The method is universal for images with any periodicity, and is especially amenable to atomically resolved probe and electron-microscopy data for rapid identification of the sub-lattices present.

Efficient matrix approach to optical wave propagation and Linear Canonical Transforms.

PubMed

Shakir, Sami A; Fried, David L; Pease, Edwin A; Brennan, Terry J; Dolash, Thomas M

2015-10-05

The Fresnel diffraction integral form of optical wave propagation and the more general Linear Canonical Transforms (LCT) are cast into a matrix transformation form. Taking advantage of recent efficient matrix multiply algorithms, this approach promises an efficient computational and analytical tool that is competitive with FFT based methods but offers better behavior in terms of aliasing, transparent boundary condition, and flexibility in number of sampling points and computational window sizes of the input and output planes being independent. This flexibility makes the method significantly faster than FFT based propagators when only a single point, as in Strehl metrics, or a limited number of points, as in power-in-the-bucket metrics, are needed in the output observation plane.
Robust High-Capacity Audio Watermarking Based on FFT Amplitude Modification

NASA Astrophysics Data System (ADS)

Fallahpour, Mehdi; Megías, David

This paper proposes a novel robust audio watermarking algorithm to embed data and extract it in a bit-exact manner based on changing the magnitudes of the FFT spectrum. The key point is selecting a frequency band for embedding based on the comparison between the original and the MP3 compressed/decompressed signal and on a suitable scaling factor. The experimental results show that the method has a very high capacity (about 5kbps), without significant perceptual distortion (ODG about -0.25) and provides robustness against common audio signal processing such as added noise, filtering and MPEG compression (MP3). Furthermore, the proposed method has a larger capacity (number of embedded bits to number of host bits rate) than recent image data hiding methods.
FFT analysis of sensible-heat solar-dynamic receivers

NASA Astrophysics Data System (ADS)

Lund, Kurt O.

The use of solar dynamic receivers with sensible energy storage in single-phase materials is considered. The feasibility of single-phase designs with weight and thermal performance comparable to existing two-phase designs is addressed. Linearized heat transfer equations are formulated for the receiver heat storage, representing the periodic input solar flux as the sum of steady and oscillating distributions. The steady component is solved analytically to produce the desired receiver steady outlet gas temperature, and the FFT algorithm is applied to the oscillating components to obtain the amplitudes and mode shapes of the oscillating solid and gas temperatures. The results indicate that sensible-heat receiver designs with performance comparable to state-of-the-art two-phase receivers are available.
A fast pulse design for parallel excitation with gridding conjugate gradient.

PubMed

Feng, Shuo; Ji, Jim

2013-01-01

Parallel excitation (pTx) is recognized as a crucial technique in high field MRI to address the transmit field inhomogeneity problem. However, designing pTx pulses can be undesirably time consuming. In this work, we propose a pulse design with gridding conjugate gradient (CG) based on the small-tip-angle approximation. The two major time-consuming matrix-vector multiplications are replaced by two operators that involve only FFT and gridding. Simulation results have shown that the proposed method is 3 times faster than the conventional method and that the memory cost is reduced by a factor of 1000.
A software simulation study of a (255,223) Reed-Solomon encoder-decoder

NASA Technical Reports Server (NTRS)

Pollara, F.

1985-01-01

A set of software programs which simulates a (255,223) Reed-Solomon encoder/decoder pair is described. The transform decoder algorithm uses a modified Euclid algorithm, and closely follows the pipeline architecture proposed for the hardware decoder. Uncorrectable error patterns are detected by a simple test, and the inverse transform is computed by a finite field FFT. Numerical examples of the decoder operation are given for some test codewords, with and without errors. The use of the software package is briefly described.
A hybrid method for transient wave propagation in a multilayered solid

NASA Astrophysics Data System (ADS)

Tian, Jiayong; Xie, Zhoumin

2009-08-01

We present a hybrid method for the evaluation of transient elastic-wave propagation in a multilayered solid, integrating reverberation matrix method with the theory of generalized rays. Adopting reverberation matrix formulation, Laplace-Fourier domain solutions of elastic waves in the multilayered solid are expanded into the sum of a series of generalized-ray group integrals. Each generalized-ray group integral containing Kth power of reverberation matrix R represents the set of K-times reflections and refractions of source waves arriving at receivers in the multilayered solid, which was computed by fast inverse Laplace transform (FILT) and fast Fourier transform (FFT) algorithms. However, the calculation burden and low precision of FILT-FFT algorithm limit the application of reverberation matrix method. In this paper, we expand each of generalized-ray group integrals into the sum of a series of generalized-ray integrals, each of which is accurately evaluated by Cagniard-De Hoop method in the theory of generalized ray. The numerical examples demonstrate that the proposed method makes it possible to calculate the early-time transient response in the complex multilayered-solid configuration efficiently.
Numerical implementation of non-local polycrystal plasticity using fast Fourier transforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lebensohn, Ricardo A.; Needleman, Alan

Here, we present the numerical implementation of a non-local polycrystal plasticity theory using the FFT-based formulation of Suquet and co-workers. Gurtin (2002) non-local formulation, with geometry changes neglected, has been incorporated in the EVP-FFT algorithm of Lebensohn et al. (2012). Numerical procedures for the accurate estimation of higher order derivatives of micromechanical fields, required for feedback into single crystal constitutive relations, are identified and applied. A simple case of a periodic laminate made of two fcc crystals with different plastic properties is first used to assess the soundness and numerical stability of the proposed algorithm and to study the influence of different model parameters on the predictions of the non-local model. Different behaviors at grain boundaries are explored, and the one consistent with the micro-clamped condition gives the most pronounced size effect. The formulation is applied next to 3-D fcc polycrystals, illustrating the possibilities offered by the proposed numerical scheme to analyze the mechanical response of polycrystalline aggregates in three dimensions accounting for size dependence arising from plastic strain gradients with reasonable computing times.
Numerical implementation of non-local polycrystal plasticity using fast Fourier transforms

DOE PAGES

Lebensohn, Ricardo A.; Needleman, Alan

2016-03-28

Here, we present the numerical implementation of a non-local polycrystal plasticity theory using the FFT-based formulation of Suquet and co-workers. Gurtin (2002) non-local formulation, with geometry changes neglected, has been incorporated in the EVP-FFT algorithm of Lebensohn et al. (2012). Numerical procedures for the accurate estimation of higher order derivatives of micromechanical fields, required for feedback into single crystal constitutive relations, are identified and applied. A simple case of a periodic laminate made of two fcc crystals with different plastic properties is first used to assess the soundness and numerical stability of the proposed algorithm and to study the influence of different model parameters on the predictions of the non-local model. Different behaviors at grain boundaries are explored, and the one consistent with the micro-clamped condition gives the most pronounced size effect. The formulation is applied next to 3-D fcc polycrystals, illustrating the possibilities offered by the proposed numerical scheme to analyze the mechanical response of polycrystalline aggregates in three dimensions accounting for size dependence arising from plastic strain gradients with reasonable computing times.
Adaptive Cross-correlation Algorithm and Experiment of Extended Scene Shack-Hartmann Wavefront Sensing

NASA Technical Reports Server (NTRS)

Sidick, Erkin; Morgan, Rhonda M.; Green, Joseph J.; Ohara, Catherine M.; Redding, David C.

2007-01-01

We have developed a new, adaptive cross-correlation (ACC) algorithm to estimate with high accuracy shifts as large as several pixels between two extended-scene images captured by a Shack-Hartmann wavefront sensor (SH-WFS). It determines the positions of all of the extended-scene image cells relative to a reference cell using an FFT-based iterative image shifting algorithm. It works with both point-source spot images and extended-scene images. We have also set up a testbed for extended-scene SH-WFS, and tested the ACC algorithm with the measured data of both point-source and extended-scene images. In this paper we describe our algorithm and present our experimental results.
Highly Efficient Parallel Multigrid Solver For Large-Scale Simulation of Grain Growth Using the Structural Phase Field Crystal Model

NASA Astrophysics Data System (ADS)

Guan, Zhen; Pekurovsky, Dmitry; Luce, Jason; Thornton, Katsuyo; Lowengrub, John

The structural phase field crystal (XPFC) model can be used to model grain growth in polycrystalline materials at diffusive time-scales while maintaining atomic scale resolution. However, the governing equation of the XPFC model is an integral-partial-differential-equation (IPDE), which poses challenges in implementation onto high performance computing (HPC) platforms. In collaboration with the XSEDE Extended Collaborative Support Service, we developed a distributed memory HPC solver for the XPFC model, which combines parallel multigrid and P3DFFT. The performance benchmarking on the Stampede supercomputer indicates near linear strong and weak scaling for both multigrid and transfer time between multigrid and FFT modules up to 1024 cores. Scalability of the FFT module begins to decline at 128 cores, but it is sufficient for the type of problem we will be examining. We have demonstrated simulations using 1024 cores, and we expect to achieve 4096 cores and beyond. Ongoing work involves optimization of MPI/OpenMP-based codes for the Intel KNL Many-Core Architecture. This optimizes the code for coming pre-exascale systems, in particular many-core systems such as Stampede 2.0 and Cori 2 at NERSC, without sacrificing efficiency on other general HPC systems.
Scientific Visualization and Simulation for Multi-dimensional Marine Environment Data

NASA Astrophysics Data System (ADS)

Su, T.; Liu, H.; Wang, W.; Song, Z.; Jia, Z.

2017-12-01

With growing attention to the ocean and the rapid development of marine detection, there is increasing demand for realistic simulation and interactive visualization of the marine environment in real time. Based on advanced technology such as GPU rendering, CUDA parallel computing and a rapid grid-oriented strategy, a series of efficient and high-quality visualization methods, which can deal with large-scale and multi-dimensional marine data in different environmental circumstances, is proposed in this paper. Firstly, a high-quality seawater simulation is realized by the FFT algorithm, bump mapping and texture animation technology. Secondly, large-scale multi-dimensional marine hydrological environmental data is virtualized by 3D interactive technologies and volume rendering techniques. Thirdly, seabed terrain data is simulated with an improved Delaunay algorithm, a surface reconstruction algorithm, a dynamic LOD algorithm and GPU programming techniques. Fourthly, seamless real-time modelling of both ocean and land based on a digital globe is achieved by the WebGL technique to meet the requirements of web-based applications. The experiments suggest that these methods not only achieve a satisfying marine environment simulation effect, but also meet the rendering requirements of global multi-dimensional marine data. Additionally, a simulation system for underwater oil spills is established with the OSG 3D-rendering engine. It is integrated with the marine visualization methods mentioned above and shows movement processes, physical parameters, current velocity and direction for different types of deep water oil spill particles (oil spill particles, hydrate particles, gas particles, etc.) dynamically and simultaneously in multiple dimensions. With such an application, valuable reference and decision-making information can be provided for understanding the progress of oil spills in deep water, which is helpful for ocean disaster forecasting, warning and emergency response.
An efficient 3-dim FFT for plane wave electronic structure calculations on massively parallel machines composed of multiprocessor nodes

NASA Astrophysics Data System (ADS)

Goedecker, Stefan; Boulet, Mireille; Deutsch, Thierry

2003-08-01

Three-dimensional Fast Fourier Transforms (FFTs) are the main computational task in plane wave electronic structure calculations. Obtaining high performance on a large number of processors is non-trivial on the latest generation of parallel computers that consist of nodes made up of shared memory multiprocessors. A non-dogmatic method for obtaining high performance for such 3-dim FFTs in a combined MPI/OpenMP programming paradigm will be presented. Exploiting the peculiarities of plane wave electronic structure calculations, speedups of up to 160 and speeds of up to 130 Gflops were obtained on 256 processors.
Frequency Estimator Performance for a Software-Based Beacon Receiver

NASA Technical Reports Server (NTRS)

Zemba, Michael J.; Morse, Jacquelynne Rose; Nessel, James A.; Miranda, Felix

2014-01-01

As propagation terminals have evolved, their design has trended more toward a software-based approach that facilitates convenient adjustment and customization of the receiver algorithms. One potential improvement is the implementation of a frequency estimation algorithm, through which the primary frequency component of the received signal can be estimated with a much greater resolution than with a simple peak search of the FFT spectrum. To select an estimator for usage in a QV-band beacon receiver, analysis of six frequency estimators was conducted to characterize their effectiveness as they relate to beacon receiver design.
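The abstract does not name the six estimators that were analyzed. As a minimal sketch of the general idea, the following compares a plain FFT peak search with one common refinement, parabolic interpolation of the log-magnitude spectrum around the peak bin. The function name `estimate_frequency`, the Hann window, and the test tone are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def estimate_frequency(x, fs):
    """Return (coarse, refined) estimates of the dominant frequency in Hz.

    coarse  : frequency of the largest FFT bin (simple peak search)
    refined : peak location after parabolic interpolation (an assumed,
              illustrative estimator, not necessarily one from the paper)
    """
    n = len(x)
    spectrum = np.abs(np.fft.rfft(x * np.hanning(n)))
    # Coarse peak search, excluding the two edge bins so both neighbours exist.
    k = int(np.argmax(spectrum[1:-1])) + 1
    # Fit a parabola through the log-magnitudes of the peak bin and its neighbours.
    a, b, c = np.log(spectrum[k - 1:k + 2])
    delta = 0.5 * (a - c) / (a - 2.0 * b + c)  # fractional bin offset
    return k * fs / n, (k + delta) * fs / n

# Hypothetical test tone that falls between two FFT bins.
fs = 1000.0
true_f = 123.4
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * true_f * t)
coarse, refined = estimate_frequency(x, fs)
print(f"peak search: {coarse:.3f} Hz, interpolated: {refined:.3f} Hz")
```

The peak search is limited to the bin spacing `fs / n`, whereas the interpolated estimate can resolve a fraction of a bin for a clean tone.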
A satellite-based radar wind sensor

NASA Technical Reports Server (NTRS)

Xin, Weizhuang

1991-01-01

The objective is to investigate the application of Doppler radar systems for global wind measurement. A model of the satellite-based radar wind sounder (RAWS) is discussed, and many critical problems in the designing process, such as the antenna scan pattern, tracking the Doppler shift caused by satellite motion, and backscattering of radar signals from different types of clouds, are discussed along with their computer simulations. In addition, algorithms for measuring mean frequency of radar echoes, such as the Fast Fourier Transform (FFT) estimator, the covariance estimator, and the estimators based on autoregressive models, are discussed. Monte Carlo computer simulations were used to compare the performance of these algorithms. Anti-alias methods are discussed for the FFT and the autoregressive methods. Several algorithms for reducing radar ambiguity were studied, such as random phase coding methods and staggered pulse repetition frequency (PRF) methods. Computer simulations showed that these methods are not applicable to the RAWS because of the broad spectral widths of the radar echoes from clouds. A waveform modulation method using the concept of spread spectrum and correlation detection was developed to solve the radar ambiguity. Radar ambiguity functions were used to analyze the effective signal-to-noise ratios for the waveform modulation method. The results showed that, with suitable bandwidth product and modulation of the waveform, this method can achieve the desired maximum range and maximum frequency of the radar system.
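To illustrate two of the mean-frequency estimators named above, here is a minimal sketch. It assumes the textbook forms: the FFT estimator as the power-weighted mean of the periodogram, and the covariance (pulse-pair) estimator as the phase of the lag-one autocovariance. The signal model, PRF, and noise level are hypothetical and do not come from the report.

```python
import numpy as np

def fft_mean_frequency(z, prf):
    """Power-weighted mean frequency of the periodogram (FFT estimator)."""
    power = np.abs(np.fft.fft(z)) ** 2
    freqs = np.fft.fftfreq(len(z), d=1.0 / prf)
    return float(np.sum(freqs * power) / np.sum(power))

def covariance_mean_frequency(z, prf):
    """Mean frequency from the phase of the lag-one autocovariance (pulse-pair)."""
    r1 = np.mean(np.conj(z[:-1]) * z[1:])
    return float(np.angle(r1) * prf / (2.0 * np.pi))

# Hypothetical complex echo: a single Doppler line plus white noise.
rng = np.random.default_rng(0)
prf = 1000.0          # pulse repetition frequency in Hz (assumed)
n = 256
f_true = 125.0        # Doppler shift in Hz (assumed)
t = np.arange(n) / prf
noise = 0.1 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
z = np.exp(2j * np.pi * f_true * t) + noise

f_fft = fft_mean_frequency(z, prf)
f_cov = covariance_mean_frequency(z, prf)
print(f"FFT estimator: {f_fft:.2f} Hz, covariance estimator: {f_cov:.2f} Hz")
```

Both estimates are unambiguous only within plus or minus half the PRF, which is the ambiguity that the staggered-PRF and phase-coding methods in the abstract try to address.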
A fast D.F.T. algorithm using complex integer transforms

NASA Technical Reports Server (NTRS)

Reed, I. S.; Truong, T. K.

1978-01-01

Winograd (1976) has developed a new class of algorithms which depend heavily on the computation of a cyclic convolution for computing the conventional DFT (discrete Fourier transform); this new algorithm, for a few hundred transform points, requires substantially fewer multiplications than the conventional FFT algorithm. Reed and Truong have defined a special class of finite Fourier-like transforms over GF(q squared), where q = 2 to the p power minus 1 is a Mersenne prime for p = 2, 3, 5, 7, 13, 17, 19, 31, 61. In the present paper it is shown that Winograd's algorithm can be combined with the aforementioned Fourier-like transform to yield a new algorithm for computing the DFT. A fast method for accurately computing the DFT of a sequence of complex numbers of very long transform-lengths is thus obtained.
On the Hilbert-Huang Transform Data Processing System Development

NASA Technical Reports Server (NTRS)

Kizhner, Semion; Flatley, Thomas P.; Huang, Norden E.; Cornwell, Evette; Smith, Darell

2003-01-01

One of the main heritage tools used in scientific and engineering data spectrum analysis is the Fourier Integral Transform and its high performance digital equivalent - the Fast Fourier Transform (FFT). The Fourier view of nonlinear mechanics that had existed for a long time, and the associated FFT (a fairly recent development), carry strong a-priori assumptions about the source data, such as linearity and stationarity. Natural phenomena measurements are essentially nonlinear and nonstationary. A very recent development at the National Aeronautics and Space Administration (NASA) Goddard Space Flight Center (GSFC), known as the Hilbert-Huang Transform (HHT), proposes a novel approach to the solution for the nonlinear class of spectrum analysis problems. Using the Empirical Mode Decomposition (EMD) followed by the Hilbert Transform of the empirical decomposition data (HT), the HHT allows spectrum analysis of nonlinear and nonstationary data by using an engineering a-posteriori data processing, based on the EMD algorithm. This results in a non-constrained decomposition of a source real value data vector into a finite set of Intrinsic Mode Functions (IMF) that can be further analyzed for spectrum interpretation by the classical Hilbert Transform. This paper describes phase one of the development of a new engineering tool, the HHT Data Processing System (HHTDPS). The HHTDPS allows applying the HHT to a data vector in a fashion similar to the heritage FFT. It is a generic, low cost, high performance personal computer (PC) based system that implements the HHT computational algorithms in a user friendly, file driven environment. This paper also presents a quantitative analysis for a complex waveform data sample, a summary of technology commercialization efforts and the lessons learned from this new technology development.
A microcomputer based frequency-domain processor for laser Doppler anemometry

NASA Technical Reports Server (NTRS)

Horne, W. Clifton; Adair, Desmond

1988-01-01

A prototype multi-channel laser Doppler anemometry (LDA) processor was assembled using a wideband transient recorder and a microcomputer with an array processor for fast Fourier transform (FFT) computations. The prototype instrument was used to acquire, process, and record signals from a three-component wind tunnel LDA system subject to various conditions of noise and flow turbulence. The recorded data was used to evaluate the effectiveness of burst acceptance criteria, processing algorithms, and selection of processing parameters such as record length. The recorded signals were also used to obtain comparative estimates of signal-to-noise ratio between time-domain and frequency-domain signal detection schemes. These comparisons show that the FFT processing scheme allows accurate processing of signals for which the signal-to-noise ratio is 10 to 15 dB less than is practical using counter processors.
Real-time spectral analysis of HRV signals: an interactive and user-friendly PC system.

PubMed

Basano, L; Canepa, F; Ottonello, P

1998-01-01

We present a real-time system, built around a PC and a low-cost data acquisition board, for the spectral analysis of the heart rate variability signal. The Windows-like operating environment on which it is based makes the computer program very user-friendly even for non-specialized personnel. The Power Spectral Density is computed through the use of a hybrid method, in which a classical FFT analysis follows an autoregressive finite-extension of data; the stationarity of the sequence is continuously checked. The use of this algorithm gives a high degree of robustness of the spectral estimation. Moreover, always in real time, the FFT of every data block is computed and displayed in order to corroborate the results as well as to allow the user to interactively choose a proper AR model order.
Properly used ''aliasing'' can give better resolution from fewer points in Fourier transform spectroscopy

NASA Astrophysics Data System (ADS)

D'Astous, Y.; Blanchard, M.

1982-05-01

In the past years, the Journal has published a number of articles [1-5] devoted to the introduction of Fourier transform spectroscopy in the undergraduate labs. In most papers, the proposed experimental setup consists of a Michelson interferometer, a light source, a light detector, and a chart recorder. The student uses this setup to record an interferogram which is then Fourier transformed to obtain the spectrogram of the light source. Although attempts have been made to ease the task of performing the required Fourier transform [6], the use of computers and Cooley-Tukey's fast Fourier transform (FFT) algorithm [7] is by far the simplest method to use. However, to be able to use FFT, one has to get a number of samples of the interferogram, a tedious job which should be kept to a minimum. (AIP)
An improved scheme for Flip-OFDM based on Hartley transform in short-range IM/DD systems.

PubMed

Zhou, Ji; Qiao, Yaojun; Cai, Zhuo; Ji, Yuefeng

2014-08-25

In this paper, an improved Flip-OFDM scheme is proposed for IM/DD optical systems, where the modulation/demodulation processing takes advantage of the fast Hartley transform (FHT) algorithm. We realize the improved scheme in one symbol period, whereas the conventional Flip-OFDM scheme based on the fast Fourier transform (FFT) requires two consecutive symbol periods. The complexity of many operations in the improved scheme, such as the CP operation, polarity inversion and symbol delay, is therefore half of that in the conventional scheme. Compared to the FFT with a complex input constellation, the complexity of the FHT with a real input constellation is halved. A transmission experiment over 50-km SSMF has been carried out to verify the feasibility of the improved scheme. In conclusion, the improved scheme has the same BER performance as the conventional scheme, but much lower complexity.
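The property that makes the Hartley transform attractive here is that a real input constellation maps directly to a real time-domain signal. This can be sketched with the discrete Hartley transform (DHT), computed below from an FFT through the identity H[k] = Re(X[k]) - Im(X[k]). This is an illustration only: a dedicated FHT would avoid the complex arithmetic, and the 4-level real constellation and block size are assumptions rather than parameters from the paper.

```python
import numpy as np

def dht(x):
    """Discrete Hartley transform: H[k] = sum_n x[n] * cas(2*pi*n*k/N)."""
    X = np.fft.fft(np.asarray(x, dtype=float))
    return X.real - X.imag

def idht(h):
    """Inverse DHT: the same transform scaled by 1/N (the DHT is an involution)."""
    return dht(h) / len(h)

# Hypothetical real-valued (PAM-like) constellation symbols on 16 subcarriers.
rng = np.random.default_rng(1)
symbols = rng.choice([-3.0, -1.0, 1.0, 3.0], size=16)

time_signal = idht(symbols)    # modulation: real symbols give a real waveform
recovered = dht(time_signal)   # demodulation with the same kernel

print(np.isrealobj(time_signal), np.allclose(recovered, symbols))
```

Because the forward and inverse transforms share one kernel, the transmitter and receiver can reuse the same routine, and no Hermitian symmetry has to be imposed on the subcarriers to obtain a real signal.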

Algorithm Development for a Real-Time Military Noise Monitor

DTIC Science & Technology

2006-03-24

Duration ESLM Enhanced Sound Level Meter ERDC-CERL Engineer Research and Development Center/Construction Engineering Research Laboratory FFT...Fast Fourier Transform FTIG Fort Indiantown Gap Kurt Kurtosis LD Larson Davis Leq Equivalent Sound Level L8eq 8-hr Equivalent...Sound Level Lpk Peak Sound Level m Spectral Slope MCBCL Marine Corps Base Camp Lejeune Neg Number of negative samples NI National
Synthesis, Analysis, and Processing of Fractal Signals

DTIC Science & Technology

1991-10-01

coordinator in hockey, squash, volleyball, and softball, but also for reminding me periodically that 1/f noise can exist outside a computer. More...similar signals as Fourier-based representations are for stationary and periodic signals. Furthermore, because wave- let transformations can be...and periodic signals. Furthermore, just as the discovery of fast Fourier transform (FFT) algorithms dramatically increased the viability the Fourier
A digitally implemented preambleless demodulator for maritime and mobile data communications

NASA Astrophysics Data System (ADS)

Chalmers, Harvey; Shenoy, Ajit; Verahrami, Farhad B.

The hardware design and software algorithms for a low-bit-rate, low-cost, all-digital preambleless demodulator are described. The demodulator operates under severe high-noise conditions, fast Doppler frequency shifts, large frequency offsets, and multipath fading. Sophisticated algorithms, including a fast Fourier transform (FFT)-based burst acquisition algorithm, a cycle-slip resistant carrier phase tracker, an innovative Doppler tracker, and a fast acquisition symbol synchronizer, were developed and extensively simulated for reliable burst reception. The compact digital signal processor (DSP)-based demodulator hardware uses a unique personal computer test interface for downloading test data files. The demodulator test results demonstrate a near-ideal performance within 0.2 dB of theory.
Real time display Fourier-domain OCT using multi-thread parallel computing with data vectorization

NASA Astrophysics Data System (ADS)

Eom, Tae Joong; Kim, Hoon Seop; Kim, Chul Min; Lee, Yeung Lak; Choi, Eun-Seo

2011-03-01

We demonstrate a real-time display of processed OCT images using multi-thread parallel computing with a quad-core CPU of a personal computer. The data of each A-line are treated as one vector to maximize the data translation rate between the cores of the CPU and RAM stored image data. A display rate of 29.9 frames/sec for processed OCT data (4096 FFT-size x 500 A-scans) is achieved in our system using a wavelength swept source with 52-kHz swept frequency. The data processing times of the OCT image and a Doppler OCT image with a 4-time average are 23.8 msec and 91.4 msec.
Nested Conjugate Gradient Algorithm with Nested Preconditioning for Non-linear Image Restoration.

PubMed

Skariah, Deepak G; Arigovindan, Muthuvel

2017-06-19

We develop a novel optimization algorithm, which we call Nested Non-Linear Conjugate Gradient algorithm (NNCG), for image restoration based on quadratic data fitting and smooth non-quadratic regularization. The algorithm is constructed as a nesting of two conjugate gradient (CG) iterations. The outer iteration is constructed as a preconditioned non-linear CG algorithm; the preconditioning is performed by the inner CG iteration that is linear. The inner CG iteration, which performs preconditioning for the outer CG iteration, is itself accelerated by another FFT-based non-iterative preconditioner. We prove that the method converges to a stationary point for both convex and non-convex regularization functionals. We demonstrate experimentally that the proposed method outperforms the well-known majorization-minimization method used for convex regularization, and a non-convex inertial-proximal method for non-convex regularization functionals.
Wavelet phase extracting demodulation algorithm based on scale factor for optical fiber Fabry-Perot sensing.

PubMed

Zhang, Baolin; Tong, Xinglin; Hu, Pan; Guo, Qian; Zheng, Zhiyuan; Zhou, Chaoran

2016-12-26

Optical fiber Fabry-Perot (F-P) sensors have been used in various on-line monitoring of physical parameters such as acoustics, temperature and pressure. In this paper, a wavelet phase extracting demodulation algorithm for optical fiber F-P sensing is first proposed. In application of this demodulation algorithm, the search range of the scale factor is determined by the estimated cavity length, which is obtained by the fast Fourier transform (FFT) algorithm. Phase information of each point on the optical interference spectrum can be directly extracted through the continuous complex wavelet transform without de-noising. The cavity length of the optical fiber F-P sensor is then calculated from the slope of the fitted curve of the phase. Theoretical analysis and experimental results show that this algorithm can greatly reduce the amount of computation and improve demodulation speed and accuracy.
Hardware architecture design of image restoration based on time-frequency domain computation

NASA Astrophysics Data System (ADS)

Wen, Bo; Zhang, Jing; Jiao, Zipeng

2013-10-01

Image restoration algorithms based on time-frequency domain computation (TFDC) are mature and widely applied in engineering. To enable high-speed implementation of these algorithms, the TFDC hardware architecture is proposed. Firstly, the main module is designed by analyzing the common processing and numerical calculation. Then, to improve commonality, an iteration control module is planned for iterative algorithms. In addition, to reduce the computational cost and memory requirements, the necessary optimizations are suggested for the time-consuming modules, which include the two-dimensional FFT/IFFT and the complex-number calculation. Finally, the TFDC hardware architecture is adopted for the hardware design of a real-time image restoration system. The results show that the TFDC hardware architecture and its optimizations can be applied to image restoration algorithms based on TFDC, with good algorithm commonality, hardware realizability and high efficiency.
A fast estimator for the bispectrum and beyond - a practical method for measuring non-Gaussianity in 21-cm maps

NASA Astrophysics Data System (ADS)

Watkinson, Catherine A.; Majumdar, Suman; Pritchard, Jonathan R.; Mondal, Rajesh

2017-12-01

In this paper, we establish the accuracy and robustness of a fast estimator for the bispectrum - the 'FFT-bispectrum estimator'. The implementation of the estimator presented here offers speed and simplicity benefits over a direct-measurement approach. We also generalize the derivation so it may easily be applied to any order polyspectra, such as the trispectrum, with the cost of only a handful of Fast-Fourier Transforms (FFTs). All lower order statistics can also be calculated simultaneously for little extra cost. To test the estimator, we make use of a non-linear density field, and for a more strongly non-Gaussian test case, we use a toy-model of reionization in which ionized bubbles at a given redshift are all of equal size and are randomly distributed. Our tests find that the FFT-estimator remains accurate over a wide range of k, and so should be extremely useful for analysis of 21-cm observations. The speed of the FFT-bispectrum estimator makes it suitable for sampling applications, such as Bayesian inference. The algorithm we describe should prove valuable in the analysis of simulations and observations, and whilst we apply it within the field of cosmology, this estimator is useful in any field that deals with non-Gaussian data.
On the inversion of geodetic integrals defined over the sphere using 1-D FFT

NASA Astrophysics Data System (ADS)

García, R. V.; Alejo, C. A.

2005-08-01

An iterative method is presented which performs inversion of integrals defined over the sphere. The method is based on one-dimensional fast Fourier transform (1-D FFT) inversion and is implemented with the projected Landweber technique, which is used to solve constrained least-squares problems reducing the associated 1-D cyclic-convolution error. The results obtained are as precise as the direct matrix inversion approach, but with better computational efficiency. A case study uses the inversion of Hotine’s integral to obtain gravity disturbances from geoid undulations. Numerical convergence is also analyzed and comparisons with respect to the direct matrix inversion method using conjugate gradient (CG) iteration are presented. Like the CG method, the number of iterations needed to get the optimum (i.e., small) error decreases as the measurement noise increases. Nevertheless, for discrete data given over a whole parallel band, the method can be applied directly without implementing the projected Landweber method, since no cyclic convolution error exists.
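The iteration described above can be sketched on a toy 1-D problem. This is not the authors' geodetic implementation: the forward operator here is a simple cyclic convolution applied with the FFT, and the constraint enforced by the projection step is non-negativity, chosen only for illustration. The kernel `h`, the test signal, and the step size rule are all assumptions.

```python
import numpy as np

def projected_landweber(h, b, n_iter=500):
    """Solve min ||h (*) x - b||^2 subject to x >= 0, with (*) a cyclic convolution.

    Each step is x <- P(x + tau * A^T (b - A x)); A and A^T are applied via 1-D FFTs.
    The non-negativity projection P is an illustrative choice of constraint.
    """
    H = np.fft.fft(h)
    tau = 1.0 / np.max(np.abs(H)) ** 2          # step size below 2 / ||A||^2
    x = np.zeros_like(b)
    for _ in range(n_iter):
        residual = b - np.fft.ifft(H * np.fft.fft(x)).real
        x = x + tau * np.fft.ifft(np.conj(H) * np.fft.fft(residual)).real
        x = np.maximum(x, 0.0)                   # projection onto the constraint set
    return x

# Hypothetical smoothing kernel and non-negative test signal.
n = 64
h = np.zeros(n)
h[[0, 1, -1]] = [0.6, 0.2, 0.2]
x_true = np.zeros(n)
x_true[20:30] = 1.0
x_true[40] = 2.0
b = np.fft.ifft(np.fft.fft(h) * np.fft.fft(x_true)).real   # noise-free data

x_est = projected_landweber(h, b)
print(np.max(np.abs(x_est - x_true)))
```

Each iteration costs a few FFTs of length n, whereas a direct matrix inversion needs the full n-by-n operator; that trade is the efficiency argument the abstract makes.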
GPU-Accelerated Forward and Back-Projections with Spatially Varying Kernels for 3D DIRECT TOF PET Reconstruction.

PubMed

Ha, S; Matej, S; Ispiryan, M; Mueller, K

2013-02-01

We describe a GPU-accelerated framework that efficiently models spatially (shift) variant system response kernels and performs forward- and back-projection operations with these kernels for the DIRECT (Direct Image Reconstruction for TOF) iterative reconstruction approach. Inherent challenges arise from the poor memory cache performance at non-axis aligned TOF directions. Focusing on the GPU memory access patterns, we utilize different kinds of GPU memory according to these patterns in order to maximize the memory cache performance. We also exploit the GPU instruction-level parallelism to efficiently hide long latencies from the memory operations. Our experiments indicate that our GPU implementation of the projection operators has slightly faster or approximately comparable time performance than FFT-based approaches using state-of-the-art FFTW routines. However, most importantly, our GPU framework can also efficiently handle any generic system response kernels, such as spatially symmetric and shift-variant as well as spatially asymmetric and shift-variant, both of which an FFT-based approach cannot cope with.
GPU-Accelerated Forward and Back-Projections With Spatially Varying Kernels for 3D DIRECT TOF PET Reconstruction

NASA Astrophysics Data System (ADS)

Ha, S.; Matej, S.; Ispiryan, M.; Mueller, K.

2013-02-01

We describe a GPU-accelerated framework that efficiently models spatially (shift) variant system response kernels and performs forward- and back-projection operations with these kernels for the DIRECT (Direct Image Reconstruction for TOF) iterative reconstruction approach. Inherent challenges arise from the poor memory cache performance at non-axis aligned TOF directions. Focusing on the GPU memory access patterns, we utilize different kinds of GPU memory according to these patterns in order to maximize the memory cache performance. We also exploit the GPU instruction-level parallelism to efficiently hide long latencies from the memory operations. Our experiments indicate that our GPU implementation of the projection operators has slightly faster or approximately comparable time performance than FFT-based approaches using state-of-the-art FFTW routines. However, most importantly, our GPU framework can also efficiently handle any generic system response kernels, such as spatially symmetric and shift-variant as well as spatially asymmetric and shift-variant, both of which an FFT-based approach cannot cope with.
Application of the one-dimensional Fourier transform for tracking moving objects in noisy environments

NASA Technical Reports Server (NTRS)

Rajala, S. A.; Riddle, A. N.; Snyder, W. E.

1983-01-01

In Riddle and Rajala (1981), an algorithm was presented which operates on an image sequence to identify all sets of pixels having the same velocity. The algorithm operates by performing a transformation in which all pixels with the same two-dimensional velocity map to a peak in a transform space. The transform can be decomposed into applications of the one-dimensional Fourier transform and therefore can gain from the computational advantages of the FFT. This paper is concerned with the fundamental limitations of that algorithm, particularly its sensitivity to image-disturbing factors such as noise, jitter, and clutter. A modification to the algorithm is then proposed which increases its robustness in the presence of these disturbances.
Efficient Terahertz Wide-Angle NUFFT-Based Inverse Synthetic Aperture Imaging Considering Spherical Wavefront.

PubMed

Gao, Jingkun; Deng, Bin; Qin, Yuliang; Wang, Hongqiang; Li, Xiang

2016-12-14

An efficient wide-angle inverse synthetic aperture imaging method considering the spherical wavefront effects and suitable for the terahertz band is presented. Firstly, the echo signal model under spherical wave assumption is established, and the detailed wavefront curvature compensation method accelerated by 1D fast Fourier transform (FFT) is discussed. Then, to speed up the reconstruction procedure, the fast Gaussian gridding (FGG)-based nonuniform FFT (NUFFT) is employed to focus the image. Finally, proof-of-principle experiments are carried out and the results are compared with the ones obtained by the convolution back-projection (CBP) algorithm. The results demonstrate the effectiveness and the efficiency of the presented method. This imaging method can be directly used in the field of nondestructive detection and can also be used to provide a solution for the calculation of the far-field RCSs (Radar Cross Section) of targets in the terahertz regime.
A combined finite element-boundary element formulation for solution of two-dimensional problems via CGFFT

NASA Technical Reports Server (NTRS)

Collins, Jeffery D.; Jin, Jian-Ming; Volakis, John L.

1990-01-01

A method for the computation of electromagnetic scattering from arbitrary two-dimensional bodies is presented. The method combines the finite element and boundary element methods leading to a system for solution via the conjugate gradient Fast Fourier Transform (FFT) algorithm. Two forms of boundaries aimed at reducing the storage requirement of the boundary integral are investigated. It is shown that the boundary integral becomes convolutional when a circular enclosure is chosen, resulting in reduced storage requirement when the system is solved via the conjugate gradient FFT method. The same holds for the ogival enclosure, except that some of the boundary integrals are not convolutional and must be carefully treated to maintain O(N) memory requirement. Results for several circular and ogival structures are presented and shown to be in excellent agreement with those obtained by traditional methods.
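The storage saving from a convolutional boundary integral can be illustrated with a minimal sketch. It assumes a one-dimensional circulant operator, which is a simplification and not the discretization used in the paper: only the first column of the matrix is stored, and the matrix-vector product needed in each conjugate gradient iteration is evaluated with FFTs. The names `circulant_matvec` and `dense_circulant` are hypothetical.

```python
import numpy as np

def circulant_matvec(first_col, x):
    """Multiply the circulant matrix defined by first_col with x using FFTs.

    Only the first column is kept (O(N) storage); the dense N x N matrix
    is never formed. The product is a circular convolution.
    """
    return np.fft.ifft(np.fft.fft(first_col) * np.fft.fft(x))

def dense_circulant(first_col):
    """Build the full circulant matrix, for checking only (O(N^2) storage)."""
    n = len(first_col)
    return np.array([[first_col[(i - j) % n] for j in range(n)]
                     for i in range(n)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c = rng.standard_normal(8)
    x = rng.standard_normal(8)
    # Both products agree to rounding error.
    print(np.allclose(dense_circulant(c) @ x, circulant_matvec(c, x)))
```

A conjugate gradient solver would call `circulant_matvec` in place of a dense product, which is where the O(N) memory requirement comes from in this simplified setting.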
Frequency Estimator Performance for a Software-Based Beacon Receiver

NASA Technical Reports Server (NTRS)

Zemba, Michael J.; Morse, Jacquelynne R.; Nessel, James A.

2014-01-01

As propagation terminals have evolved, their design has trended more toward a software-based approach that facilitates convenient adjustment and customization of the receiver algorithms. One potential improvement is the implementation of a frequency estimation algorithm, through which the primary frequency component of the received signal can be estimated with a much greater resolution than with a simple peak search of the FFT spectrum. To select an estimator for usage in a Q/V-band beacon receiver, analysis of six frequency estimators was conducted to characterize their effectiveness as they relate to beacon receiver design.
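To illustrate how an estimator can resolve frequency more finely than a plain peak search of the FFT spectrum, the sketch below uses parabolic interpolation of the log-magnitude around the peak bin. This is one common technique chosen for illustration; the abstract does not name the six estimators that were analyzed, so it should not be read as one of them. The function names and the test signal parameters are assumptions.

```python
import numpy as np

def peak_search(x, fs):
    """Coarse estimate: frequency of the largest FFT magnitude bin."""
    spectrum = np.abs(np.fft.rfft(x))
    k = int(np.argmax(spectrum[1:-1])) + 1  # skip the DC and Nyquist bins
    return k * fs / len(x), k, spectrum

def parabolic_refine(x, fs):
    """Refine the peak with a parabola through the log-magnitudes of 3 bins."""
    window = np.hanning(len(x))  # a Hann window makes the peak shape smoother
    _, k, spectrum = peak_search(x * window, fs)
    a, b, c = np.log(spectrum[k - 1:k + 2])
    delta = 0.5 * (a - c) / (a - 2.0 * b + c)  # fractional bin offset
    return (k + delta) * fs / len(x)

if __name__ == "__main__":
    fs, n, f0 = 1000.0, 1024, 123.4  # assumed sample rate, length, tone
    t = np.arange(n) / fs
    x = np.cos(2 * np.pi * f0 * t)
    print(peak_search(x, fs)[0], parabolic_refine(x, fs))
```

The plain peak search is limited to the bin spacing `fs / n`, while the interpolated estimate lands within a small fraction of a bin for a clean tone.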
Multi-carrier Communications over Time-varying Acoustic Channels

NASA Astrophysics Data System (ADS)

Aval, Yashar M.

Acoustic communication is an enabling technology for many autonomous undersea systems, such as those used for ocean monitoring, offshore oil and gas industry, aquaculture, or port security. There are three main challenges in achieving reliable high-rate underwater communication: the bandwidth of acoustic channels is extremely limited, the propagation delays are long, and the Doppler distortions are more pronounced than those found in wireless radio channels. In this dissertation we focus on assessing the fundamental limitations of acoustic communication, and designing efficient signal processing methods that can overcome these limitations. We address the fundamental question of acoustic channel capacity (achievable rate) for single-input-multi-output (SIMO) acoustic channels using a per-path Rician fading model, and focusing on two scenarios: narrowband channels where the channel statistics can be approximated as frequency-independent, and wideband channels where the nominal path loss is frequency-dependent. In each scenario, we compare several candidate power allocation techniques, and show that assigning uniform power across all frequencies for the first scenario, and assigning uniform power across a selected frequency band for the second scenario, are the best practical choices in most cases, because the long propagation delay renders the feedback information outdated for power allocation based on the estimated channel response. We quantify our results using the channel information extracted from the 2010 Mobile Acoustic Communications Experiment (MACE'10). Next, we focus on achieving reliable high-rate communication over underwater acoustic channels. Specifically, we investigate orthogonal frequency division multiplexing (OFDM) as the state-of-the-art technique for dealing with frequency-selective multipath channels, and propose a class of methods that compensate for the time-variation of the underwater acoustic channel.
These methods are based on multiple-FFT demodulation, and are implemented as partial (P), shaped (S), fractional (F), and Taylor series expansion (T) FFT demodulation. They replace the conventional FFT demodulation with a few FFTs and a combiner. The input to each FFT is a specific transformation of the input signal (P, S, F, T), while the combiner performs weighted summation of the FFT outputs. We design an adaptive algorithm of stochastic gradient type to learn the combiner weights for coherent and differentially coherent detection. The algorithm is cast into the framework of multiple receiving elements to take advantage of spatial diversity. Synthetic data, as well as experimental data from the MACE'10 experiment, are used to demonstrate the performance of the proposed methods, showing significant improvement over conventional detection techniques with or without inter-carrier interference equalization (5 dB to 7 dB on average over multiple hours), as well as improved bandwidth efficiency.
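The "few FFTs and a combiner" structure can be sketched for the partial (P) variant. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the block is cut into equal, non-overlapping time segments, and the combiner uses one weight per segment rather than learned per-carrier weights. The names `partial_fft` and `combine` are hypothetical.

```python
import numpy as np

def partial_fft(block, num_parts):
    """Split one OFDM block into non-overlapping time segments and FFT each.

    Each segment stays at its original position and is zero elsewhere, so
    every output row has the full block length.
    """
    n = len(block)
    edges = np.linspace(0, n, num_parts + 1).astype(int)
    outputs = np.zeros((num_parts, n), dtype=complex)
    for m in range(num_parts):
        segment = np.zeros(n, dtype=complex)
        segment[edges[m]:edges[m + 1]] = block[edges[m]:edges[m + 1]]
        outputs[m] = np.fft.fft(segment)
    return outputs

def combine(outputs, weights):
    """Weighted sum of the partial FFT outputs (one weight per segment)."""
    return np.tensordot(weights, outputs, axes=1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    block = rng.standard_normal(64) + 1j * rng.standard_normal(64)
    parts = partial_fft(block, 4)
    # With all weights equal to one, the combiner reproduces the ordinary FFT.
    print(np.allclose(combine(parts, np.ones(4)), np.fft.fft(block)))
```

With unit weights the result equals conventional FFT demodulation, which is why adapting the weights gives the receiver room to compensate for channel variation within a block.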
Enhancement of lung sounds based on empirical mode decomposition and Fourier transform algorithm.

PubMed

Mondal, Ashok; Banerjee, Poulami; Somkuwar, Ajay

2017-02-01

Heart sound (HS) signals always interfere with the recording of lung sound (LS) signals. This obscures the features of LS signals and creates confusion about pathological states, if any, of the lungs. In this work, a new method is proposed for reduction of heart sound interference which is based on the empirical mode decomposition (EMD) technique and a prediction algorithm. In this approach, first the mixed signal is split into several components in terms of intrinsic mode functions (IMFs). Thereafter, HS-included segments are localized and removed from them. The missing values in the gaps thus produced are predicted by a new Fast Fourier Transform (FFT) based prediction algorithm, and the time domain LS signal is reconstructed by taking an inverse FFT of the estimated missing values. The experiments have been conducted on simulated and recorded HS-corrupted LS signals at three different flow rates and various SNR levels. The performance of the proposed method is evaluated by qualitative and quantitative analysis of the results. It is found that the proposed method is superior to the baseline method in terms of quantitative and qualitative measurement. The developed method gives better results than the baseline method for different SNR levels. Our method gives a cross correlation index (CCI) of 0.9488, a signal to deviation ratio (SDR) of 9.8262, and a normalized maximum amplitude error (NMAE) of 26.94 for 0 dB SNR. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Fast frequency acquisition via adaptive least squares algorithm

NASA Technical Reports Server (NTRS)

Kumar, R.

1986-01-01

A new least squares algorithm is proposed and investigated for fast frequency and phase acquisition of sinusoids in the presence of noise. This algorithm is a special case of more general, adaptive parameter-estimation techniques. The advantages of the algorithms are their conceptual simplicity, flexibility and applicability to general situations. For example, the frequency to be acquired can be time varying, and the noise can be nonGaussian, nonstationary and colored. As the proposed algorithm can be made recursive in the number of observations, it is not necessary to have a priori knowledge of the received signal-to-noise ratio or to specify the measurement time. This would be required for batch processing techniques, such as the fast Fourier transform (FFT). The proposed algorithm improves the frequency estimate on a recursive basis as more and more observations are obtained. When the algorithm is applied in real time, it has the extra advantage that the observations need not be stored. The algorithm also yields a real time confidence measure as to the accuracy of the estimator.
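The idea of a recursive estimate that improves with each observation and needs no stored history can be sketched as follows. This is not the algorithm of the paper. It is a simple one-parameter least squares fit that assumes a single noise-free sinusoid, for which x[n] + x[n-2] = 2 cos(w) x[n-1]; in noise it is biased. The class name `RecursiveFrequencyEstimator` is hypothetical.

```python
import math

class RecursiveFrequencyEstimator:
    """Running least squares fit of a in x[n] + x[n-2] = a * x[n-1]."""

    def __init__(self):
        self.prev1 = None  # x[n-1]
        self.prev2 = None  # x[n-2]
        self.num = 0.0     # running sum of x[n-1] * (x[n] + x[n-2])
        self.den = 0.0     # running sum of x[n-1] ** 2

    def update(self, sample):
        """Consume one sample; return the estimate in rad/sample, or None."""
        if self.prev1 is not None and self.prev2 is not None:
            self.num += self.prev1 * (sample + self.prev2)
            self.den += self.prev1 * self.prev1
        # Only the last two samples are kept, not the whole record.
        self.prev2, self.prev1 = self.prev1, sample
        if self.den == 0.0:
            return None
        a = max(-2.0, min(2.0, self.num / self.den))  # keep acos in range
        return math.acos(a / 2.0)

if __name__ == "__main__":
    est = RecursiveFrequencyEstimator()
    w = None
    for n in range(200):
        w = est.update(math.cos(0.7 * n + 0.3))
    print(w)  # close to the true value of 0.7 rad/sample
```

Each call to `update` costs a constant amount of work, in contrast to a batch method such as the FFT, which needs the full record and a measurement time chosen in advance.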
Field Dislocation Mechanics for heterogeneous elastic materials: A numerical spectral approach

DOE Office of Scientific and Technical Information (OSTI.GOV)

Djaka, Komlan Senam; Villani, Aurelien; Taupin, Vincent

Spectral methods using Fast Fourier Transform (FFT) algorithms have recently seen a surge in interest in the mechanics of materials community. The present work addresses the critical question of determining accurate local mechanical fields using FFT methods without artificial fluctuations arising from material- and defect-induced discontinuities. Precisely, this work introduces a numerical approach based on intrinsic discrete Fourier transforms for the simultaneous treatment of material discontinuities arising from the presence of dislocations and from elastic stiffness heterogeneities. To this end, the elasto-static equations of the field dislocation mechanics theory for periodic heterogeneous materials are numerically solved with FFT in the case of dislocations in proximity of inclusions of varying stiffness. An optimal intrinsic discrete Fourier transform method is sought based on two distinct schemes. A centered finite difference scheme for differential rules is used for numerically solving the Poisson-type equation in the Fourier space, while centered finite differences on a rotated grid are chosen for the computation of the modified Fourier–Green’s operator associated with the Lippmann–Schwinger-type equation. By comparing different methods with analytical solutions for an edge dislocation in a composite material, it is found that the present spectral method is accurate, devoid of any numerical oscillation, and efficient even for an infinite phase elastic contrast like a hole embedded in a matrix containing a dislocation. The present FFT method is then used to simulate physical cases such as the elastic fields of dislocation dipoles located near the matrix/inclusion interface in a 2D composite material and the ones due to dislocation loop distributions surrounding cubic inclusions in 3D composite material. In these configurations, the spectral method allows investigating accurately the elastic interactions and image stresses due to dislocation fields in the presence of elastic inhomogeneities.
Field Dislocation Mechanics for heterogeneous elastic materials: A numerical spectral approach

DOE PAGES

Djaka, Komlan Senam; Villani, Aurelien; Taupin, Vincent; ...

2017-03-01

Spectral methods using Fast Fourier Transform (FFT) algorithms have recently seen a surge in interest in the mechanics of materials community. The present work addresses the critical question of determining accurate local mechanical fields using FFT methods without artificial fluctuations arising from material- and defect-induced discontinuities. Precisely, this work introduces a numerical approach based on intrinsic discrete Fourier transforms for the simultaneous treatment of material discontinuities arising from the presence of dislocations and from elastic stiffness heterogeneities. To this end, the elasto-static equations of the field dislocation mechanics theory for periodic heterogeneous materials are numerically solved with FFT in the case of dislocations in proximity of inclusions of varying stiffness. An optimal intrinsic discrete Fourier transform method is sought based on two distinct schemes. A centered finite difference scheme for differential rules is used for numerically solving the Poisson-type equation in the Fourier space, while centered finite differences on a rotated grid are chosen for the computation of the modified Fourier–Green’s operator associated with the Lippmann–Schwinger-type equation. By comparing different methods with analytical solutions for an edge dislocation in a composite material, it is found that the present spectral method is accurate, devoid of any numerical oscillation, and efficient even for an infinite phase elastic contrast like a hole embedded in a matrix containing a dislocation. The present FFT method is then used to simulate physical cases such as the elastic fields of dislocation dipoles located near the matrix/inclusion interface in a 2D composite material and the ones due to dislocation loop distributions surrounding cubic inclusions in 3D composite material. In these configurations, the spectral method allows investigating accurately the elastic interactions and image stresses due to dislocation fields in the presence of elastic inhomogeneities.

Fast Fourier Transform Co-Processor (FFTC)- Towards Embedded GFLOPs

NASA Astrophysics Data System (ADS)

Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Wite, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland

2012-08-01

Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim of offloading these transformations from general purpose processors and thereby boosting the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In the frame of the ESA activity “Fast Fourier Transform DSP Co-processor (FFTC)” (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: production of prototypes of a space qualified version of the commercial PowerFFT chip, called FFTC, based on the PowerFFT IP; and the development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC, including the Controller FPGA and SpaceWire interfaces, to verify the FFTC function and performance. The FFTC chip performs its calculations with floating point precision. Stand-alone, it is capable of computing FFTs of up to 1K complex samples in length in only 10 μs. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4 Gbit/s. When connected to up to 4 EDAC protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The presentation will give an overview of the project, including the results of the validation of the FFTC ASIC prototypes.
Fast Fourier Transform Co-processor (FFTC), towards embedded GFLOPs

NASA Astrophysics Data System (ADS)

Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Witte, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland; Kopp, Nicholas

2012-10-01

Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim of offloading these transformations from general purpose processors and thereby boosting the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In the frame of the ESA activity "Fast Fourier Transform DSP Co-processor (FFTC)" (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: • Production of prototypes of a space qualified version of the commercial PowerFFT chip called FFTC based on the PowerFFT IP. • The development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC including the Controller FPGA and SpaceWire interfaces to verify the FFTC function and performance. The FFTC chip performs its calculations with floating point precision. Stand-alone, it is capable of computing FFTs of up to 1K complex samples in length in only 10 μs. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4 Gbit/s. When connected to up to 4 EDAC protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The paper will give an overview of the project, including the results of the validation of the FFTC ASIC prototypes.
Applications Performance on NAS Intel Paragon XP/S-15

NASA Technical Reports Server (NTRS)

Saini, Subhash; Simon, Horst D.; Copper, D. M. (Technical Monitor)

1994-01-01

The Numerical Aerodynamic Simulation (NAS) Systems Division received an Intel Touchstone Sigma prototype model Paragon XP/S-15 in February 1993. The i860 XP microprocessor, with an integrated floating point unit and operating in dual-instruction mode, gives a peak performance of 75 million floating point operations per second (MFLOPS) for 64-bit floating point arithmetic. It is used in the Paragon XP/S-15 which has been installed at NAS, NASA Ames Research Center. The NAS Paragon has 208 nodes and its peak performance is 15.6 GFLOPS. Here, we report on early experience using the Paragon XP/S-15. We have tested its performance using both kernels and applications of interest to NAS. We have measured the performance of BLAS 1, 2 and 3, both assembly-coded and Fortran-coded, on the NAS Paragon XP/S-15. Furthermore, we have investigated the performance of a single-node one-dimensional FFT, a distributed two-dimensional FFT and a distributed three-dimensional FFT. Finally, we measured the performance of the NAS Parallel Benchmarks (NPB) on the Paragon and compared it with the performance obtained on other highly parallel machines, such as the CM-5, CRAY T3D, IBM SP1, etc. In particular, we investigated the following issues, which can strongly affect the performance of the Paragon: a. Impact of the operating system: Intel currently uses as a default the operating system OSF/1 AD from the Open Software Foundation. The paging of Open Software Foundation (OSF) server at 22 MB to make more memory available for the application degrades the performance. We found that when the limit of 26 MB per node out of 32 MB available is reached, the application is paged out of main memory using virtual memory. When the application starts paging, the performance is considerably reduced. We found that dynamic memory allocation can help application performance under certain circumstances. b. Impact of data cache on the i860/XP: We measured the performance of the BLAS, both assembly-coded and Fortran-coded. We found that the measured performance of assembly-coded BLAS is much less than what the memory bandwidth limitation would predict. The influence of data cache on different sizes of vectors is also investigated using one-dimensional FFTs. c. Impact of processor layout: There are several different ways processors can be laid out within the two-dimensional grid of processors on the Paragon. We have used the FFT example to investigate performance differences based on processor layout.
FPGA-based voltage and current dual drive system for high frame rate electrical impedance tomography.

PubMed

Khan, Shadab; Manwaring, Preston; Borsic, Andrea; Halter, Ryan

2015-04-01

Electrical impedance tomography (EIT) is used to image the electrical property distribution of a tissue under test. An EIT system comprises complex hardware and software modules, which are typically designed for a specific application. Upgrading these modules is a time-consuming process, and requires rigorous testing to ensure proper functioning of new modules with the existing ones. To this end, we developed a modular and reconfigurable data acquisition (DAQ) system using National Instruments' (NI) hardware and software modules, which offer inherent compatibility over generations of hardware and software revisions. The system can be configured to use up to 32 channels. This EIT system can be used to interchangeably apply a current or voltage signal, and measure the tissue response in a semi-parallel fashion. A novel signal averaging algorithm and a 512-point fast Fourier transform (FFT) computation block were implemented on the FPGA. FFT output bins were classified as signal or noise. Signal bins constitute a tissue's response to a pure or mixed tone signal. Signal bins' data can be used for traditional applications, as well as synchronous frequency-difference imaging. Noise bins were used to compute noise power on the FPGA. Noise power represents a metric of signal quality, and can be used to ensure proper tissue-electrode contact. Allocation of these computationally expensive tasks to the FPGA reduced the required bandwidth between the PC and the FPGA for high frame rate EIT. In the 16-channel configuration, with a signal-averaging factor of 8, the DAQ frame rate at 100 kHz exceeded 110 frames/s, and the signal-to-noise ratio exceeded 90 dB across the spectrum. Reciprocity error was found to be for frequencies up to 1 MHz. Static imaging experiments were performed on a high-conductivity inclusion placed in a saline filled tank; the inclusion was clearly localized in the reconstructions obtained for both absolute current and voltage mode data.
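The classification of FFT bins into signal and noise can be sketched in software. This is a minimal illustration, not the FPGA design: it assumes the bin indices of the applied tones are known in advance, treats every other non-DC bin as noise, and uses a synthetic 512-sample record. The name `classify_bins` and all signal parameters are assumptions.

```python
import numpy as np

def classify_bins(samples, signal_bins):
    """Split the one-sided power spectrum into signal power and noise power.

    signal_bins lists the FFT bin indices that carry the applied tone(s);
    every other non-DC bin is treated as noise.
    """
    power = np.abs(np.fft.rfft(samples)) ** 2
    is_signal = np.zeros(power.shape, dtype=bool)
    is_signal[list(signal_bins)] = True
    is_noise = ~is_signal
    is_noise[0] = False  # ignore the DC offset
    signal_power = power[is_signal].sum()
    noise_power = power[is_noise].sum()
    snr_db = 10.0 * np.log10(signal_power / noise_power)
    return signal_power, noise_power, snr_db

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 512
    t = np.arange(n)
    # One tone placed exactly on bin 20, plus weak white noise.
    x = np.cos(2 * np.pi * 20 * t / n) + 1e-3 * rng.standard_normal(n)
    print(classify_bins(x, [20]))
```

Placing the tone exactly on a bin avoids spectral leakage, so the remaining bins give a direct noise power figure that can serve as a contact-quality metric.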
Real-time implementing wavefront reconstruction for adaptive optics

NASA Astrophysics Data System (ADS)

Wang, Caixia; Li, Mei; Wang, Chunhong; Zhou, Luchun; Jiang, Wenhan

2004-12-01

The capability of real-time wavefront reconstruction is important for an adaptive optics (AO) system. The bandwidth of the system and the real-time processing ability of the wavefront processor are mainly affected by the speed of calculation. The system requires a sufficient number of subapertures and a high sampling frequency to compensate for atmospheric turbulence, and the number of reconstruction operations increases accordingly. Since the performance of an AO system improves as the calculation latency decreases, it is necessary to study how to increase the speed of wavefront reconstruction. There are two ways to improve the real-time performance of the reconstruction. One is to transform the wavefront reconstruction matrix, for example by wavelet or FFT. The other is to enhance the performance of the processing element. Analysis shows that the former method reduces latency at the cost of reconstruction precision. In this article, the latter method is adopted. Based on the characteristics of the wavefront reconstruction algorithm, a systolic array implemented on an FPGA is designed to perform real-time wavefront reconstruction. The system delay is reduced greatly by the use of pipelining and parallel processing. The minimum latency of reconstruction is the reconstruction calculation of one subaperture.
An Improved Extrapolation Scheme for Truncated CT Data Using 2D Fourier-Based Helgason-Ludwig Consistency Conditions.

PubMed

Xia, Yan; Berger, Martin; Bauer, Sebastian; Hu, Shiyang; Aichert, Andre; Maier, Andreas

2017-01-01

We improve data extrapolation for truncated computed tomography (CT) projections by using Helgason-Ludwig (HL) consistency conditions that mathematically describe the overlap of information between projections. First, we theoretically derive a 2D Fourier representation of the HL consistency conditions from their original formulation (projection moment theorem), for both parallel-beam and fan-beam imaging geometry. The derivation result indicates that there is a zero energy region forming a double-wedge shape in 2D Fourier domain. This observation is also referred to as the Fourier property of a sinogram in the previous literature. The major benefit of this representation is that the consistency conditions can be efficiently evaluated via 2D fast Fourier transform (FFT). Then, we suggest a method that extrapolates the truncated projections with data from a uniform ellipse of which the parameters are determined by optimizing these consistency conditions. The forward projection of the optimized ellipse can be used to complete the truncation data. The proposed algorithm is evaluated using simulated data and reprojections of clinical data. Results show that the root mean square error (RMSE) is reduced substantially, compared to a state-of-the-art extrapolation method.
An Improved Extrapolation Scheme for Truncated CT Data Using 2D Fourier-Based Helgason-Ludwig Consistency Conditions

PubMed Central

Berger, Martin; Bauer, Sebastian; Hu, Shiyang; Aichert, Andre

2017-01-01

We improve data extrapolation for truncated computed tomography (CT) projections by using Helgason-Ludwig (HL) consistency conditions that mathematically describe the overlap of information between projections. First, we theoretically derive a 2D Fourier representation of the HL consistency conditions from their original formulation (projection moment theorem), for both parallel-beam and fan-beam imaging geometry. The derivation result indicates that there is a zero energy region forming a double-wedge shape in 2D Fourier domain. This observation is also referred to as the Fourier property of a sinogram in the previous literature. The major benefit of this representation is that the consistency conditions can be efficiently evaluated via 2D fast Fourier transform (FFT). Then, we suggest a method that extrapolates the truncated projections with data from a uniform ellipse of which the parameters are determined by optimizing these consistency conditions. The forward projection of the optimized ellipse can be used to complete the truncation data. The proposed algorithm is evaluated using simulated data and reprojections of clinical data. Results show that the root mean square error (RMSE) is reduced substantially, compared to a state-of-the-art extrapolation method. PMID:28808441
Effect of frontal facial type and sex on preferred chin projection.

PubMed

Choi, Jin-Young; Kim, Taeyun; Kim, Hyung-Mo; Lee, Sang-Hoon; Cho, Il-Sik; Baek, Seung-Hak

2017-03-01

To investigate the effects of frontal facial type (FFT) and sex on preferred chin projection (CP) in three-dimensional (3D) facial images. Six 3D facial images were acquired using a 3D facial scanner (euryprosopic [Eury-FFT], mesoprosopic [Meso-FFT], and leptoprosopic [Lepto-FFT] for each sex). After normal CP in each 3D facial image was set to 10° of the facial profile angle (glabella-subnasale-pogonion), CPs were morphed by gradations of 2° from normal (moderately protrusive [6°], slightly protrusive [8°], slightly retrusive [12°], and moderately retrusive [14°]). Seventy-five dental students (48 men and 27 women) were asked to rate the CPs (6°, 8°, 10°, 12°, and 14°) from the most to least preferred in each 3D image. Statistical analyses included the Kolmogorov-Smirnov test, Kruskal-Wallis test, and Bonferroni correction. No significant difference was observed in the distribution of preferred CP in the same FFT between male and female evaluators. In Meso-FFT, the normal CP was the most preferred without any sex difference. However, in Eury-FFT, the slightly protrusive CP was favored in male 3D images, but the normal CP was preferred in female 3D images. In Lepto-FFT, the normal CP was favored in male 3D images, whereas the slightly retrusive CP was favored in female 3D images. The mean preferred CP angle differed significantly according to FFT (Eury-FFT: male, 8.7°, female, 9.9°; Meso-FFT: male, 9.8°, female, 10.7°; Lepto-FFT: male, 10.8°, female, 11.4°; p < 0.001). Our findings might serve as guidelines for setting the preferred CP according to FFT and sex.
Fast parallel algorithm for slicing STL based on pipeline

NASA Astrophysics Data System (ADS)

Ma, Xulong; Lin, Feng; Yao, Bo

2016-05-01

In the additive manufacturing field, current research on data processing mainly focuses on the slicing of large STL files or complicated CAD models. A parallel algorithm has great advantages for improving efficiency and reducing slicing time. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. In this paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of the number of threads and the number of layers are investigated in a series of experiments. The experimental results show that the number of threads and the number of layers are two significant factors for the speedup ratio. Speedup increases with the number of threads, in good agreement with Amdahl's law, and speedup also increases with the number of layers, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. Another parallel algorithm based on data parallelism is used in the experiments to show that the pipeline parallel mode is more efficient. Finally, a case study demonstrates the performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm makes full use of the multi-core CPU hardware and accelerates the slicing process; compared with the data parallel slicing algorithm, it achieves a much higher speedup ratio and efficiency.
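The two scaling laws named in the abstract can be written down directly. The sketch below gives their standard textbook forms; the parallel fraction of 0.9 and the thread counts are illustrative assumptions, not values measured in the paper.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: fixed problem size, speedup limited by the serial part."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

def gustafson_speedup(parallel_fraction, workers):
    """Gustafson's law: problem size grows with the number of workers."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers

if __name__ == "__main__":
    p = 0.9  # assumed fraction of the work that can run in parallel
    for threads in (1, 2, 4, 8, 16):
        print(threads,
              round(amdahl_speedup(p, threads), 3),
              round(gustafson_speedup(p, threads), 3))
```

Under Amdahl's law the speedup for a fixed model saturates at 1 / (1 - p) as threads are added, while under Gustafson's law a larger workload, such as more layers, keeps the added threads busy.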
A fast and accurate frequency estimation algorithm for sinusoidal signal with harmonic components

NASA Astrophysics Data System (ADS)

Hu, Jinghua; Pan, Mengchun; Zeng, Zhidun; Hu, Jiafei; Chen, Dixiang; Tian, Wugang; Zhao, Jianqiang; Du, Qingfa

2016-10-01

Frequency estimation is a fundamental problem in many applications, such as traditional vibration measurement, power system supervision, and microelectromechanical system sensor control. In this paper, a fast and accurate frequency estimation algorithm is proposed to deal with the low efficiency of traditional methods. The proposed algorithm consists of coarse and fine frequency estimation steps, and we demonstrate that applying a modified zero-crossing technique achieves coarse frequency estimation (locating the peak of the FFT amplitude) more efficiently than conventional searching methods. Thus, the proposed estimation algorithm requires fewer hardware and software resources and can achieve even higher efficiency as the amount of experimental data increases. Experimental results with a modulated magnetic signal show that the root mean square error of frequency estimation is below 0.032 Hz with the proposed algorithm, which has lower computational complexity and better global performance than conventional frequency estimation methods.
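A coarse estimate from zero crossings can be sketched as below. This is the plain zero-crossing method with linear interpolation, shown only to illustrate the idea; the abstract does not describe the paper's modification, and harmonics or noise that add extra crossings would need handling that is omitted here. The name `zero_crossing_frequency` and the test signal are assumptions.

```python
import math

def zero_crossing_frequency(samples, fs):
    """Coarse frequency from the spacing of rising zero crossings."""
    crossings = []
    for n in range(1, len(samples)):
        a, b = samples[n - 1], samples[n]
        if a < 0.0 <= b:
            # Linear interpolation of the crossing instant between n-1 and n.
            crossings.append((n - 1) + a / (a - b))
    if len(crossings) < 2:
        raise ValueError("need at least two rising zero crossings")
    periods = len(crossings) - 1
    return fs * periods / (crossings[-1] - crossings[0])

if __name__ == "__main__":
    fs, f0 = 1000.0, 50.3  # assumed sample rate and tone frequency in Hz
    x = [math.sin(2 * math.pi * f0 * n / fs + 0.4) for n in range(1000)]
    print(zero_crossing_frequency(x, fs))
```

The loop is a single pass with one comparison per sample, which is why a zero-crossing step can be cheaper than computing and searching a full FFT spectrum for the coarse estimate.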
Optimizing Excited-State Electronic-Structure Codes for Intel Knights Landing: A Case Study on the BerkeleyGW Software

DOE Office of Scientific and Technical Information (OSTI.GOV)

Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek

2016-10-06

We profile and optimize calculations performed with the BerkeleyGW code on the Xeon-Phi architecture. BerkeleyGW depends both on hand-tuned critical kernels as well as on BLAS and FFT libraries. We describe the optimization process and performance improvements achieved. We discuss a layered parallelization strategy to take advantage of vector, thread and node-level parallelism. We discuss locality changes (including the consequence of the lack of L3 cache) and effective use of the on-package high-bandwidth memory. We show preliminary results on Knights-Landing including a roofline study of code performance before and after a number of optimizations. We find that the GW method is particularly well-suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-wave components, band-pairs, and frequencies.
A parallel algorithm for the eigenvalues and eigenvectors for a general complex matrix

NASA Technical Reports Server (NTRS)

Shroff, Gautam

1989-01-01

A new parallel Jacobi-like algorithm is developed for computing the eigenvalues of a general complex matrix. Most parallel methods for this problem typically display only linear convergence. Sequential norm-reducing algorithms also exist, and they display quadratic convergence in most cases. The new algorithm is a parallel form of the norm-reducing algorithm due to Eberlein. It is proven that the asymptotic convergence rate of this algorithm is quadratic. Numerical experiments are presented which demonstrate the quadratic convergence of the algorithm, and certain situations where the convergence is slow are also identified. The algorithm promises to be very competitive on a variety of parallel architectures.
Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Brouwer, Randall Jay

1991-01-01

The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.
Efficient convolutional sparse coding

DOEpatents

Wohlberg, Brendt

2017-06-20

Computationally efficient algorithms may be applied for fast dictionary learning solving the convolutional sparse coding problem in the Fourier domain. More specifically, efficient convolutional sparse coding may be derived within an alternating direction method of multipliers (ADMM) framework that utilizes fast Fourier transforms (FFT) to solve the main linear system in the frequency domain. Such algorithms may enable a significant reduction in computational cost over conventional approaches by implementing a linear solver for the most critical and computationally expensive component of the conventional iterative algorithm. The theoretical computational cost of the algorithm may be reduced from O(M^3 N) to O(MN log N), where N is the dimensionality of the data and M is the number of elements in the dictionary. This significant improvement in efficiency may greatly increase the range of problems that can practically be addressed via convolutional sparse representations.
Iterative algorithms for large sparse linear systems on parallel computers

NASA Technical Reports Server (NTRS)

Adams, L. M.

1982-01-01

Algorithms are developed for assembling in parallel the sparse systems of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed, and results of this model for the algorithms are given.
Performance comparison of the Prophecy (forecasting) Algorithm in FFT form for unseen feature and time-series prediction

NASA Astrophysics Data System (ADS)

Jaenisch, Holger; Handley, James

2013-06-01

We introduce a generalized numerical prediction and forecasting algorithm. We have previously published it for malware byte sequence feature prediction and generalized distribution modeling for disparate test article analysis. We show how non-trivial non-periodic extrapolation of a numerical sequence (forecast and backcast) from the starting data is possible. Our ancestor-progeny prediction can yield new options for evolutionary programming. Our equations enable analytical integrals and derivatives to any order. Interpolation is controllable from smooth continuous to fractal structure estimation. We show how our generalized trigonometric polynomial can be derived using a Fourier transform.
3-Dimensional stereo implementation of photoacoustic imaging based on a new image reconstruction algorithm without using discrete Fourier transform

NASA Astrophysics Data System (ADS)

Ham, Woonchul; Song, Chulgyu

2017-05-01

In this paper, we propose a new three-dimensional stereo image reconstruction algorithm for a photoacoustic medical imaging system. We also introduce and discuss a new theoretical algorithm based on the physical concept of the Radon transform. The key concept of the proposed theoretical algorithm is to evaluate the possibility that an acoustic source exists within a search region by using the geometric distance between each sensor element of the acoustic detector and the corresponding search region, denoted by a grid. We derive the mathematical equation for the magnitude of this existence possibility, which can be used to implement the proposed algorithm. We derive the mathematical equations of the proposed algorithm for both the one-dimensional and the two-dimensional sensing array cases. k-wave simulation data are used to compare the image quality of the proposed algorithm with that of the conventional algorithm, in which the FFT is required. The k-wave Matlab simulation results demonstrate the effectiveness of the proposed reconstruction algorithm.
A class of parallel algorithms for computation of the manipulator inertia matrix

NASA Technical Reports Server (NTRS)

Fijany, Amir; Bejczy, Antal K.

1989-01-01

Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
Simulation for noise cancellation using LMS adaptive filter

NASA Astrophysics Data System (ADS)

Lee, Jia-Haw; Ooi, Lu-Ean; Ko, Ying-Hao; Teoh, Choe-Yung

2017-06-01

In this paper, the fundamental algorithm of noise cancellation, the Least Mean Square (LMS) algorithm, is studied and enhanced with an adaptive filter. A simulation of noise cancellation using the LMS adaptive filter algorithm is developed. The noise-corrupted speech signal and the engine noise signal are used as inputs for the LMS adaptive filter algorithm. The filtered signal is compared to the original noise-free speech signal in order to highlight the level of attenuation of the noise signal. The result shows that the noise signal is successfully canceled by the developed adaptive filter. The difference between the noise-free speech signal and the filtered signal is calculated, and the outcome implies that the filtered signal approaches the noise-free speech signal under adaptive filtering. The frequency range of the noise successfully canceled by the LMS adaptive filter algorithm is determined by performing a Fast Fourier Transform (FFT) on the signals. The LMS adaptive filter algorithm shows significant noise cancellation in the lower frequency range.
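The noise-cancellation setup described above can be sketched with the standard LMS update. In this minimal version the primary input is speech plus noise, the reference input is the noise alone, and the error signal is taken as the cleaned output. The tap count, step size `mu`, and function name are illustrative assumptions, not values from the paper.

```python
def lms_cancel(primary, reference, taps=4, mu=0.01):
    """Adaptive noise canceller: returns e[n] = primary[n] - w . x[n]."""
    w = [0.0] * taps  # adaptive filter weights, start at zero
    cleaned = []
    for n in range(len(primary)):
        # Most recent `taps` reference samples, newest first, zero-padded at the start.
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))  # current noise estimate
        e = primary[n] - y                        # error = cleaned sample
        # Standard LMS weight update: w <- w + 2 * mu * e * x
        w = [wk + 2.0 * mu * e * xk for wk, xk in zip(w, x)]
        cleaned.append(e)
    return cleaned
```

Because the speech is uncorrelated with the reference noise, the weights converge toward the path that maps the reference onto the noise in the primary input, so the error signal approaches the noise-free speech.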
Introducing parallelism to histogramming functions for GEM systems

NASA Astrophysics Data System (ADS)

Krawczyk, Rafał D.; Czarski, Tomasz; Kolasinski, Piotr; Pozniak, Krzysztof T.; Linczuk, Maciej; Byszuk, Adrian; Chernyshova, Maryna; Juszczyk, Bartlomiej; Kasprowicz, Grzegorz; Wojenski, Andrzej; Zabolotny, Wojciech

2015-09-01

This article assesses the potential for parallelizing histogramming algorithms in a GEM detector system. Histogramming and preprocessing algorithms in MATLAB were analyzed with regard to adding parallelism. A preliminary implementation of parallel strip histogramming resulted in a speedup. An analysis of the parallelizability of the algorithms is presented. Potential hardware and software support for implementing the parallel algorithm is discussed.

Parallelizing flow-accumulation calculations on graphics processing units—From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm

NASA Astrophysics Data System (ADS)

Qin, Cheng-Zhi; Zhan, Lijun

2012-06-01

As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. 
The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU-based algorithms based on existing parallelization strategies.
The development of a scalable parallel 3-D CFD algorithm for turbomachinery. M.S. Thesis Final Report

NASA Technical Reports Server (NTRS)

Luke, Edward Allen

1993-01-01

Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
New correction procedures for the fast field program which extend its range

NASA Technical Reports Server (NTRS)

West, M.; Sack, R. A.

1990-01-01

A fast field program (FFP) algorithm was developed based on the method of Lee et al., for the prediction of sound pressure level from low frequency, high intensity sources. In order to permit accurate predictions at distances greater than 2 km, new correction procedures have had to be included in the algorithm. Certain functions, whose Hankel transforms can be determined analytically, are subtracted from the depth dependent Green's function. The distance response is then obtained as the sum of these transforms and the Fast Fourier Transformation (FFT) of the residual k dependent function. One procedure, which permits the elimination of most complex exponentials, has allowed significant changes in the structure of the FFP algorithm, which has resulted in a substantial reduction in computation time.
Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm

NASA Astrophysics Data System (ADS)

Backes, Werner; Wetzel, Susanne

In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schnorr-Euchner algorithm for today's multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread version and about 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.
Parallel O(log n) algorithms for open- and closed-chain rigid multibody systems based on a new mass matrix factorization technique

NASA Technical Reports Server (NTRS)

Fijany, Amir

1993-01-01

In this paper, parallel O(log n) algorithms for computation of rigid multibody dynamics are developed. These parallel algorithms are derived by parallelization of new O(n) algorithms for the problem. The underlying feature of these O(n) algorithms is a drastically different strategy for decomposition of interbody force which leads to a new factorization of the mass matrix (M). Specifically, it is shown that a factorization of the inverse of the mass matrix in the form of the Schur Complement is derived as $M^{-1} = C - B^{*} A^{-1} B$, wherein matrices C, A, and B are block tridiagonal matrices. The new O(n) algorithm is then derived as a recursive implementation of this factorization of $M^{-1}$. For the closed-chain systems, similar factorizations and O(n) algorithms for computation of the Operational Space Mass Matrix $\Lambda$ and its inverse $\Lambda^{-1}$ are also derived. It is shown that these O(n) algorithms are strictly parallel, that is, they are less efficient than other algorithms for serial computation of the problem. But, to our knowledge, they are the only known algorithms that can be parallelized and that lead to both time- and processor-optimal parallel algorithms for the problem, i.e., parallel O(log n) algorithms with O(n) processors. The developed parallel algorithms, in addition to their theoretical significance, are also practical from an implementation point of view due to their simple architectural requirements.
High-Performance Psychometrics: The Parallel-E Parallel-M Algorithm for Generalized Latent Variable Models. Research Report. ETS RR-16-34

ERIC Educational Resources Information Center

von Davier, Matthias

2016-01-01

This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
Fast algorithm of adaptive Fourier series

NASA Astrophysics Data System (ADS)

Gao, You; Ku, Min; Qian, Tao

2018-05-01

Adaptive Fourier decomposition (AFD, precisely 1-D AFD or Core-AFD) was originated for the goal of positive frequency representations of signals. It achieved the goal and at the same time offered fast decompositions of signals. There then arose several types of AFDs. AFD merged with the greedy algorithm idea, and in particular, motivated the so-called pre-orthogonal greedy algorithm (Pre-OGA) that was proven to be the most efficient greedy algorithm. The cost of the advantages of the AFD type decompositions is, however, the high computational complexity due to the involvement of maximal selections of the dictionary parameters. The present paper offers one formulation of the 1-D AFD algorithm by building the FFT algorithm into it. Accordingly, the algorithm complexity is reduced, from the original $\mathcal{O}(M N^2)$ to $\mathcal{O}(M N\log_2 N)$, where $N$ denotes the number of the discretization points on the unit circle and $M$ denotes the number of points in $[0,1)$. This greatly enhances the applicability of AFD. Experiments are carried out to show the high efficiency of the proposed algorithm.
ProperCAD: A portable object-oriented parallel environment for VLSI CAD

NASA Technical Reports Server (NTRS)

Ramkumar, Balkrishna; Banerjee, Prithviraj

1993-01-01

Most parallel algorithms for VLSI CAD proposed to date have one important drawback: they work efficiently only on machines that they were designed for. As a result, algorithms designed to date are dependent on the architecture for which they are developed and do not port easily to other parallel architectures. A new project under way to address this problem is described. A Portable object-oriented parallel environment for CAD algorithms (ProperCAD) is being developed. The objectives of this research are (1) to develop new parallel algorithms that run in a portable object-oriented environment (CAD algorithms using a general purpose platform for portable parallel programming called CARM are being developed, and a C++ environment that is truly object-oriented and specialized for CAD applications is also being developed); and (2) to design the parallel algorithms around a good sequential algorithm with a well-defined parallel-sequential interface (permitting the parallel algorithm to benefit from future developments in sequential algorithms). One CAD application that has been implemented as part of the ProperCAD project, flat VLSI circuit extraction, is described. The algorithm, its implementation, and its performance on a range of parallel machines are discussed in detail. It currently runs on an Encore Multimax, a Sequent Symmetry, Intel iPSC/2 and i860 hypercubes, a NCUBE 2 hypercube, and a network of Sun Sparc workstations. Performance data for other applications that were developed are provided: namely test pattern generation for sequential circuits, parallel logic synthesis, and standard cell placement.
Using the fast fourier transform in binding free energy calculations.

PubMed

Nguyen, Trung Hai; Zhou, Huan-Xiang; Minh, David D L

2018-04-30

According to implicit ligand theory, the standard binding free energy is an exponential average of the binding potential of mean force (BPMF), an exponential average of the interaction energy between the unbound ligand ensemble and a rigid receptor. Here, we use the fast Fourier transform (FFT) to efficiently evaluate BPMFs by calculating interaction energies when rigid ligand configurations from the unbound ensemble are discretely translated across rigid receptor conformations. Results for standard binding free energies between T4 lysozyme and 141 small organic molecules are in good agreement with previous alchemical calculations based on (1) a flexible complex (R≈0.9 for 24 systems) and (2) flexible ligand with multiple rigid receptor configurations (R≈0.8 for 141 systems). While the FFT is routinely used for molecular docking, to our knowledge this is the first time that the algorithm has been used for rigorous binding free energy calculations. © 2017 Wiley Periodicals, Inc.
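The reason an FFT helps with discrete translations can be shown on a toy problem: the interaction energy for every shift of a ligand grid over a receptor grid is a cross-correlation, which the correlation theorem turns into a pointwise product in the frequency domain. The sketch below uses a 1-D periodic grid whose length is a power of two and a small recursive FFT so that it stays dependency-free. The grids, names, and 1-D setting are illustrative assumptions, not the authors' implementation.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1.0 if invert else -1.0
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def translational_energies(receptor, ligand):
    """E[t] = sum_i receptor[(i + t) % n] * ligand[i] for all t, via FFT."""
    n = len(receptor)
    if n != len(ligand) or n & (n - 1) != 0:
        raise ValueError("grids must have equal power-of-two length")
    fr = fft([complex(v) for v in receptor])
    fl = fft([complex(v) for v in ligand])
    prod = [x * y.conjugate() for x, y in zip(fr, fl)]
    return [v.real / n for v in fft(prod, invert=True)]


def direct_energies(receptor, ligand):
    """Same quantity by the O(n^2) definition, for checking."""
    n = len(receptor)
    return [sum(receptor[(i + t) % n] * ligand[i] for i in range(n)) for t in range(n)]
```

The FFT route costs O(n log n) for all n translations at once, compared with O(n^2) for the direct sum; the same idea extends to 3-D grids.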
Frequency characteristics of the heart rate variability produced by Cheyne-Stokes respiration during 24-hr ambulatory electrocardiographic monitoring.

PubMed

Ichimaru, Y; Yanaga, T

1989-06-01

Spectral analysis of heart rates during 24-hr ambulatory electrocardiographic monitoring has been carried out to characterize the heart rate spectral components of Cheyne-Stokes respiration (CSR) by using fast Fourier transformation (FFT). Eight patients with congestive heart failure were selected for the study. FFT analyses have been performed for 614.4 sec. From the power spectrum, five parameters were extracted to characterize the CSR. The low peak frequencies in eight subjects were between 0.0179 Hz (56 sec) and 0.0081 Hz (123 sec). The criteria used to detect CSR are as follows: if (i) the LFPA/ULFA ratio is above 1.0 and (ii) the LFPP/MLFP ratio is above 4.0, then the power spectrum is suggestive of CSR. We conclude that the automatic detection of CSR by heart rate spectral analysis during ambulatory ECG monitoring may afford a tool for the evaluation of patients with congestive heart failure.
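The numbers in this abstract can be cross-checked with a few lines: a spectral peak at frequency f corresponds to a cycle length of 1/f, and an FFT over a window of T seconds has a bin spacing of 1/T. The sketch below also encodes the two-ratio detection rule as stated. The function names are hypothetical, the ratios are assumed to be computed elsewhere, and treating the 614.4 sec as a single FFT window is an assumption.

```python
def period_seconds(frequency_hz):
    """Cycle length corresponding to a spectral peak."""
    return 1.0 / frequency_hz


def frequency_resolution_hz(window_seconds):
    """FFT bin spacing for a single analysis window of the given length."""
    return 1.0 / window_seconds


def suggests_csr(lfpa_ulfa_ratio, lfpp_mlfp_ratio):
    """Both thresholds quoted in the abstract must be exceeded."""
    return lfpa_ulfa_ratio > 1.0 and lfpp_mlfp_ratio > 4.0


print(round(period_seconds(0.0179)), round(period_seconds(0.0081)))
print(frequency_resolution_hz(614.4))
```

The two periods round to the 56 sec and 123 sec quoted in the abstract, and the bin spacing of roughly 0.0016 Hz is fine enough to separate peaks in that range.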
GPU Optimizations for a Production Molecular Docking Code*

PubMed Central

Landaverde, Raphael; Herbordt, Martin C.

2015-01-01

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4-core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users. PMID:26594667
GPU Optimizations for a Production Molecular Docking Code.

PubMed

Landaverde, Raphael; Herbordt, Martin C

2014-09-01

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4-core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.
Multitasking TORT under UNICOS: Parallel performance models and measurements

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barnett, A.; Azmy, Y.Y.

1999-09-27

The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.
Multitasking TORT Under UNICOS: Parallel Performance Models and Measurements

DOE Office of Scientific and Technical Information (OSTI.GOV)

Azmy, Y.Y.; Barnett, D.A.

1999-09-27

The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.
Chemical mass transport between fluid fine tailings and the overlying water cover of an oil sands end pit lake

NASA Astrophysics Data System (ADS)

Dompierre, Kathryn A.; Barbour, S. Lee; North, Rebecca L.; Carey, Sean K.; Lindsay, Matthew B. J.

2017-06-01

Fluid fine tailings (FFT) are a principal by-product of the bitumen extraction process at oil sands mines. Base Mine Lake (BML), the first full-scale demonstration oil sands end pit lake (EPL), contains approximately 1.9 × 10^8 m^3 of FFT stored under a water cover within a decommissioned mine pit. Chemical mass transfer from the FFT to the water cover can occur via two key processes: (1) advection-dispersion driven by tailings settlement; and (2) FFT disturbance due to fluid movement in the water cover. Dissolved chloride (Cl) was used to evaluate the water cover mass balance and to track mass transport within the underlying FFT based on field sampling and numerical modeling. Results indicated that FFT was the dominant Cl source to the water cover and that the FFT is exhibiting a transient advection-dispersion mass transport regime with intermittent disturbance near the FFT-water interface. The advective pore water flux was estimated by the mass balance to be 0.002 m^3 m^-2 d^-1, which represents 0.73 m of FFT settlement per year. However, the FFT pore water Cl concentrations and corresponding mass transport simulations indicated that advection rates and disturbance depths vary between sample locations. The disturbance depth was estimated to vary with location between 0.75 and 0.95 m. This investigation provides valuable insight for assessing the geochemical evolution of the water cover and performance of EPLs as an oil sands reclamation strategy.
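The step from the daily pore water flux to the annual settlement figure is a unit conversion, sketched here under the assumption (made for illustration) that the pore water volume expelled per unit area equals the loss in FFT height and that a year is taken as 365 days:

```latex
0.002\ \mathrm{m^{3}\,m^{-2}\,d^{-1}} \times 365\ \mathrm{d\,yr^{-1}} = 0.73\ \mathrm{m\,yr^{-1}}
```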
Energy Efficient GNSS Signal Acquisition Using Singular Value Decomposition (SVD).

PubMed

Bermúdez Ordoñez, Juan Carlos; Arnaldo Valdés, Rosa María; Gómez Comendador, Fernando

2018-05-16

A significant challenge in global navigation satellite system (GNSS) signal processing is a requirement for a very high sampling rate. The recently-emerging compressed sensing (CS) theory makes processing GNSS signals at a low sampling rate possible if the signal has a sparse representation in a certain space. Based on CS and SVD theories, an algorithm for sampling GNSS signals at a rate much lower than the Nyquist rate and reconstructing the compressed signal is proposed in this research, which is validated after the output from that process still performs signal detection using the standard fast Fourier transform (FFT) parallel frequency space search acquisition. The sparse representation of the GNSS signal is the most important precondition for CS, by constructing a rectangular Toeplitz matrix (TZ) of the transmitted signal, calculating the left singular vectors using SVD from the TZ, to achieve sparse signal representation. Next, obtaining the M-dimensional observation vectors based on the left singular vectors of the SVD, which are equivalent to the sampler operator in standard compressive sensing theory, the signal can be sampled below the Nyquist rate, and can still be reconstructed via ℓ1 minimization with accuracy using convex optimization. As an added value, there is a GNSS signal acquisition enhancement effect by retaining the useful signal and filtering out noise by projecting the signal into the most significant proper orthogonal modes (PODs) which are the optimal distributions of signal power. The algorithm is validated with real recorded signals, and the results show that the proposed method is effective for sampling, reconstructing intermediate frequency (IF) GNSS signals in the time discrete domain.
Energy Efficient GNSS Signal Acquisition Using Singular Value Decomposition (SVD)

PubMed Central

Arnaldo Valdés, Rosa María; Gómez Comendador, Fernando

2018-01-01

A significant challenge in global navigation satellite system (GNSS) signal processing is a requirement for a very high sampling rate. The recently-emerging compressed sensing (CS) theory makes processing GNSS signals at a low sampling rate possible if the signal has a sparse representation in a certain space. Based on CS and SVD theories, an algorithm for sampling GNSS signals at a rate much lower than the Nyquist rate and reconstructing the compressed signal is proposed in this research, which is validated after the output from that process still performs signal detection using the standard fast Fourier transform (FFT) parallel frequency space search acquisition. The sparse representation of the GNSS signal is the most important precondition for CS, by constructing a rectangular Toeplitz matrix (TZ) of the transmitted signal, calculating the left singular vectors using SVD from the TZ, to achieve sparse signal representation. Next, obtaining the M-dimensional observation vectors based on the left singular vectors of the SVD, which are equivalent to the sampler operator in standard compressive sensing theory, the signal can be sampled below the Nyquist rate, and can still be reconstructed via ℓ1 minimization with accuracy using convex optimization. As an added value, there is a GNSS signal acquisition enhancement effect by retaining the useful signal and filtering out noise by projecting the signal into the most significant proper orthogonal modes (PODs) which are the optimal distributions of signal power. The algorithm is validated with real recorded signals, and the results show that the proposed method is effective for sampling, reconstructing intermediate frequency (IF) GNSS signals in the time discrete domain. PMID:29772731
An exploratory analysis of PubMed's free full-text limit on citation retrieval for clinical questions.

PubMed

Krieger, Mary M; Richter, Randy R; Austin, Tricia M

2008-10-01

The research sought to determine (1) how use of the PubMed free full-text (FFT) limit affects citation retrieval and (2) how use of the FFT limit impacts the types of articles and levels of evidence retrieved. Four clinical questions based on a research agenda for physical therapy were searched in PubMed both with and without the use of the FFT limit. Retrieved citations were examined for relevancy to each question. Abstracts of relevant citations were reviewed to determine the types of articles and levels of evidence. Descriptive analysis was used to compare the total number of citations, number of relevant citations, types of articles, and levels of evidence both with and without the use of the FFT limit. Across all 4 questions, the FFT limit reduced the number of citations to 11.1% of the total number of citations retrieved without the FFT limit. Additionally, high-quality evidence such as systematic reviews and randomized controlled trials were missed when the FFT limit was used. Health sciences librarians play a key role in educating users about the potential impact the FFT limit has on the number of citations, types of articles, and levels of evidence retrieved.
Scalable Parallel Density-based Clustering and Applications

NASA Astrophysics Data System (ADS)

Patwary, Mostofa Ali

2014-04-01

Recently, density-based clustering algorithms (DBSCAN and OPTICS) have attracted significant attention from the scientific community due to their unique capability of discovering arbitrarily shaped clusters and eliminating noise data. These algorithms have several applications that require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms is extremely challenging, as they exhibit an inherently sequential data access order and unbalanced workloads, resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
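The link between DBSCAN and connected components mentioned in this abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal sequential Python illustration, assuming a brute-force neighbor search and hypothetical names (`dbscan_union_find`, `eps`, `min_pts`). Core points that lie within `eps` of each other are merged with a disjoint-set (union-find) structure, and because the merges may be applied in any order, this formulation avoids the sequential expansion order of classical DBSCAN.

```python
from math import dist


def dbscan_union_find(points, eps, min_pts):
    """Return one label per point: a cluster id, or -1 for noise.

    Illustrative sketch only; names and the brute-force neighbor
    search are assumptions, not taken from the cited work.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # The smaller index becomes the root, so labels are deterministic.
            parent[max(ri, rj)] = min(ri, rj)

    # Brute-force eps-neighborhoods (each point counts itself).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]

    # Connected components over core points: merge order does not matter.
    for i in range(n):
        if core[i]:
            for j in neighbors[i]:
                if core[j]:
                    union(i, j)

    # Core points take their component root as label; border points join
    # the cluster of one neighboring core point; everything else is noise.
    labels = [-1] * n
    for i in range(n):
        if core[i]:
            labels[i] = find(i)
        else:
            for j in neighbors[i]:
                if core[j]:
                    labels[i] = find(j)
                    break
    return labels
```

In a parallel setting, the union operations over disjoint subsets of points could be performed concurrently and the partial forests merged afterwards; the sketch above shows only the sequential logic.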
Breaking the polar-nonpolar division in solvation free energy prediction.

PubMed

Wang, Bao; Wang, Chengzhang; Wu, Kedi; Wei, Guo-Wei

2018-02-05

Implicit solvent models divide solvation free energies into polar and nonpolar additive contributions, whereas polar and nonpolar interactions are inseparable and nonadditive. We present a feature functional theory (FFT) framework to break this ad hoc division. The essential ideas of FFT are as follows: (i) representability assumption: there exists a microscopic feature vector that can uniquely characterize and distinguish one molecule from another; (ii) feature-function relationship assumption: the macroscopic features, including solvation free energy, of a molecule are functionals of microscopic feature vectors; and (iii) similarity assumption: molecules with similar microscopic features have similar macroscopic properties, such as solvation free energies. Based on these assumptions, solvation free energy prediction is carried out in the following protocol. First, we construct a molecular microscopic feature vector that is efficient in characterizing the solvation process using quantum mechanics and Poisson-Boltzmann theory. Microscopic feature vectors are combined with macroscopic features, that is, physical observables, to form extended feature vectors. Additionally, we partition a solvation dataset into queries according to molecular compositions. Moreover, for each target molecule, we adopt a machine learning algorithm for its nearest neighbor search, based on the selected microscopic feature vectors. Finally, from the extended feature vectors of obtained nearest neighbors, we construct a functional of solvation free energy, which is employed to predict the solvation free energy of the target molecule. The proposed FFT model has been extensively validated via a large dataset of 668 molecules. The leave-one-out test gives an optimal root-mean-square error (RMSE) of 1.05 kcal/mol. FFT predictions of SAMPL0, SAMPL1, SAMPL2, SAMPL3, and SAMPL4 challenge sets deliver the RMSEs of 0.61, 1.86, 1.64, 0.86, and 1.14 kcal/mol, respectively. Using a test set of 94 molecules and its associated training set, the present approach was carefully compared with a classic solvation model based on weighted solvent accessible surface area. © 2017 Wiley Periodicals, Inc.

Parallel consistent labeling algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Samal, A.; Henderson, T.

Mackworth and Freuder have analyzed the time complexity of several constraint satisfaction algorithms. Mohr and Henderson have given new algorithms, AC-4 and PC-3, for arc and path consistency, respectively, and have shown that the arc consistency algorithm is optimal in time complexity and of the same order space complexity as the earlier algorithms. In this paper, they give parallel algorithms for solving node and arc consistency. They show that any parallel algorithm for enforcing arc consistency in the worst case must have O(na) sequential steps, where n is the number of nodes, and a is the number of labels per node. They give several parallel algorithms to do arc consistency. It is also shown that they all have optimal time complexity. The results of running the parallel algorithms on a BBN Butterfly multiprocessor are also presented.
Parallel CE/SE Computations via Domain Decomposition

NASA Technical Reports Server (NTRS)

Himansu, Ananda; Jorgenson, Philip C. E.; Wang, Xiao-Yen; Chang, Sin-Chung

2000-01-01

This paper describes the parallelization strategy and achieved parallel efficiency of an explicit time-marching algorithm for solving conservation laws. The Space-Time Conservation Element and Solution Element (CE/SE) algorithm for solving the 2D and 3D Euler equations is parallelized with the aid of domain decomposition. The parallel efficiency of the resultant algorithm on a Silicon Graphics Origin 2000 parallel computer is checked.
Comparison of multihardware parallel implementations for a phase unwrapping algorithm

NASA Astrophysics Data System (ADS)

Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo

2018-04-01

Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and more accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function; the minimization is achieved by means of a serial Gauss-Seidel type algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is of the parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel at the same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel architectures, such as multicore CPUs, the Xeon Phi coprocessor, and Nvidia graphics processing units. In all the cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance of the developed parallel versions.
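The chessboard (red/black) ordering described in this abstract can be sketched on a much simpler problem. The following Python function is a hypothetical illustration, not the paper's phase-unwrapping cost function: it applies one red-black relaxation sweep to a discrete Poisson problem on a grid with fixed boundary values. The names `red_black_sweep`, `u`, `f`, and `h` are assumptions made for this example.

```python
def red_black_sweep(u, f, h=1.0):
    """One red-black relaxation sweep for -laplace(u) = f on a 2D grid.

    u and f are lists of lists of floats; boundary entries of u are kept
    fixed. Illustrative sketch only, not the cited phase-unwrapping solver.
    """
    rows, cols = len(u), len(u[0])
    for color in (0, 1):  # 0 = "red" pixels, 1 = "black" pixels
        # Every pixel of the current color depends only on its four
        # neighbors, which all have the other color, so this double loop
        # could be executed in parallel without data races.
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                if (i + j) % 2 == color:
                    u[i][j] = 0.25 * (
                        u[i - 1][j] + u[i + 1][j]
                        + u[i][j - 1] + u[i][j + 1]
                        + h * h * f[i][j]
                    )
    return u
```

Repeating the sweep until the values stop changing gives the relaxed solution; on a GPU or multicore CPU, each color pass would typically map to one parallel kernel or loop.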
A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations

NASA Technical Reports Server (NTRS)

Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw

2005-01-01

A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel efficiency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
A parallel algorithm for the two-dimensional time fractional diffusion equation with implicit difference method.

PubMed

Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie

2014-01-01

It is very time consuming to solve fractional differential equations. The computational complexity of the two-dimensional time fractional diffusion equation (2D-TFDE) with the iterative implicit finite difference method is O(M_x M_y N^2). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on a single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on a single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that parallel computing technology will become a very basic method for computationally intensive fractional applications in the near future.
Risk factors for pericardial effusion after chemoradiotherapy for thoracic esophageal cancer—comparison of four-field technique and traditional two opposed fields technique

PubMed Central

Takata, Noriko; Kataoka, Masaaki; Hamamoto, Yasushi; Tsuruoka, Shintaro; Kanzaki, Hiromitsu; Uwatsu, Kotaro; Nagasaki, Kei; Mochizuki, Teruhito

2018-01-01

Pericardial effusion is an important late toxicity after concurrent chemoradiotherapy (CCRT) for locally advanced esophageal cancer. We investigated the clinical and dosimetric factors that were related to pericardial effusion among patients with thoracic esophageal cancer who were treated with definitive CCRT using the two opposed fields technique (TFT) or the four-field technique (FFT), as well as the effectiveness of FFT. During 2007–2015, 169 patients with middle and/or lower thoracic esophageal cancer received definitive CCRT, and 94 patients were evaluable (51 FFT cases and 43 TFT cases). Pericardial effusion was observed in 74 patients (79%) and appeared at 1–18.5 months (median: 5.25 months) after CCRT. The 1-year incidences of pericardial effusions were 73.2% and 76.7% in the FFT and TFT groups, respectively (P = 0.6395). The mean doses to the pericardium were 28.6 Gy and 31.8 Gy in the FFT and TFT groups, respectively (P = 0.0259), and the V40 Gy proportions were 33.5% and 48.2% in the FFT and TFT groups, respectively (P < 0.0001). Grade 3 pericardial effusion was not observed in patients with a pericardial V40 Gy of <40%, or in patients who were treated using the FFT. Although the mean pericardial dose and V40 Gy in the FFT group were smaller than those in the TFT group, the incidences of pericardial effusion after CCRT were similar in both groups. As symptomatic pericardial effusion was not observed in patients with a pericardial V40 Gy of <40% or in the FFT group, it appears that FFT with a V40 Gy of <40% could help minimize symptomatic pericardial effusion. PMID:29659940
Risk factors for pericardial effusion after chemoradiotherapy for thoracic esophageal cancer-comparison of four-field technique and traditional two opposed fields technique.

PubMed

Takata, Noriko; Kataoka, Masaaki; Hamamoto, Yasushi; Tsuruoka, Shintaro; Kanzaki, Hiromitsu; Uwatsu, Kotaro; Nagasaki, Kei; Mochizuki, Teruhito

2018-05-01

Pericardial effusion is an important late toxicity after concurrent chemoradiotherapy (CCRT) for locally advanced esophageal cancer. We investigated the clinical and dosimetric factors that were related to pericardial effusion among patients with thoracic esophageal cancer who were treated with definitive CCRT using the two opposed fields technique (TFT) or the four-field technique (FFT), as well as the effectiveness of FFT. During 2007-2015, 169 patients with middle and/or lower thoracic esophageal cancer received definitive CCRT, and 94 patients were evaluable (51 FFT cases and 43 TFT cases). Pericardial effusion was observed in 74 patients (79%) and appeared at 1-18.5 months (median: 5.25 months) after CCRT. The 1-year incidences of pericardial effusions were 73.2% and 76.7% in the FFT and TFT groups, respectively (P = 0.6395). The mean doses to the pericardium were 28.6 Gy and 31.8 Gy in the FFT and TFT groups, respectively (P = 0.0259), and the V40 Gy proportions were 33.5% and 48.2% in the FFT and TFT groups, respectively (P < 0.0001). Grade 3 pericardial effusion was not observed in patients with a pericardial V40 Gy of <40%, or in patients who were treated using the FFT. Although the mean pericardial dose and V40 Gy in the FFT group were smaller than those in the TFT group, the incidences of pericardial effusion after CCRT were similar in both groups. As symptomatic pericardial effusion was not observed in patients with a pericardial V40 Gy of <40% or in the FFT group, it appears that FFT with a V40 Gy of <40% could help minimize symptomatic pericardial effusion.
Transforming a Fructan:Fructan 6G-Fructosyltransferase from Perennial Ryegrass into a Sucrose:Sucrose 1-Fructosyltransferase

PubMed Central

Lasseur, Bertrand; Schroeven, Lindsey; Lammens, Willem; Le Roy, Katrien; Spangenberg, German; Manduzio, Hélène; Vergauwen, Rudy; Lothier, Jérémy; Prud'homme, Marie-Pascale; Van den Ende, Wim

2009-01-01

Fructosyltransferases (FTs) synthesize fructans, fructose polymers accumulating in economically important cool-season grasses and cereals. FTs might be crucial for plant survival under stress conditions in species in which fructans represent the major form of reserve carbohydrate, such as perennial ryegrass (Lolium perenne). Two FT types can be distinguished: those using sucrose (S-type enzymes: sucrose:sucrose 1-fructosyltransferase [1-SST], sucrose:fructan 6-fructosyltransferase) and those using fructans (F-type enzymes: fructan:fructan 1-fructosyltransferase [1-FFT], fructan:fructan 6G-fructosyltransferase [6G-FFT]) as preferential donor substrate. Here, we report, to our knowledge for the first time, the transformation of an F-type enzyme (6G-FFT/1-FFT) into an S-type enzyme (1-SST) using perennial ryegrass 6G-FFT/1-FFT (Lp6G-FFT/1-FFT) and 1-SST (Lp1-SST) as model enzymes. This transformation was accomplished by mutating three amino acids (N340D, W343R, and S415N) in the vicinity of the active site of Lp6G-FFT/1-FFT. In addition, effects of each amino acid mutation alone or in combination have been studied. Our results strongly suggest that the amino acid at position 343 (tryptophan or arginine) can greatly determine the donor substrate characteristics by influencing the position of the amino acid at position 340. Moreover, the presence of arginine-343 negatively affects the formation of neofructan-type linkages. The results are compared with recent findings on donor substrate selectivity within the group of plant cell wall invertases and fructan exohydrolases. Taken together, these insights contribute to our knowledge of structure/function relationships within plant family 32 glycosyl hydrolases and open the way to the production of tailor-made fructans on a larger scale. PMID:18952861
GPUs benchmarking in subpixel image registration algorithm

NASA Astrophysics Data System (ADS)

Sanz-Sabater, Martin; Picazo-Bueno, Jose Angel; Micó, Vicente; Ferrerira, Carlos; Granero, Luis; Garcia, Javier

2015-05-01

Image registration techniques are used in many scientific fields, such as medical imaging and optical metrology. The most straightforward way to calculate the shift between two images is to compute their cross-correlation and take the position of its highest value. The shift resolution is then given in whole pixels, which may not be enough for certain applications. Better results can be achieved by interpolating both images up to the desired resolution and applying the same technique, but the memory needed by the system is significantly higher. To avoid this memory consumption, we implement a subpixel shifting method based on the FFT. Starting from the original images, a subpixel shift can be achieved by multiplying the discrete Fourier transform of an image by a linear phase with different slopes. This method is highly time-consuming because checking each candidate shift requires new calculations. The algorithm is highly parallelizable and therefore very suitable for high performance computing systems. GPU (Graphics Processing Unit) accelerated computing became very popular more than ten years ago because GPUs offer hundreds of computational cores in a reasonably cheap card. In our case, we register the shift between two images by making a first approximation with FFT-based correlation and then refining it to subpixel resolution using the technique described above. We consider this a `brute force' method. We present a benchmark of the algorithm consisting of a first approximation (pixel resolution) followed by subpixel refinement, decreasing the shift step in every loop so that a high resolution is achieved in a few steps. The program is executed on three different computers. Finally, we present the results of the computation with different kinds of CPUs and GPUs, checking the accuracy of the method and the time consumed on each computer, and discussing the advantages and disadvantages of using GPUs.
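The core idea of this abstract, shifting by a fractional number of samples through multiplication of the discrete Fourier transform by a linear phase and then searching candidate shifts with a decreasing step, can be sketched in one dimension. The code below is a hedged illustration, not the authors' GPU implementation: it uses a naive O(N²) DFT from the Python standard library instead of an FFT, and the names `dft`, `fourier_shift`, and `estimate_shift`, as well as the search window of 11 candidates per level, are assumptions made for this example.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(N^2)); stands in for an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def fourier_shift(signal, shift):
    """Circularly shift a real 1D signal by a possibly fractional amount."""
    n = len(signal)
    shifted = []
    for k, value in enumerate(dft(signal)):
        freq = k if k <= n // 2 else k - n  # signed frequency index
        # Linear phase ramp: its slope is proportional to the shift.
        shifted.append(value * cmath.exp(-2j * cmath.pi * freq * shift / n))
    # The real part is kept; for band-limited input the imaginary part
    # is only rounding noise.
    return [v.real for v in dft(shifted, inverse=True)]


def estimate_shift(reference, moved, start=0.0, step=1.0, levels=4):
    """Coarse-to-fine brute-force search for the shift of `moved`.

    Each level tests 11 candidate shifts around the current best estimate
    and keeps the one with the highest correlation, then divides the step
    by 10 (an assumed schedule for this sketch).
    """
    best = start
    for _ in range(levels):
        candidates = [best + m * step for m in range(-5, 6)]
        best = max(
            candidates,
            key=lambda s: sum(
                a * b for a, b in zip(fourier_shift(reference, s), moved)
            ),
        )
        step /= 10.0
    return best
```

Each candidate shift is evaluated independently of the others, which is the property that makes this kind of brute-force search a natural fit for GPUs.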
Improved grid-noise removal in single-frame digital moiré 3D shape measurement

NASA Astrophysics Data System (ADS)

Mohammadi, Fatemeh; Kofman, Jonathan

2016-11-01

A single-frame grid-noise removal technique was developed for application in single-frame digital-moiré 3D shape measurement. The ability of the stationary wavelet transform (SWT) to prevent oscillation artifacts near discontinuities, and the ability of the Fourier transform (FFT) applied to wavelet coefficients to separate grid-noise from useful image information, were combined in a new technique, SWT-FFT, to remove grid-noise from moiré-pattern images generated by digital moiré. In comparison to previous grid-noise removal techniques in moiré, SWT-FFT avoids the requirement for mechanical translation of optical components and capture of multiple frames, to enable single-frame moiré-based measurement. Experiments using FFT, Discrete Wavelet Transform (DWT), DWT-FFT, and SWT-FFT were performed on moiré-pattern images containing grid noise, generated by digital moiré, for several test objects. SWT-FFT had the best performance in removing high-frequency grid-noise, both straight and curved lines, minimizing artifacts, and preserving the moiré pattern without blurring and degradation. SWT-FFT also had the lowest noise amplitude in the reconstructed height and lowest roughness index for all test objects, indicating best grid-noise removal in comparison to the other techniques.
An exploratory analysis of PubMed's free full-text limit on citation retrieval for clinical questions

PubMed Central

Krieger, Mary M.; Richter, Randy R.; Austin, Tricia M.

2008-01-01

Objective: The research sought to determine (1) how use of the PubMed free full-text (FFT) limit affects citation retrieval and (2) how use of the FFT limit impacts the types of articles and levels of evidence retrieved. Methods: Four clinical questions based on a research agenda for physical therapy were searched in PubMed both with and without the use of the FFT limit. Retrieved citations were examined for relevancy to each question. Abstracts of relevant citations were reviewed to determine the types of articles and levels of evidence. Descriptive analysis was used to compare the total number of citations, number of relevant citations, types of articles, and levels of evidence both with and without the use of the FFT limit. Results: Across all 4 questions, the FFT limit reduced the number of citations to 11.1% of the total number of citations retrieved without the FFT limit. Additionally, high-quality evidence such as systematic reviews and randomized controlled trials were missed when the FFT limit was used. Conclusions: Health sciences librarians play a key role in educating users about the potential impact the FFT limit has on the number of citations, types of articles, and levels of evidence retrieved. PMID:18974812
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Choudhary, Alok Nidhi

1989-01-01

Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

NASA Technical Reports Server (NTRS)

Sun, Xian-He

1997-01-01

Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2000, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is to 1) develop highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, and 3) incorporate newly developed algorithms into actual simulation packages. The work plan has been well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) adopting a mathematical geometry which has a better capacity to describe the fluid, and (2) using a compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
An efficient parallel algorithm: Poststack and prestack Kirchhoff 3D depth migration using flexi-depth iterations

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh

2015-07-01

This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for the current class of multicore architectures. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism; however, when it comes to 3D data migration, the resource requirements of the algorithm increase as the data size increases. This challenges its practical implementation even on current generation high performance computing systems. Therefore, a smart parallelization approach is essential to handle 3D data for migration. The most compute-intensive part of the Kirchhoff depth migration algorithm is the calculation of traveltime tables, due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for post and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over the conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM series of supercomputers. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.
Evaluation of finite difference and FFT-based solutions of the transport of intensity equation.

PubMed

Zhang, Hongbo; Zhou, Wen-Jing; Liu, Ying; Leber, Donald; Banerjee, Partha; Basunia, Mahmudunnabi; Poon, Ting-Chung

2018-01-01

A finite difference method is proposed for solving the transport of intensity equation. Simulation results show that although slower than fast Fourier transform (FFT)-based methods, finite difference methods are able to reconstruct the phase with better accuracy due to relaxed assumptions for solving the transport of intensity equation relative to FFT methods. Finite difference methods are also more flexible than FFT methods in dealing with different boundary conditions.
Parallel Algorithms for Least Squares and Related Computations.

DTIC Science & Technology

1991-03-22

for dense computations in linear algebra. The work has recently been published in a general reference book on parallel algorithms by SIAM. AFO SR...written his Ph.D. dissertation with the principal investigator. (See publication 6.) • Parallel Algorithms for Dense Linear Algebra Computations. Our...and describe and to put into perspective a selection of the more important parallel algorithms for numerical linear algebra. We give a major new
Genetic algorithms using SISAL parallel programming language

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tejada, S.

1994-05-06

Genetic algorithms are a mathematical optimization technique developed by John Holland at the University of Michigan [1]. The SISAL programming language possesses many of the characteristics desired to implement genetic algorithms. SISAL is a deterministic, functional programming language which is inherently parallel. Because SISAL is functional and based on mathematical concepts, genetic algorithms can be efficiently translated into the language. Several of the steps involved in genetic algorithms, such as mutation, crossover, and fitness evaluation, can be parallelized using SISAL. In this paper I will discuss the implementation and performance of parallel genetic algorithms in SISAL.
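The three steps named in this abstract (mutation, crossover, and fitness evaluation) can be written as side-effect-free functions that act on one individual or one pair at a time, which is the property that makes them easy to evaluate in parallel. The sketch below is in Python rather than SISAL and does not reproduce the report's implementation; the toy objective (counting 1 bits), the elitism step, and all names and parameter values are assumptions chosen for illustration.

```python
import random


def fitness(bits):
    """Toy objective (assumed for this sketch): number of 1 bits."""
    return sum(bits)


def crossover(a, b, rng):
    """Single-point crossover of two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(pop_size=30, length=20, generations=40, rate=0.02, seed=0):
    """Run a small genetic algorithm; return (best individual, history)."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)
    ]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        # Fitness evaluation is independent per individual (a parallel map).
        scores = list(map(fitness, population))
        ranked = [
            ind
            for _, ind in sorted(
                zip(scores, population), key=lambda p: p[0], reverse=True
            )
        ]
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]
        # Each child depends only on its own pair of parents.
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rate, rng)
            for _ in range(pop_size - 1)
        ]
        population = [ranked[0]] + children  # elitism: keep the current best
    return max(population, key=fitness), history
```

Because the best individual is always carried over, the recorded best fitness never decreases from one generation to the next.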
On the suitability of the connection machine for direct particle simulation

NASA Technical Reports Server (NTRS)

Dagum, Leonard

1990-01-01

The algorithmic structure of the vectorizable Stanford particle simulation (SPS) method was examined and reformulated in data parallel form. Some of the SPS algorithms can be directly translated to data parallel, but several of the vectorizable algorithms have no direct data parallel equivalent. This requires the development of new, strictly data parallel algorithms. In particular, a new sorting algorithm is developed to identify collision candidates in the simulation and a master/slave algorithm is developed to minimize communication cost in large table look up. Validation of the method is undertaken through test calculations for thermal relaxation of a gas, shock wave profiles, and shock reflection from a stationary wall. A qualitative measure is provided of the performance of the Connection Machine for direct particle simulation. The massively parallel architecture of the Connection Machine is found quite suitable for this type of calculation. However, there are difficulties in taking full advantage of this architecture because of the lack of a broad-based tradition of data parallel programming. An important outcome of this work has been new data parallel algorithms specifically of use for direct particle simulation but which also expand the data parallel diction.
Performance of MDockPP in CAPRI rounds 28-29 and 31-35 including the prediction of water-mediated interactions.

PubMed

Xu, Xianjin; Qiu, Liming; Yan, Chengfei; Ma, Zhiwei; Grinter, Sam Z; Zou, Xiaoqin

2017-03-01

Protein-protein interactions are either through direct contacts between two binding partners or mediated by structural waters. Both direct contacts and water-mediated interactions are crucial to the formation of a protein-protein complex. During the recent CAPRI rounds, a novel parallel searching strategy for predicting water-mediated interactions was introduced into our protein-protein docking method, MDockPP. Briefly, an FFT-based docking algorithm is employed in generating putative binding modes, and an iteratively derived statistical potential-based scoring function, ITScorePP, in conjunction with biological information is used to assess and rank the binding modes. Up to 10 binding modes are selected as the initial protein-protein complex structures for MD simulations in explicit solvent. Water molecules near the interface are clustered based on the snapshots extracted from independent equilibrated trajectories. Then, protein-ligand docking is employed for a parallel search for water molecules near the protein-protein interface. The water molecules generated by ligand docking and the clustered water molecules generated by MD simulations are merged, referred to as the predicted structural water molecules. Here, we report the performance of this protocol for CAPRI rounds 28-29 and 31-35 containing 20 valid docking targets and 11 scoring targets. In the docking experiments, we predicted correct binding modes for nine targets, including one high-accuracy, two medium-accuracy, and six acceptable predictions. Regarding the two targets for the prediction of water-mediated interactions, we achieved models ranked as "excellent" in accordance with the CAPRI evaluation criteria; one of these two targets is considered as a difficult target for structural water prediction. Proteins 2017; 85:424-434. © 2016 Wiley Periodicals, Inc.
Runtime support for parallelizing data mining algorithms

NASA Astrophysics Data System (ADS)

Jin, Ruoming; Agrawal, Gagan

2002-03-01

With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm.
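The full replication technique named in this abstract can be illustrated with a minimal sketch. It assumes a simple item-counting reduction; the names `count_chunk` and `full_replication_count` are hypothetical and are not taken from the authors' runtime system. Each worker updates its own private copy of the reduction object, so no locking is needed, and the copies are combined once at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk):
    """Build a private reduction object (a Counter) for one chunk of records."""
    local = Counter()
    for record in chunk:
        for item in record:
            local[item] += 1
    return local


def full_replication_count(records, workers=4):
    """Count item occurrences with one private Counter per worker, merged at the end."""
    # Strided split of the input so that every worker gets its own chunk.
    chunks = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # global combination of the replicated copies
    return total


records = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["c"]]
counts = full_replication_count(records, workers=2)
print(dict(counts))
```

The sketch only shows the structure of the technique. CPython threads share one interpreter lock, so a real speedup would need processes or a runtime like the one the abstract describes; the locking variants trade the memory cost of replication for synchronization on a single shared copy.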

Theoretical Studies of a Transient Stimulated Raman Amplifier

DTIC Science & Technology

1988-04-19

follows: I. contour plot of pump intensity . 1. sections of pump intensity 2. sections of pump phase 3. sections of pump amplitude (real/ imag ) I...contour plot of pump FFT intensity 4. sections of pump FFT intensity 5. sections of pump FFT phase 6. sections of pump FFT amplitude (real/ imag ) II...contour plot of Stokes intensity 7. sections of Stokes intensity 8. sections of Stokes phase 9. sections of Stokes amplitude (real/ imag ) IV. contour plot
A reconfigurable multicarrier demodulator architecture

NASA Technical Reports Server (NTRS)

Kwatra, S. C.; Jamali, M. M.

1991-01-01

An architecture based on parallel and pipeline design approaches has been developed for the Frequency Division Multiple Access/Time Domain Multiplexed (FDMA/TDM) conversion system. The architecture has two main modules, namely the transmultiplexer and the demodulator. The transmultiplexer has two pipelined modules. These are the shared multiplexed polyphase filter and the Fast Fourier Transform (FFT). The demodulator consists of carrier, clock, and data recovery modules which are interactive. Progress on the design of the MultiCarrier Demodulator (MCD) using commercially available chips and Application Specific Integrated Circuits (ASIC) and simulation studies using Viewlogic software will be presented at the conference.
Special purpose computer system with highly parallel pipelines for flow visualization using holography technology

NASA Astrophysics Data System (ADS)

Masuda, Nobuyuki; Sugie, Takashige; Ito, Tomoyoshi; Tanaka, Shinjiro; Hamada, Yu; Satake, Shin-ichi; Kunugi, Tomoaki; Sato, Kazuho

2010-12-01

We have designed a PC cluster system with special purpose computer boards for visualization of fluid flow using digital holographic particle tracking velocimetry (DHPTV). In this board, there is a Field Programmable Gate Array (FPGA) chip in which is installed a pipeline for calculating the intensity of an object from a hologram by fast Fourier transform (FFT). This cluster system can create 1024 reconstructed images from a 1024×1024-grid hologram in 0.77 s. It is expected that this system will contribute to the analysis of fluid flow using DHPTV.
Optimal Padding for the Two-Dimensional Fast Fourier Transform

NASA Technical Reports Server (NTRS)

Dean, Bruce H.; Aronstein, David L.; Smith, Jeffrey S.

2011-01-01

One-dimensional Fast Fourier Transform (FFT) operations work fastest on grids whose size is divisible by a power of two. Because of this, padding grids (that are not already sized to a power of two) so that their size is the next highest power of two can speed up operations. While this works well for one-dimensional grids, it does not work well for two-dimensional grids. For a two-dimensional grid, there are certain pad sizes that work better than others. Therefore, the need exists to generalize a strategy for determining optimal pad sizes. There are three steps in the FFT algorithm. The first is to perform a one-dimensional transform on each row in the grid. The second step is to transpose the resulting matrix. The third step is to perform a one-dimensional transform on each row in the resulting grid. Steps one and three both benefit from padding the row to the next highest power of two, but the second step needs a novel approach. An algorithm was developed that struck a balance between optimizing the grid pad size with prime factors that are small (which are optimal for one-dimensional operations), and with prime factors that are large (which are optimal for two-dimensional operations). This algorithm optimizes based on average run times, and is not fine-tuned for any specific application. It increases the number of times that processor-requested data is found in the set-associative processor cache. Cache retrievals are 4-10 times faster than conventional memory retrievals. The tested implementation of the algorithm resulted in faster execution times on all platforms tested, but with varying sized grids. This is because various computer architectures process commands differently. The test grid was 512x512. Using a 540x540 grid on a Pentium V processor, the code ran 30 percent faster. On a PowerPC, a 256x256 grid worked best. A Core2Duo computer preferred either a 1040x1040 (15 percent faster) or a 1008x1008 (30 percent faster) grid.
There are many industries that can benefit from this algorithm, including optics, image-processing, signal-processing, and engineering applications.
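The two padding ideas contrasted in this abstract can be sketched as follows. This is not the authors' algorithm, which balances small and large prime factors using measured run times; it only compares padding to the next power of two with padding to the next size whose prime factors are all small. The function names and the `max_prime` cutoff of 7 are assumptions made for illustration.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def largest_prime_factor(n: int) -> int:
    """Largest prime factor of n (returns 1 for n == 1)."""
    largest, p = 1, 2
    while p * p <= n:
        while n % p == 0:
            largest, n = p, n // p
        p += 1
    return n if n > 1 else largest


def next_smooth_size(n: int, max_prime: int = 7) -> int:
    """Smallest size >= n whose prime factors are all <= max_prime."""
    m = n
    while largest_prime_factor(m) > max_prime:
        m += 1
    return m


for size in (512, 540, 1001):
    print(size, next_power_of_two(size), next_smooth_size(size))
```

Padding a 1001-point row to a power of two means going to 1024, whereas the small-prime rule stops at 1008, one of the grid sizes the abstract reports as favorable. Which candidate is actually fastest on a given machine still has to be measured.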
Parallel and Preemptable Dynamically Dimensioned Search Algorithms for Single and Multi-objective Optimization in Water Resources

NASA Astrophysics Data System (ADS)

Tolson, B.; Matott, L. S.; Gaffoor, T. A.; Asadzadeh, M.; Shafii, M.; Pomorski, P.; Xu, X.; Jahanpour, M.; Razavi, S.; Haghnegahdar, A.; Craig, J. R.

2015-12-01

We introduce asynchronous parallel implementations of the Dynamically Dimensioned Search (DDS) family of algorithms including DDS, discrete DDS, PA-DDS and DDS-AU. These parallel algorithms are unique from most existing parallel optimization algorithms in the water resources field in that parallel DDS is asynchronous and does not require an entire population (set of candidate solutions) to be evaluated before generating and then sending a new candidate solution for evaluation. One key advance in this study is developing the first parallel PA-DDS multi-objective optimization algorithm. The other key advance is enhancing the computational efficiency of solving optimization problems (such as model calibration) by combining a parallel optimization algorithm with the deterministic model pre-emption concept. These two efficiency techniques can only be combined because of the asynchronous nature of parallel DDS. Model pre-emption functions to terminate simulation model runs early, prior to completely simulating the model calibration period for example, when intermediate results indicate the candidate solution is so poor that it will definitely have no influence on the generation of further candidate solutions. The computational savings of deterministic model preemption available in serial implementations of population-based algorithms (e.g., PSO) disappear in synchronous parallel implementations as these algorithms. In addition to the key advances above, we implement the algorithms across a range of computation platforms (Windows and Unix-based operating systems from multi-core desktops to a supercomputer system) and package these for future modellers within a model-independent calibration software package called Ostrich as well as MATLAB versions. 
Results across multiple platforms and multiple case studies (from 4 to 64 processors) demonstrate the vast improvement over serial DDS-based algorithms and highlight the important role model pre-emption plays in the performance of parallel, pre-emptable DDS algorithms. Case studies include single- and multiple-objective optimization problems in water resources model calibration and in many cases linear or near linear speedups are observed.
Parallel transformation of K-SVD solar image denoising algorithm

NASA Astrophysics Data System (ADS)

Liang, Youwen; Tian, Yu; Li, Mei

2017-02-01

Images obtained by observing the Sun through a large telescope always suffer from noise because of the low SNR. The K-SVD denoising algorithm can effectively remove Gaussian white noise. Training dictionaries for sparse representations is a time-consuming task because of the large size of the data involved and the complexity of the training algorithms. In this paper, OpenMP parallel programming is used to transform the serial algorithm into a parallel version. A data parallelism model is used to transform the algorithm. The biggest change is that multiple atoms, rather than a single atom, are updated simultaneously. The denoising effect and acceleration performance are tested after completion of the parallel algorithm. The speedup of the program is 13.563 when using 16 cores. This parallel version can fully utilize multi-core CPU hardware resources, greatly reduces running time, and is easy to port to multi-core platforms.
A parallel simulated annealing algorithm for standard cell placement on a hypercube computer

NASA Technical Reports Server (NTRS)

Jones, Mark Howard

1987-01-01

A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.
Fast Fourier transform-based Retinex and alpha-rooting color image enhancement

NASA Astrophysics Data System (ADS)

Grigoryan, Artyom M.; Agaian, Sos S.; Gonzales, Analysa M.

2015-05-01

Efficiency in terms of both accuracy and speed is highly important in any system, especially when it comes to image processing. The purpose of this paper is to improve an existing implementation of multi-scale retinex (MSR) by utilizing the fast Fourier transforms (FFT) within the illumination estimation step of the algorithm to improve the speed at which Gaussian blurring filters were applied to the original input image. In addition, alpha-rooting can be used as a separate technique to achieve a sharper image in order to fuse its results with those of the retinex algorithm for the sake of achieving the best image possible as shown by the values of the considered color image enhancement measure (EMEC).
Massively parallel algorithms for real-time wavefront control of a dense adaptive optics system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fijany, A.; Milman, M.; Redding, D.

1994-12-31

In this paper massively parallel algorithms and architectures for real-time wavefront control of a dense adaptive optic system (SELENE) are presented. The authors have already shown that the computation of a near optimal control algorithm for SELENE can be reduced to the solution of a discrete Poisson equation on a regular domain. Although this represents an optimal computation, due to the large size of the system and the high sampling rate requirement, the implementation of this control algorithm poses a computationally challenging problem since it demands a sustained computational throughput of the order of 10 GFlops. They develop a novel algorithm, designated as Fast Invariant Imbedding algorithm, which offers a massive degree of parallelism with simple communication and synchronization requirements. Due to these features, this algorithm is significantly more efficient than other Fast Poisson Solvers for implementation on massively parallel architectures. The authors also discuss two massively parallel, algorithmically specialized, architectures for low-cost and optimal implementation of the Fast Invariant Imbedding algorithm.
THE TAIWAN-AMERICAN OCCULTATION SURVEY PROJECT STELLAR VARIABILITY. I. DETECTION OF LOW-AMPLITUDE δ SCUTI STARS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kim, D.-W.; Protopapas, P.; Alcock, C.

2010-02-15

We analyzed data accumulated during 2005 and 2006 by the Taiwan-American Occultation Survey (TAOS) in order to detect short-period variable stars (periods of ≲1 hr) such as δ Scuti. TAOS is designed for the detection of stellar occultation by small-size Kuiper Belt Objects and is operating four 50 cm telescopes at an effective cadence of 5 Hz. The four telescopes simultaneously monitor the same patch of the sky in order to reduce false positives. To detect short-period variables, we used the fast Fourier transform algorithm (FFT) inasmuch as the data points in TAOS light curves are evenly spaced. Using FFT, we found 41 short-period variables with amplitudes smaller than a few hundredths of a magnitude and periods of about an hour, which suggest that they are low-amplitude δ Scuti stars. The light curves of TAOS δ Scuti stars are accessible online at the Time Series Center Web site (http://timemachine.iic.harvard.edu).
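Because the light-curve samples are evenly spaced, a periodic signal shows up as a peak in the FFT magnitude spectrum. The sketch below is a minimal illustration of that idea, not the TAOS pipeline: the radix-2 `fft` and the `dominant_frequency` helper are hypothetical names, and the synthetic light curve (record length, magnitude, signal frequency) uses made-up values, with only the 5 Hz cadence taken from the abstract.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def dominant_frequency(samples, sample_rate_hz):
    """Frequency (Hz) of the strongest non-DC peak of an evenly sampled series."""
    n = len(samples)
    mean = sum(samples) / n
    spectrum = fft([s - mean for s in samples])  # remove the mean brightness first
    k = max(range(1, n // 2), key=lambda i: abs(spectrum[i]))
    return k * sample_rate_hz / n


# Synthetic light curve: 5 Hz cadence, 0.01 mag sinusoid on a constant magnitude.
rate, n = 5.0, 1024
true_freq = 16 * rate / n  # chosen to fall exactly on a frequency bin
curve = [12.0 + 0.01 * math.sin(2 * math.pi * true_freq * i / rate) for i in range(n)]
print(dominant_frequency(curve, rate), true_freq)
```

The frequency resolution is the sample rate divided by the record length, so resolving periods near one hour requires records that span several hours.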
Displacement and frequency analyses of vibratory systems

NASA Astrophysics Data System (ADS)

Low, K. H.

1995-02-01

This paper deals with the frequency and response studies of vibratory systems, which are represented by a set of n coupled second-order differential equations. The following numerical methods are used in the response analysis: central difference, fourth-order Runge-Kutta and modal methods. Data generated in the response analysis are processed to obtain the system frequencies by using the fast Fourier transform (FFT) or harmonic response methods. Two types of windows are used in the FFT analysis: rectangular and Hanning windows. Examples of systems with two, four and seven degrees of freedom are considered to illustrate the proposed algorithms. Comparisons with existing results confirm the validity of the proposed methods. The Hanning window attenuates the results that give a narrower bandwidth around the peak if compared with those using the rectangular window. It is also found that in free vibrations of a multi-mass system, the masses will vibrate in a manner that is the superposition of the natural frequencies of the system, while the system will vibrate at the driving frequency in forced vibrations.
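The effect of the two windows can be seen on a sinusoid whose frequency does not fall on an FFT bin. The sketch below is an illustrative assumption, not the paper's procedure: it uses a direct O(n^2) DFT, the periodic form of the Hann (Hanning) window, and a hypothetical `leakage_fraction` measure, namely the share of spectral power lying more than three bins from the true frequency.

```python
import cmath
import math


def dft_power(x):
    """Power spectrum |X[k]|^2 of a short real series via a direct O(n^2) DFT."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def hann(n):
    """Hann (Hanning) window coefficients, periodic form."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]


def leakage_fraction(power, peak_bin, half_width=3):
    """Fraction of spectral power lying more than half_width bins from the peak."""
    far = sum(p for k, p in enumerate(power) if abs(k - peak_bin) > half_width)
    return far / sum(power)


n, cycles = 64, 10.5  # 10.5 cycles per record: deliberately between two bins
signal = [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]
rect = dft_power(signal)  # rectangular window: the samples are used as they are
hanned = dft_power([w * s for w, s in zip(hann(n), signal)])
rect_leak = leakage_fraction(rect, cycles)
hann_leak = leakage_fraction(hanned, cycles)
print(rect_leak, hann_leak)
```

With the rectangular window a noticeable share of the power spreads into distant bins, while the Hann window keeps the power concentrated around the peak, which makes closely spaced system frequencies easier to read off.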
Characterization of physical mass transport through oil sands fluid fine tailings in an end pit lake: a multi-tracer study.

PubMed

Dompierre, Kathryn A; Barbour, S Lee

2016-06-01

Soft tailings pose substantial challenges for mine reclamation due to their high void ratios and low shear strengths, particularly for conventional terrestrial reclamation practices. Oil sands mine operators have proposed the development of end pit lakes to contain the soft tailings, called fluid fine tailings (FFT), generated when bitumen is removed from oil sands ore. End pit lakes would be constructed within mined-out pits with FFT placed below the lake water. However, the feasibility of isolating the underlying FFT has yet to be fully evaluated. Chemical constituents of interest may move from the FFT into the lake water via two key processes: (1) advective-diffusive mass transport with upward pore water flow caused by settling of the FFT; and (2) mixing created by wind events or unstable density profiles through the lake water and upper portion of the FFT. In 2013 and 2014, temperature and stable isotopes of water profiles were measured through the FFT and lake water in the first end pit lake developed by Syncrude Canada Ltd. Numerical modelling was undertaken to simulate these profiles to identify the key mechanisms controlling conservative mass transport in the FFT. Shallow mixing of the upper 1.1 m of FFT with lake water was required to explain the observed temperature and isotopic profiles. Following mixing, the re-establishment of both the temperature and isotope profiles required an upward advective flux of approximately 1.5 m/year, consistent with average FFT settling rates measured at the study site. These findings provide important insight on the ability to sequester soft tailings in an end pit lake, and offer a foundation for future research on the development of end pit lakes as an oil sands reclamation strategy. Copyright © 2016 Elsevier B.V. All rights reserved.
Application of image recognition algorithms for statistical description of nano- and microstructured surfaces

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mărăscu, V.; Dinescu, G.; Faculty of Physics, University of Bucharest, 405 Atomistilor Street, Bucharest-Magurele

In this paper we propose a statistical approach for describing the self-assembling of sub-micronic polystyrene beads on silicon surfaces, as well as the evolution of surface topography due to plasma treatments. Algorithms for image recognition are used in conjunction with Scanning Electron Microscopy (SEM) imaging of surfaces. In a first step, greyscale images of the surface covered by the polystyrene beads are obtained. Further, an adaptive thresholding method was applied for obtaining binary images. The next step consisted in automatic identification of polystyrene beads dimensions, by using Hough transform algorithm, according to beads radius. In order to analyze the uniformity of the self-assembled polystyrene beads, the squared modulus of 2-dimensional Fast Fourier Transform (2-D FFT) was applied. By combining these algorithms we obtain a powerful and fast statistical tool for analysis of micro and nanomaterials with aspect features regularly distributed on surface upon SEM examination.
Parallel Algorithms for Groebner-Basis Reduction

DTIC Science & Technology

1987-09-25

22209 ELEMENT NO. NO. NO. ACCESSION NO. 11. TITLE (Include Security Classification) * PARALLEL ALGORITHMS FOR GROEBNER -BASIS REDUCTION 12. PERSONAL...All other editions are obsolete. Productivity Engineering in the UNIXt Environment p Parallel Algorithms for Groebner -Basis Reduction Technical Report
Real-time blind image deconvolution based on coordinated framework of FPGA and DSP

NASA Astrophysics Data System (ADS)

Wang, Ze; Li, Hang; Zhou, Hua; Liu, Hongjun

2015-10-01

Image restoration plays a crucial role in several important application domains. As algorithms become more complex and their computational requirements grow, the need for accelerated implementations has risen significantly. In this paper, we focus on an efficient real-time image processing system for blind iterative deconvolution by means of the Richardson-Lucy (R-L) algorithm. We study the characteristics of the algorithm, and an image restoration processing system based on the coordinated framework of FPGA and DSP (CoFD) is presented. Single-precision floating-point processing units with small-scale cascade and special FFT/IFFT processing modules are adopted to guarantee the accuracy of the processing. Finally, comparison experiments are carried out. The system can process a blurred image of 128×128 pixels within 32 milliseconds, and is up to three or four times faster than traditional multi-DSP systems.
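The core Richardson-Lucy update can be sketched in one dimension. This is a simplified, non-blind illustration under stated assumptions: the point-spread function is taken as known, convolution is circular and computed directly rather than through the FFT/IFFT modules the abstract mentions, and the names `circular_convolve` and `richardson_lucy` are hypothetical. The blind variant in the paper additionally has to estimate the point-spread function.

```python
def circular_convolve(signal, kernel):
    """Circular convolution of two equal-length sequences (direct O(n^2) form)."""
    n = len(signal)
    return [sum(signal[i] * kernel[(j - i) % n] for i in range(n)) for j in range(n)]


def richardson_lucy(observed, psf, iterations=50):
    """Non-blind Richardson-Lucy deconvolution with a known PSF."""
    n = len(observed)
    estimate = [sum(observed) / n] * n  # flat, strictly positive starting guess
    for _ in range(iterations):
        blurred = circular_convolve(estimate, psf)
        ratio = [d / max(b, 1e-12) for d, b in zip(observed, blurred)]
        # Correlate the ratio with the PSF (convolution with the flipped kernel).
        correction = [sum(ratio[j] * psf[(j - i) % n] for j in range(n)) for i in range(n)]
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate


truth = [0.0] * 16
truth[5], truth[11] = 4.0, 2.0  # two point sources
psf = [0.5, 0.25] + [0.0] * 13 + [0.25]  # symmetric three-tap blur, sums to 1
observed = circular_convolve(truth, psf)
restored = richardson_lucy(observed, psf, iterations=200)
print([round(v, 3) for v in restored])
```

Each iteration needs two convolutions, which is why hardware implementations evaluate them in the frequency domain. The multiplicative update keeps the estimate non-negative and preserves the total flux of the observed data.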
Parallel and fault-tolerant algorithms for hypercube multiprocessors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aykanat, C.

1988-01-01

Several techniques for increasing the performance of parallel algorithms on distributed-memory message-passing multi-processor systems are investigated. These techniques are effectively implemented for the parallelization of the Scaled Conjugate Gradient (SCG) algorithm on a hypercube connected message-passing multi-processor. Significant performance improvement is achieved by using these techniques. The SCG algorithm is used for the solution phase of an FE modeling system. Almost linear speed-up is achieved, and it is shown that hypercube topology is scalable for an FE class of problem. The SCG algorithm is also shown to be suitable for vectorization, and near supercomputer performance is achieved on a vector hypercube multiprocessor by exploiting both parallelization and vectorization. Fault-tolerance issues for the parallel SCG algorithm and for the hypercube topology are also addressed.
Hybrid massively parallel fast sweeping method for static Hamilton-Jacobi equations

NASA Astrophysics Data System (ADS)

Detrixhe, Miles; Gibou, Frédéric

2016-10-01

The fast sweeping method is a popular algorithm for solving a variety of static Hamilton-Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
STS-48 Commander Creighton, in LES, stands at JSC FFT side hatch

NASA Technical Reports Server (NTRS)

1991-01-01

STS-48 Discovery, Orbiter Vehicle (OV) 103, Commander John O. Creighton, wearing a launch and entry suit (LES), stands at the side hatch of JSC's full fuselage trainer (FFT). Creighton will enter the FFT shuttle mockup through the side hatch and take his assigned position on the forward flight deck. Creighton, along with the other crewmembers, is participating in a post-landing emergency egress exercise. The FFT is located in the Mockup and Integration Laboratory (MAIL) Bldg 9A.
SU-E-J-91: FFT Based Medical Image Registration Using a Graphics Processing Unit (GPU).

PubMed

Luce, J; Hoggarth, M; Lin, J; Block, A; Roeske, J

2012-06-01

To evaluate the efficiency gains obtained from using a Graphics Processing Unit (GPU) to perform a Fourier Transform (FT) based image registration. Fourier-based image registration involves obtaining the FT of the component images, and analyzing them in Fourier space to determine the translations and rotations of one image set relative to another. An important property of FT registration is that by enlarging the images (adding additional pixels), one can obtain translations and rotations with sub-pixel resolution. The expense, however, is an increased computational time. GPUs may decrease the computational time associated with FT image registration by taking advantage of their parallel architecture to perform matrix computations much more efficiently than a Central Processor Unit (CPU). In order to evaluate the computational gains produced by a GPU, images with known translational shifts were utilized. A program was written in the Interactive Data Language (IDL; Exelis, Boulder, CO) to perform CPU-based calculations. Subsequently, the program was modified using GPU bindings (Tech-X, Boulder, CO) to perform GPU-based computation on the same system. Multiple image sizes were used, ranging from 256×256 to 2304×2304. The times required to complete the full algorithm by the CPU and GPU were benchmarked and the speed increase was defined as the ratio of the CPU-to-GPU computational time. The ratio of the CPU-to-GPU time was greater than 1.0 for all images, which indicates the GPU is performing the algorithm faster than the CPU. The smallest improvement, a 1.21 ratio, was found with the smallest image size of 256×256, and the largest speedup, a 4.25 ratio, was observed with the largest image size of 2304×2304. GPU programming resulted in a significant decrease in computational time associated with an FT image registration algorithm. The inclusion of the GPU may provide near real-time, sub-pixel registration capability. © 2012 American Association of Physicists in Medicine.
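The translation part of Fourier-based registration is commonly done by phase correlation, sketched here in one dimension. The abstract does not state which variant the authors used, so this is an assumption; the direct O(n^2) `dft` and the `estimate_shift` helper are hypothetical names, and real image registration would use 2-D FFTs and also handle rotation and sub-pixel shifts.

```python
import cmath
import math


def dft(x, sign=-1):
    """Direct O(n^2) DFT; sign=-1 is the forward transform, sign=+1 the unscaled inverse."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def estimate_shift(reference, moved):
    """Circular shift s with moved[t] == reference[(t - s) % n], found by phase correlation."""
    ref_f, mov_f = dft(reference), dft(moved)
    cross = []
    for a, b in zip(ref_f, mov_f):
        product = a.conjugate() * b
        # Keep only the phase of each bin; drop bins with (near) zero energy.
        cross.append(product / abs(product) if abs(product) > 1e-9 else 0j)
    correlation = [value.real for value in dft(cross, sign=+1)]
    # The correlation is (ideally) a single spike located at the shift.
    return max(range(len(correlation)), key=correlation.__getitem__)


n = 32
reference = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) + (t % 7) / 7 for t in range(n)]
moved = [reference[(t - 5) % n] for t in range(n)]
shift = estimate_shift(reference, moved)
print(shift)
```

Nearly all of the work is in the forward and inverse transforms, which is the part a GPU accelerates. Enlarging the images, as the abstract describes, refines the location of the correlation peak at the cost of larger transforms.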
A Motion Detection Algorithm Using Local Phase Information

PubMed Central

Lazar, Aurel A.; Ukani, Nikul H.; Zhou, Yiyin

2016-01-01

Previous research demonstrated that global phase alone can be used to faithfully represent visual scenes. Here we provide a reconstruction algorithm by using only local phase information. We also demonstrate that local phase alone can be effectively used to detect local motion. The local phase-based motion detector is akin to models employed to detect motion in biological vision, for example, the Reichardt detector. The local phase-based motion detection algorithm introduced here consists of two building blocks. The first building block measures/evaluates the temporal change of the local phase. The temporal derivative of the local phase is shown to exhibit the structure of a second order Volterra kernel with two normalized inputs. We provide an efficient, FFT-based algorithm for implementing the change of the local phase. The second processing building block implements the detector; it compares the maximum of the Radon transform of the local phase derivative with a chosen threshold. We demonstrate examples of applying the local phase-based motion detection algorithm on several video sequences. We also show how the locally detected motion can be used for segmenting moving objects in video scenes and compare our local phase-based algorithm to segmentation achieved with a widely used optic flow algorithm. PMID:26880882

Protein-ligand docking using FFT based sampling: D3R case study.

PubMed

Padhorny, Dzmitry; Hall, David R; Mirzaei, Hanieh; Mamonov, Artem B; Moghadasi, Mohammad; Alekseenko, Andrey; Beglov, Dmitri; Kozakov, Dima

2018-01-01

Fast Fourier transform (FFT) based approaches have been successful in application to modeling of relatively rigid protein-protein complexes. Recently, we have been able to adapt the FFT methodology to treatment of flexible protein-peptide interactions. Here, we report our latest attempt to expand the capabilities of the FFT approach to treatment of flexible protein-ligand interactions in application to the D3R PL-2016-1 challenge. Based on the D3R assessment, our FFT approach in conjunction with Monte Carlo minimization off-grid refinement was among the top performing methods in the challenge. The potential advantage of our method is its ability to globally sample the protein-ligand interaction landscape, which will be explored in further applications.
Parallel Algorithms for Switching Edges in Heterogeneous Graphs.

PubMed

Bhuiyan, Hasanuzzaman; Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav

2017-06-01

An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices are swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors.
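A sequential version of the edge switch operation defined in this abstract can be sketched as below. It is only the serial baseline, not the distributed-memory algorithm of the paper: the function names, the fixed random seed, and the cap on attempts are assumptions for illustration. A proposed switch is rejected when it would create a self-loop or a parallel edge, which keeps the graph simple and leaves every vertex degree unchanged.

```python
import random


def edge_switch(edges, num_switches, seed=0):
    """Apply random edge switches to a simple undirected graph, keeping it simple."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < num_switches and attempts < 100 * num_switches:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (u1, v1), (u2, v2) = edges[i], edges[j]
        if rng.random() < 0.5:
            u2, v2 = v2, u2  # choose which end of the second edge takes part
        new_a, new_b = (u1, v2), (u2, v1)
        if u1 == v2 or u2 == v1:
            continue  # would create a self-loop
        if frozenset(new_a) in present or frozenset(new_b) in present:
            continue  # would create a parallel edge
        present -= {frozenset(edges[i]), frozenset(edges[j])}
        present |= {frozenset(new_a), frozenset(new_b)}
        edges[i], edges[j] = new_a, new_b
        done += 1
    return edges


def degrees(edges):
    """Degree of every vertex that appears in the edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


ring = [(k, (k + 1) % 10) for k in range(10)]
switched = edge_switch(ring, num_switches=20)
print(switched)
```

Every accepted switch reads and updates the shared set of existing edges, so successive switches depend on one another. That dependency is the source of the synchronization and communication cost the abstract describes for the parallel setting.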
Parallel Algorithms for Switching Edges in Heterogeneous Graphs

PubMed Central

Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav

2017-01-01

An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices are swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors. PMID:28757680
Scaling Up Coordinate Descent Algorithms for Large ℓ1 Regularization Problems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Scherrer, Chad; Halappanavar, Mahantesh; Tewari, Ambuj

2012-07-03

We present a generic framework for parallel coordinate descent (CD) algorithms that has as special cases the original sequential algorithms of Cyclic CD and Stochastic CD, as well as the recent parallel Shotgun algorithm of Bradley et al. We introduce two novel parallel algorithms that are also special cases (Thread-Greedy CD and Coloring-Based CD) and give performance measurements for an OpenMP implementation of these.
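The sequential Cyclic CD baseline mentioned in this abstract can be sketched for the ℓ1-regularized least-squares (lasso) problem. The choice of squared loss, the function names, and the fixed number of sweeps are assumptions for illustration; the parallel variants in the paper differ in how they select and update several coordinates at once.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal step for the l1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def cyclic_cd_lasso(X, y, lam, sweeps=100):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n_rows, n_cols = len(X), len(X[0])
    w = [0.0] * n_cols
    residual = list(y)  # residual = y - Xw, and w starts at zero
    for _ in range(sweeps):
        for j in range(n_cols):
            column = [X[i][j] for i in range(n_rows)]
            norm_sq = sum(c * c for c in column)
            if norm_sq == 0.0:
                continue
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(c * r for c, r in zip(column, residual)) + norm_sq * w[j]
            new_w = soft_threshold(rho, lam) / norm_sq
            delta = new_w - w[j]
            residual = [r - c * delta for r, c in zip(residual, column)]
            w[j] = new_w
    return w


X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
weights = cyclic_cd_lasso(X, y, lam=1.0)
print(weights)
```

Each coordinate update touches only one column and the shared residual. Parallel schemes must therefore either choose coordinates whose columns do not interact or tolerate the conflicts that arise when several updates modify the residual at the same time.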
Family-Focused Therapy for Bipolar Disorder: Reflections on 30 Years of Research.

PubMed

Miklowitz, David J; Chung, Bowen

2016-09-01

Family-focused therapy (FFT) is an evidence-based intervention for adults and children with bipolar disorder (BD) and their caregivers, usually given in conjunction with pharmacotherapy after an illness episode. The treatment consists of conjoint sessions of psychoeducation regarding bipolar illness, communication enhancement training, and problem-solving skills training. This paper summarizes over 30 years of research on FFT and family processes in BD. Across eight randomized controlled trials with adults and adolescents with BD, FFT and mood-stabilizing medications have been found to hasten recovery from mood episodes, reduce recurrences, and reduce levels of symptom severity compared to briefer forms of psychoeducation and medications over 1-2 years. Several studies indicate that the effects of FFT on symptom improvement are greater among patients with high-expressed emotion relatives. New research focuses on FFT as an early intervention for youth at risk for BD, neuroimaging as a means of evaluating treatment mechanisms, and progress in implementing FFT in community mental health settings. © 2016 Family Process Institute.
Parallel language constructs for tensor product computations on loosely coupled architectures

NASA Technical Reports Server (NTRS)

Mehrotra, Piyush; Van Rosendale, John

1989-01-01

A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. The authors focus on tensor product array computations, a simple but important class of numerical algorithms. They consider first the problem of programming one-dimensional kernel routines, such as parallel tridiagonal solvers, and then look at how such parallel kernels can be combined to form parallel tensor product algorithms.
Big Data: A Parallel Particle Swarm Optimization-Back-Propagation Neural Network Algorithm Based on MapReduce.

PubMed

Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan

2016-01-01

A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network's initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data.
Innovative SETI by the KLT

NASA Astrophysics Data System (ADS)

Maccone, C.

SETI searches are, by definition, the extraction of very weak radio signals out of the cosmic background noise. When SETI was born in 1959, it was "natural" to attempt this extraction by the only detection algorithm well known at the time: the Fourier Transform (FT). In fact: 1) SETI radio astronomers had adopted the viewpoint that a candidate ET signal would necessarily be a sinusoidal carrier, i.e. a very narrow-band signal. Over such a narrow band, the background noise is necessarily white. And so, the basic assumption behind the FT that the background noise must be white was "perfectly matched" to SETI for the next fifty years! 2) In addition, the Americans J. W. Cooley and J. W. Tukey discovered in April 1965 that all the FT computations could be sped up to N*ln(N) (rather than N^2) (N is the number of numbers to be processed) by their own Fast Fourier Transform (FFT). Then, SETI radio astronomers all over the world gladly and unquestioningly adopted the new FFT forever. In 1983, however, the French SETI radio astronomer François Biraud dared to challenge this view (ref. [6]). He argued that we only can make guesses about ET's telecommunication systems, and that the shifting trend on Earth was from narrow-band to wide-band telecommunications. Thus, a new transform, other than the FFT, was needed that could detect signals over both narrow and wide bands, regardless of the colored noise distribution over any finite bandwidth. Such a transform had actually been pointed out as early as 1946 by the Finnish mathematician Kari Karhunen and the French mathematician Michel Loève, and is thus named KLT for them. In conclusion, François Biraud suggested to "look for the unknown in SETI" by adopting the KLT rather than the FFT. The same ideas were reached independently by this author also, and starting in 1987, he too was "preaching the KLT": first at the SETI Institute, then (since 1990) at the Italian CNR (now called INAF) SETI facilities at Medicina, near Bologna.
Their director, Stelio Montebugnoli, was willing to pay attention to him. Little by little, bright students succeeded in programming the KLT algorithm for the Medicina radio telescopes. Finally, by the year 2000, the advent of programmable cards, mastered by Montebugnoli, made the "miracle" happen. The KLT for SETI is now a reality at the SETI-Italia facilities and for the first time in history. This paper describes the KLT with a final section devoted to the advantages of installing the KLT on LOFAR and the SKA, i.e. to detecting leakage from nearby stars. Bursts, Pulses and Flickering: Wide-field monitoring of the dynamic radio sky Kerastari, Tripolis, Greece 12-15 June, 2007
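The operation-count difference between the direct Fourier transform and the FFT mentioned in the abstract above can be illustrated with a short sketch. The direct sum costs on the order of N^2 operations, while the recursive radix-2 Cooley-Tukey scheme costs on the order of N log N. The function names `dft` and `fft` are chosen here for illustration, the input length is assumed to be a power of two, and the KLT itself is not shown.

```python
import cmath

def dft(x):
    """Direct discrete Fourier transform: about N^2 multiply-adds."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT: about N log N operations.

    Assumes len(x) is a power of two.
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```

Both functions compute the same transform; only the amount of arithmetic differs.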
Multirate-based fast parallel algorithms for 2-D DHT-based real-valued discrete Gabor transform.

PubMed

Tao, Liang; Kwan, Hon Keung

2012-07-01

Novel algorithms for the multirate and fast parallel implementation of the 2-D discrete Hartley transform (DHT)-based real-valued discrete Gabor transform (RDGT) and its inverse transform are presented in this paper. A 2-D multirate-based analysis convolver bank is designed for the 2-D RDGT, and a 2-D multirate-based synthesis convolver bank is designed for the 2-D inverse RDGT. The parallel channels in each of the two convolver banks have a unified structure and can apply the 2-D fast DHT algorithm to speed up their computations. The computational complexity of each parallel channel is low and is independent of the Gabor oversampling rate. All the 2-D RDGT coefficients of an image are computed in parallel during the analysis process and can be reconstructed in parallel during the synthesis process. The computational complexity and time of the proposed parallel algorithms are analyzed and compared with those of the existing fastest algorithms for 2-D discrete Gabor transforms. The results indicate that the proposed algorithms are the fastest, which make them attractive for real-time image processing.
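For readers unfamiliar with the transform underlying the abstract above, the following is a minimal 1-D sketch of the discrete Hartley transform (DHT), which maps real inputs to real outputs using the kernel cas(θ) = cos(θ) + sin(θ). It is a direct O(N^2) evaluation, not the fast or 2-D multirate algorithms of the paper, and the function name `dht` is an assumption made here.

```python
import math

def dht(x):
    """Direct 1-D discrete Hartley transform of a real sequence.

    H[k] = sum_t x[t] * (cos(2*pi*k*t/N) + sin(2*pi*k*t/N))
    """
    n = len(x)
    out = []
    for k in range(n):
        acc = 0.0
        for t in range(n):
            angle = 2.0 * math.pi * k * t / n
            acc += x[t] * (math.cos(angle) + math.sin(angle))
        out.append(acc)
    return out
```

The DHT is its own inverse up to a factor of 1/N, so applying `dht` twice and dividing by the length recovers the input, and no complex arithmetic is needed at any point.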
An efficient parallel algorithm for matrix-vector multiplication

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hendrickson, B.; Leland, R.; Plimpton, S.

The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/√p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
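A communication cost that scales with n/√p suggests a two-dimensional block decomposition of the matrix over a √p x √p processor grid. The sketch below emulates such a decomposition serially; it is an assumption-laden illustration rather than the hypercube algorithm of the paper, and the name `block_matvec` and the grid parameter `q` (standing in for √p) are hypothetical.

```python
def block_matvec(A, x, q):
    """Serial emulation of y = A x on a q x q grid of processors.

    Processor (I, J) owns the b x b block A[I*b:(I+1)*b, J*b:(J+1)*b]
    and needs only the length-b segment x[J*b:(J+1)*b].
    """
    n = len(A)
    assert n % q == 0, "q must divide the matrix dimension"
    b = n // q
    y = [0.0] * n
    for I in range(q):
        for J in range(q):
            # Local work of processor (I, J): a b x b block times a segment.
            partial = [
                sum(A[I * b + i][J * b + j] * x[J * b + j] for j in range(b))
                for i in range(b)
            ]
            # In a real code this accumulation is a reduction along grid row I.
            for i in range(b):
                y[I * b + i] += partial[i]
    return y
```

In this layout each processor handles vector segments of length n/q, which is where an n/√p term in the communication volume would come from.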
Controlling the numerical Cerenkov instability in PIC simulations using a customized finite difference Maxwell solver and a local FFT based current correction

DOE PAGES

Li, Fei; Yu, Peicheng; Xu, Xinlu; ...

2017-01-12

In this study we present a customized finite-difference-time-domain (FDTD) Maxwell solver for the particle-in-cell (PIC) algorithm. The solver is customized to effectively eliminate the numerical Cerenkov instability (NCI) which arises when a plasma (neutral or non-neutral) relativistically drifts on a grid when using the PIC algorithm. We control the EM dispersion curve in the direction of the plasma drift of a FDTD Maxwell solver by using a customized higher order finite difference operator for the spatial derivative along the direction of the drift (1̂ direction). We show that this eliminates the main NCI modes with moderate |k₁|, while keeping additional main NCI modes well outside the range of physical interest with higher |k₁|. These main NCI modes can be easily filtered out along with first spatial aliasing NCI modes which are also at the edge of the fundamental Brillouin zone. The customized solver has the possible advantage of improved parallel scalability because it can be easily partitioned along 1̂ which typically has many more cells than other directions for the problems of interest. We show that FFTs can be performed locally to current on each partition to filter out the main and first spatial aliasing NCI modes, and to correct the current so that it satisfies the continuity equation for the customized spatial derivative. This ensures that Gauss’ Law is satisfied. Lastly, we present simulation examples of one relativistically drifting plasma, of two colliding relativistically drifting plasmas, and of nonlinear laser wakefield acceleration (LWFA) in a Lorentz boosted frame that show that no evidence of the NCI can be observed when using this customized Maxwell solver together with its NCI elimination scheme.
Controlling the numerical Cerenkov instability in PIC simulations using a customized finite difference Maxwell solver and a local FFT based current correction

NASA Astrophysics Data System (ADS)

Li, Fei; Yu, Peicheng; Xu, Xinlu; Fiuza, Frederico; Decyk, Viktor K.; Dalichaouch, Thamine; Davidson, Asher; Tableman, Adam; An, Weiming; Tsung, Frank S.; Fonseca, Ricardo A.; Lu, Wei; Mori, Warren B.

2017-05-01

In this paper we present a customized finite-difference-time-domain (FDTD) Maxwell solver for the particle-in-cell (PIC) algorithm. The solver is customized to effectively eliminate the numerical Cerenkov instability (NCI) which arises when a plasma (neutral or non-neutral) relativistically drifts on a grid when using the PIC algorithm. We control the EM dispersion curve in the direction of the plasma drift of a FDTD Maxwell solver by using a customized higher order finite difference operator for the spatial derivative along the direction of the drift (1̂ direction). We show that this eliminates the main NCI modes with moderate |k₁|, while keeping additional main NCI modes well outside the range of physical interest with higher |k₁|. These main NCI modes can be easily filtered out along with first spatial aliasing NCI modes which are also at the edge of the fundamental Brillouin zone. The customized solver has the possible advantage of improved parallel scalability because it can be easily partitioned along 1̂ which typically has many more cells than other directions for the problems of interest. We show that FFTs can be performed locally to current on each partition to filter out the main and first spatial aliasing NCI modes, and to correct the current so that it satisfies the continuity equation for the customized spatial derivative. This ensures that Gauss' Law is satisfied. We present simulation examples of one relativistically drifting plasma, of two colliding relativistically drifting plasmas, and of nonlinear laser wakefield acceleration (LWFA) in a Lorentz boosted frame that show that no evidence of the NCI can be observed when using this customized Maxwell solver together with its NCI elimination scheme.
Controlling the numerical Cerenkov instability in PIC simulations using a customized finite difference Maxwell solver and a local FFT based current correction

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, Fei; Yu, Peicheng; Xu, Xinlu

In this study we present a customized finite-difference-time-domain (FDTD) Maxwell solver for the particle-in-cell (PIC) algorithm. The solver is customized to effectively eliminate the numerical Cerenkov instability (NCI) which arises when a plasma (neutral or non-neutral) relativistically drifts on a grid when using the PIC algorithm. We control the EM dispersion curve in the direction of the plasma drift of a FDTD Maxwell solver by using a customized higher order finite difference operator for the spatial derivative along the direction of the drift (1̂ direction). We show that this eliminates the main NCI modes with moderate |k₁|, while keeping additional main NCI modes well outside the range of physical interest with higher |k₁|. These main NCI modes can be easily filtered out along with first spatial aliasing NCI modes which are also at the edge of the fundamental Brillouin zone. The customized solver has the possible advantage of improved parallel scalability because it can be easily partitioned along 1̂ which typically has many more cells than other directions for the problems of interest. We show that FFTs can be performed locally to current on each partition to filter out the main and first spatial aliasing NCI modes, and to correct the current so that it satisfies the continuity equation for the customized spatial derivative. This ensures that Gauss’ Law is satisfied. Lastly, we present simulation examples of one relativistically drifting plasma, of two colliding relativistically drifting plasmas, and of nonlinear laser wakefield acceleration (LWFA) in a Lorentz boosted frame that show that no evidence of the NCI can be observed when using this customized Maxwell solver together with its NCI elimination scheme.
Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Detrixhe, Miles, E-mail: mdetrixhe@engineering.ucsb.edu; University of California Santa Barbara, Santa Barbara, CA, 93106; Gibou, Frédéric, E-mail: fgibou@engineering.ucsb.edu

The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence and parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
Optimal Design of Passive Power Filters Based on Pseudo-parallel Genetic Algorithm

NASA Astrophysics Data System (ADS)

Li, Pei; Li, Hongbo; Gao, Nannan; Niu, Lin; Guo, Liangfeng; Pei, Ying; Zhang, Yanyan; Xu, Minmin; Chen, Kerui

2017-05-01

The economic costs together with filter efficiency are taken as targets to optimize the parameter of passive filter. Furthermore, the method of combining pseudo-parallel genetic algorithm with adaptive genetic algorithm is adopted in this paper. In the early stages pseudo-parallel genetic algorithm is introduced to increase the population diversity, and adaptive genetic algorithm is used in the late stages to reduce the workload. At the same time, the migration rate of pseudo-parallel genetic algorithm is improved to change with population diversity adaptively. Simulation results show that the filter designed by the proposed method has better filtering effect with lower economic cost, and can be used in engineering.
Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm

NASA Technical Reports Server (NTRS)

Povitsky, A.

1998-01-01

In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines. This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty by about a factor of two compared with the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
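The forward and backward steps referred to above are the two sweeps of the serial Thomas algorithm for tridiagonal systems. A minimal serial sketch follows; the pipelined, multi-processor reformulation of the paper is not shown, and the function name `thomas` and its argument layout are assumptions made here.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    No pivoting is done, so the matrix is assumed to be well behaved
    (for example, diagonally dominant).
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    # Forward step: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # Backward step: back substitution from the last unknown.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The backward step cannot start until the forward step has reached the last row of a line, which is the source of the processor idle time that the reformulated pipelined version tries to fill with other work.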
Multi-core and GPU accelerated simulation of a radial star target imaged with equivalent t-number circular and Gaussian pupils

NASA Astrophysics Data System (ADS)

Greynolds, Alan W.

2013-09-01

Results from the GelOE optical engineering software are presented for the through-focus, monochromatic coherent and polychromatic incoherent imaging of a radial "star" target for equivalent t-number circular and Gaussian pupils. The FFT-based simulations are carried out using OpenMP threading on a multi-core desktop computer, with and without the aid of a many-core NVIDIA GPU accessing its cuFFT library. It is found that a custom FFT optimized for the 12-core host has similar performance to a simply implemented 256-core GPU FFT. A more sophisticated version of the latter but tuned to reduce overhead on a 448-core GPU is 20 to 28 times faster than a basic FFT implementation running on one CPU core.
Computationally efficient algorithm for high sampling-frequency operation of active noise control

NASA Astrophysics Data System (ADS)

Rout, Nirmal Kumar; Das, Debi Prasad; Panda, Ganapati

2015-05-01

In high sampling-frequency operation of an active noise control (ANC) system, the secondary path estimate and the ANC filter are very long. This increases the computational complexity of the conventional filtered-x least mean square (FXLMS) algorithm. To reduce the computational complexity of long-order ANC systems using the FXLMS algorithm, frequency domain block ANC algorithms have been proposed in the past. These full block frequency domain ANC algorithms have some disadvantages, such as large block delay, quantization error due to computation of large size transforms, and implementation difficulties in existing low-end DSP hardware. To overcome these shortcomings, a partitioned block ANC algorithm is proposed in which the long filters in ANC are divided into a number of equal partitions and suitably assembled to perform the FXLMS algorithm in the frequency domain. The complexity of this proposed frequency domain partitioned block FXLMS (FPBFXLMS) algorithm is considerably lower than that of the conventional FXLMS algorithm. It is further reduced by merging one fast Fourier transform (FFT)-inverse fast Fourier transform (IFFT) combination to derive the reduced structure FPBFXLMS (RFPBFXLMS) algorithm. Computational complexity analysis for different filter orders and partition sizes is presented. Systematic computer simulations are carried out for both proposed partitioned block ANC algorithms to show their accuracy compared to the time domain FXLMS algorithm.
Parallel optimization algorithms and their implementation in VLSI design

NASA Technical Reports Server (NTRS)

Lee, G.; Feeley, J. J.

1991-01-01

Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.
A parallel time integrator for noisy nonlinear oscillatory systems

NASA Astrophysics Data System (ADS)

Subber, Waad; Sarkar, Abhijit

2018-06-01

In this paper, we adapt a parallel time integration scheme to track the trajectories of noisy non-linear dynamical systems. Specifically, we formulate a parallel algorithm to generate the sample path of a nonlinear oscillator defined by stochastic differential equations (SDEs) using the so-called parareal method for ordinary differential equations (ODEs). The presence of a Wiener process in SDEs causes difficulties in the direct application of any numerical integration techniques of ODEs, including the parareal algorithm. The parallel implementation of the algorithm involves two SDE solvers, namely a fine-level scheme to integrate the system in parallel and a coarse-level scheme to generate and correct the required initial conditions to start the fine-level integrators. For the numerical illustration, a randomly excited Duffing oscillator is investigated in order to study the performance of the stochastic parallel algorithm with respect to a range of system parameters. The distributed implementation of the algorithm exploits the Message Passing Interface (MPI).
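The interplay of the coarse-level and fine-level schemes described above can be sketched for a deterministic scalar ODE. This is the classical parareal iteration only; the stochastic (Wiener process) treatment and the MPI implementation of the paper are not reproduced. The names `euler` and `parareal`, and the choice of explicit Euler for both propagators, are assumptions made here.

```python
def euler(f, t0, t1, y, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with explicit Euler."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        y = y + h * f(t, y)
        t += h
    return y

def parareal(f, y0, t0, t1, slices, iters, fine_steps=100):
    """Classical parareal iteration for a scalar ODE (serial emulation).

    Returns the solution values at the boundaries of the time slices.
    """
    ts = [t0 + (t1 - t0) * n / slices for n in range(slices + 1)]

    def coarse(n, y):  # cheap propagator over slice n: one Euler step
        return euler(f, ts[n], ts[n + 1], y, 1)

    def fine(n, y):    # accurate propagator over slice n: many Euler steps
        return euler(f, ts[n], ts[n + 1], y, fine_steps)

    # Initial guess from a serial coarse sweep.
    U = [y0]
    for n in range(slices):
        U.append(coarse(n, U[n]))
    for _ in range(iters):
        # These fine solves are independent and could run in parallel.
        fine_vals = [fine(n, U[n]) for n in range(slices)]
        coarse_old = [coarse(n, U[n]) for n in range(slices)]
        # Serial correction sweep that supplies new initial conditions.
        new = [y0]
        for n in range(slices):
            new.append(coarse(n, new[n]) + fine_vals[n] - coarse_old[n])
        U = new
    return U
```

After as many iterations as there are time slices, the result matches a serial run of the fine propagator up to rounding, so a speedup is only possible when far fewer iterations are sufficient.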

A parallel variable metric optimization algorithm

NASA Technical Reports Server (NTRS)

Straeter, T. A.

1973-01-01

An algorithm, designed to exploit the parallel computing or vector streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, then one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariant minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.
Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment.

PubMed

Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che

2014-01-16

To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. 
By coupling the parallel model population-based optimization method and the parallel computational framework, high quality solutions can be obtained within relatively short time. This integrated approach is a promising way for inferring large networks.
Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment

PubMed Central

2014-01-01

Background: To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results: This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions: Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation.
By coupling the parallel model population-based optimization method and the parallel computational framework, high quality solutions can be obtained within relatively short time. This integrated approach is a promising way for inferring large networks. PMID:24428926
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)]

NASA Technical Reports Server (NTRS)

Straeter, T. A.; Markos, A. T.

1975-01-01

A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions ensuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
Rapid code acquisition algorithms employing PN matched filters

NASA Technical Reports Server (NTRS)

Su, Yu T.

1988-01-01

The performance of four algorithms using pseudonoise matched filters (PNMFs), for direct-sequence spread-spectrum systems, is analyzed. They are: parallel search with fixed dwell detector (PL-FDD), parallel search with sequential detector (PL-SD), parallel-serial search with fixed dwell detector (PS-FDD), and parallel-serial search with sequential detector (PS-SD). The operating characteristic for each detector and the mean acquisition time for each algorithm are derived. All the algorithms are studied in conjunction with the noncoherent integration technique, which enables the system to operate in the presence of data modulation. Several previous proposals using PNMF are seen as special cases of the present algorithms.
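The core operation of a PN matched filter, correlating the received chips against the local code at every candidate phase, can be sketched as follows. The example is noise-free and ignores data modulation, noncoherent integration, and the fixed dwell and sequential detector logic analyzed in the paper; the names `circular_correlation` and `acquire` are hypothetical.

```python
def circular_correlation(received, code):
    """Correlate one period of received chips with every cyclic shift of the code.

    Entry s is the matched-filter output for the hypothesis that the
    received code is delayed by s chips.
    """
    n = len(code)
    return [
        sum(received[i] * code[(i - s) % n] for i in range(n))
        for s in range(n)
    ]

def acquire(received, code):
    """Return the code phase (in chips) with the largest correlation."""
    corr = circular_correlation(received, code)
    return max(range(len(code)), key=lambda s: corr[s])
```

Testing all shifts at once corresponds to the parallel search strategies, while testing groups of shifts in turn corresponds to the parallel-serial ones. With a ±1 code the correct phase yields a correlation equal to the code length.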
Algorithms and programming tools for image processing on the MPP

NASA Technical Reports Server (NTRS)

Reeves, A. P.

1985-01-01

Topics addressed include: data mapping and rotational algorithms for the Massively Parallel Processor (MPP); Parallel Pascal language; documentation for the Parallel Pascal Development system; and a description of the Parallel Pascal language used on the MPP.
Big Data: A Parallel Particle Swarm Optimization-Back-Propagation Neural Network Algorithm Based on MapReduce

PubMed Central

Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan

2016-01-01

A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network’s initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data. PMID:27304987
Applications and accuracy of the parallel diagonal dominant algorithm

NASA Technical Reports Server (NTRS)

Sun, Xian-He

1993-01-01

The Parallel Diagonal Dominant (PDD) algorithm is a highly efficient, ideally scalable tridiagonal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm is introduced. Then the algorithm is extended to solve periodic tridiagonal systems. A variant, the reduced PDD algorithm, is also proposed. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric, and anti-symmetric Toeplitz tridiagonal systems. Implementation results show that the analysis gives a good bound on the relative error, and the algorithm is a good candidate for the emerging massively parallel machines.
Investigating the Microbial Degradation Potential in Oil Sands Fluid Fine Tailings Using Gamma Irradiation: A Metagenomic Perspective.

PubMed

VanMensel, Danielle; Chaganti, Subba Rao; Boudens, Ryan; Reid, Thomas; Ciborowski, Jan; Weisener, Christopher

2017-08-01

Open-pit mining of the Athabasca oil sands has generated large volumes of waste termed fluid fine tailings (FFT), stored in tailings ponds. Accumulation of toxic organic substances in the tailings ponds is one of the biggest concerns. Gamma irradiation (GI) treatment could accelerate the biodegradation of toxic organic substances. Hence, this research investigates the response of the microbial consortia in GI-treated FFT materials with an emphasis on changes in diversity and organism-related stimuli. FFT materials from aged and fresh ponds were used in the study under aerobic and anaerobic conditions. Variations in the microbial diversity in GI-treated FFT materials were monitored for 52 weeks and significant stimuli (p < 0.05) were observed. Chemoorganotrophic organisms dominated in fresh and aged ponds and showed increased relative abundance resulting from GI treatment. GI-treated anaerobic FFT aged reported stimulus of organisms with biodegradation potential (e.g., Pseudomonas, Enterobacter) and methylotrophic capabilities (e.g., Syntrophus, Smithella). In comparison, GI-treated anaerobic FFT fresh stimulated Desulfuromonas as the principal genus at 52 weeks. Under aerobic conditions, GI-treated FFT aged showed stimulation of organisms capable of sulfur and iron cycling (e.g., Geobacter). However, GI-treated aerobic FFT fresh showed no stimulus at 52 weeks. This research provides an enhanced understanding of oil sands tailings biogeochemistry and the impacts of GI treatment on microorganisms as an effect for targeting toxic organics. The outcomes of this study highlight the potential for this approach to accelerate stabilization and reclamation end points.
Parallel Computing Strategies for Irregular Algorithms

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

2002-01-01

Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Laboratory for Engineering Man/Machine Systems (LEMS): System identification, model reduction and deconvolution filtering using Fourier based modulating signals and high order statistics

NASA Technical Reports Server (NTRS)

Pan, Jianqiang

1992-01-01

Several important problems in the fields of signal processing and model identification are addressed, including system structure identification, frequency response determination, high order model reduction, high resolution frequency analysis, and deconvolution filtering. Each of these topics involves a wide range of applications and has received considerable attention. Using the Fourier based sinusoidal modulating signals, it is shown that a discrete autoregressive model can be constructed for the least squares identification of continuous systems. Some identification algorithms are presented for both SISO and MIMO systems frequency response determination using only transient data. Also, several new schemes for model reduction were developed. Based upon the complex sinusoidal modulating signals, a parametric least squares algorithm for high resolution frequency estimation is proposed. Numerical examples show that the proposed algorithm gives better performance than the usual methods. Also, the problem of deconvolution and parameter identification of a general noncausal nonminimum phase ARMA system driven by non-Gaussian stationary random processes was studied. Algorithms are introduced for inverse cumulant estimation, both in the frequency domain via the FFT algorithms and in the time domain via the least squares algorithm.
High-performance compression and double cryptography based on compressive ghost imaging with the fast Fourier transform

NASA Astrophysics Data System (ADS)

Leihong, Zhang; Zilan, Pan; Luying, Wu; Xiuhua, Ma

2016-11-01

To address the difficulty of retrieving large images under stringent hardware restrictions and the low security level of existing schemes, a method based on compressive ghost imaging (CGI) with the Fast Fourier Transform (FFT) is proposed, named FFT-CGI. Initially, the information is encrypted by the sender with the FFT, and the FFT-coded image is encrypted by the CGI system with a secret key. Then the receiver decrypts the image with the aid of compressive sensing (CS) and the FFT. Simulation results are given to verify the feasibility, security, and compression of the proposed encryption scheme. The experiment suggests that, compared with conventional ghost imaging, the method can improve the quality of large images and achieve imaging of large-sized images; that the amount of data transmitted is greatly reduced because of the combination of compressive sensing and the FFT; and that the security level of ghost images is improved against ciphertext-only attack (COA), chosen-plaintext attack (CPA), and noise attack. This technique can be immediately applied to encryption and data storage with the advantages of high security, fast transmission, and high quality of reconstructed information.
Dependence of Adaptive Cross-correlation Algorithm Performance on the Extended Scene Image Quality

NASA Technical Reports Server (NTRS)

Sidick, Erkin

2008-01-01

Recently, we reported an adaptive cross-correlation (ACC) algorithm to estimate with high accuracy shifts as large as several pixels between two extended-scene sub-images captured by a Shack-Hartmann wavefront sensor. It determines the positions of all extended-scene image cells relative to a reference cell in the same frame using an FFT-based iterative image-shifting algorithm. It works with both point-source spot images and extended scene images. We have demonstrated previously, based on some measured images, that the ACC algorithm can determine image shifts with an accuracy as high as 0.01 pixel for shifts as large as 3 pixels, and yields similar results for both point-source spot images and extended scene images. The shift estimate accuracy of the ACC algorithm depends on illumination level, background, and scene content in addition to the amount of the shift between two image cells. In this paper we investigate how the performance of the ACC algorithm depends on the quality and the frequency content of extended scene images captured by a Shack-Hartmann camera. We also compare the performance of the ACC algorithm with those of several other approaches, and introduce a failsafe criterion for ACC algorithm-based extended scene Shack-Hartmann sensors.
Fast computation of the kurtogram for the detection of transient faults

NASA Astrophysics Data System (ADS)

Antoni, Jérôme

2007-01-01

The kurtogram is a fourth-order spectral analysis tool recently introduced for detecting and characterising non-stationarities in a signal. The paradigm relies on the assertion that each type of transient is associated with an optimal (frequency/frequency resolution) dyad { f,Δf} which maximises its kurtosis, and hence its detection. However, the complete exploration of the whole plane ( f,Δf) is a formidable task hardly amenable to on-line industrial applications. In this communication we describe a fast algorithm for computing the kurtogram over a grid that finely samples the ( f,Δf) plane. Its complexity is on the order of N log N, similarly to the FFT. The efficiency of the algorithm is then illustrated on several industrial cases concerned with the detection of incipient transient faults.
Sensor Authentication: Embedded Processor Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Svoboda, John

2012-09-25

Described is the C code running on the embedded Microchip 32-bit PIC32MX575F256H located on the INL-developed noise analysis circuit board. The code performs the following functions: controls the noise analysis circuit board preamplifier voltage gains of 1, 10, 100, 000; initializes the analog-to-digital conversion hardware, input channel selection, Fast Fourier Transform (FFT) function, USB communications interface, and internal memory allocations; initiates high-resolution 4096-point 200 kHz data acquisition; computes the complex 2048-point FFT and FFT magnitude; services the host command set; transfers raw data to the host; transfers the FFT result to the host; and performs communication error checking.
STS-48 Pilot Reightler and MS Brown, in LESs, stand at JSC FFT side hatch

NASA Technical Reports Server (NTRS)

1991-01-01

STS-48 Discovery, Orbiter Vehicle (OV) 103, Pilot Kenneth S. Reightler, Jr (left) and Mission Specialist (MS) Mark N. Brown, wearing launch and entry suits (LESs), stand at the side hatch of JSC's full fuselage trainer (FFT). The crewmembers will enter the FFT shuttle mockup through the side hatch and take their assigned descent (landing) positions in the crew cabin. Reightler and Brown, along with the other crewmembers, are participating in a post-landing emergency egress exercise. The FFT is located in the Mockup and Integration Laboratory (MAIL) Bldg 9A.
Mediaprocessors in medical imaging for high performance and flexibility

NASA Astrophysics Data System (ADS)

Managuli, Ravi; Kim, Yongmin

2002-05-01

New high performance programmable processors, called mediaprocessors, have been emerging since the early 1990s for various digital media applications, such as digital TV, set-top boxes, desktop video conferencing, and digital camcorders. Modern mediaprocessors, e.g., TI's TMS320C64x and Hitachi/Equator Technologies MAP-CA, can offer high performance utilizing both instruction-level and data-level parallelism. During this decade, with continued performance improvement and cost reduction, we believe that the mediaprocessors will become a preferred choice in designing imaging and video systems due to their flexibility in incorporating new algorithms and applications via programming and faster-time-to-market. In this paper, we will evaluate the suitability of these mediaprocessors in medical imaging. We will review the core routines of several medical imaging modalities, such as ultrasound and DR, and present how these routines can be mapped to mediaprocessors and their resultant performance. We will analyze the architecture of several leading mediaprocessors. By carefully mapping key imaging routines, such as 2D convolution, unsharp masking, and 2D FFT, to the mediaprocessor, we have been able to achieve comparable (if not better) performance to that of traditional hardwired approaches. Thus, we believe that future medical imaging systems will benefit greatly from these advanced mediaprocessors, offering significantly increased flexibility and adaptability, reducing the time-to-market, and improving the cost/performance ratio compared to the existing systems while meeting the high computing requirements.
Sublattice parallel replica dynamics.

PubMed

Martínez, Enrique; Uberuaga, Blas P; Voter, Arthur F

2014-06-01

Exascale computing presents a challenge for the scientific community as new algorithms must be developed to take full advantage of the new computing paradigm. Atomistic simulation methods that offer full fidelity to the underlying potential, i.e., molecular dynamics (MD) and parallel replica dynamics, fail to use the whole machine speedup, leaving a region in time and sample size space that is unattainable with current algorithms. In this paper, we present an extension of the parallel replica dynamics algorithm [A. F. Voter, Phys. Rev. B 57, R13985 (1998)] by combining it with the synchronous sublattice approach of Shim and Amar [Phys. Rev. B 71, 125432 (2005)], thereby exploiting event locality to improve the algorithm scalability. This algorithm is based on a domain decomposition in which events happen independently in different regions in the sample. We develop an analytical expression for the speedup given by this sublattice parallel replica dynamics algorithm and compare it with parallel MD and traditional parallel replica dynamics. We demonstrate how this algorithm, which introduces a slight additional approximation of event locality, enables the study of physical systems unreachable with traditional methodologies and promises to better utilize the resources of current high performance and future exascale computers.
Parallel Algorithms and Patterns

DOE Office of Scientific and Technical Information (OSTI.GOV)

Robey, Robert W.

2016-06-16

This is a powerpoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. It really deserves its own detailed discussion which Gabe Rockefeller would like to develop.
Parallel computing of physical maps--a comparative study in SIMD and MIMD parallelism.

PubMed

Bhandarkar, S M; Chirravuri, S; Arnold, J

1996-01-01

Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048 processor MasPar MP-2 system which is an SIMD 2-D toroidal mesh architecture whereas the MIMD algorithms are implemented on an 8 processor Intel iPSC/860 which is an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.

Exact parallel algorithms for some members of the traveling salesman problem family

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pekny, J.F.

1989-01-01

The traveling salesman problem and its many generalizations comprise one of the best known combinatorial optimization problem families. Most members of the family are NP-complete problems so that exact algorithms require an unpredictable and sometimes large computational effort. Parallel computers offer hope for providing the power required to meet these demands. A major barrier to applying parallel computers is the lack of parallel algorithms. The contributions presented in this thesis center around new exact parallel algorithms for the asymmetric traveling salesman problem (ATSP), prize collecting traveling salesman problem (PCTSP), and resource constrained traveling salesman problem (RCTSP). The RCTSP is a particularly difficult member of the family since finding a feasible solution is an NP-complete problem. An exact sequential algorithm is also presented for the directed Hamiltonian cycle problem (DHCP). The DHCP algorithm is superior to current heuristic approaches and represents the first exact method applicable to large graphs. Computational results presented for each of the algorithms demonstrate the effectiveness of combining efficient algorithms with parallel computing methods. Performance statistics are reported for randomly generated ATSPs with 7,500 cities, PCTSPs with 200 cities, RCTSPs with 200 cities, DHCPs with 3,500 vertices, and assignment problems of size 10,000. Sequential results were collected on a Sun 4/260 engineering workstation, while parallel results were collected using 14- and 100-processor BBN Butterfly Plus computers. The computational results represent the largest instances ever solved to optimality on any type of computer.
Noniterative MAP reconstruction using sparse matrix representations.

PubMed

Cao, Guangzhi; Bouman, Charles A; Webb, Kevin J

2009-09-01

We present a method for noniterative maximum a posteriori (MAP) tomographic reconstruction which is based on the use of sparse matrix representations. Our approach is to precompute and store the inverse matrix required for MAP reconstruction. This approach has generally not been used in the past because the inverse matrix is typically large and fully populated (i.e., not sparse). In order to overcome this problem, we introduce two new ideas. The first idea is a novel theory for the lossy source coding of matrix transformations which we refer to as matrix source coding. This theory is based on a distortion metric that reflects the distortions produced in the final matrix-vector product, rather than the distortions in the coded matrix itself. The resulting algorithms are shown to require orthonormal transformations of both the measurement data and the matrix rows and columns before quantization and coding. The second idea is a method for efficiently storing and computing the required orthonormal transformations, which we call a sparse-matrix transform (SMT). The SMT is a generalization of the classical FFT in that it uses butterflies to compute an orthonormal transform; but unlike an FFT, the SMT uses the butterflies in an irregular pattern, and is numerically designed to best approximate the desired transforms. We demonstrate the potential of the noniterative MAP reconstruction with examples from optical tomography. The method requires offline computation to encode the inverse transform. However, once these offline computations are completed, the noniterative MAP algorithm is shown to reduce both storage and computation by well over two orders of magnitude, as compared to linear iterative reconstruction methods.
Feed-forward frequency offset estimation for 32-QAM optical coherent detection.

PubMed

Xiao, Fei; Lu, Jianing; Fu, Songnian; Xie, Chenhui; Tang, Ming; Tian, Jinwen; Liu, Deming

2017-04-17

Due to the non-rectangular distribution of the constellation points, traditional fast Fourier transform based frequency offset estimation (FFT-FOE) is no longer suitable for 32-QAM signals. Here, we report a modified FFT-FOE technique that selects and digitally amplifies the inner QPSK ring of 32-QAM after the adaptive equalization, which is defined as QPSK-selection assisted FFT-FOE. Simulation results show that no FOE error occurs with an FFT size of only 512 symbols when the signal-to-noise ratio (SNR) is above 17.5 dB using our proposed FOE technique. However, the error probability of the traditional FFT-FOE scheme for 32-QAM is always intolerable. Finally, our proposed FOE scheme functions well for a 10 Gbaud dual polarization (DP)-32-QAM signal to reach the 20% forward error correction (FEC) threshold of BER=2×10^-2, under the scenario of back-to-back (B2B) transmission.
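As a rough illustration of the classical FFT-FOE idea that the abstract builds on, the sketch below estimates a frequency offset for plain QPSK by raising the symbols to the fourth power (which strips the modulation) and locating the spectral peak. It is not the authors' QPSK-selection variant for 32-QAM. The function names, the noise-free test signal, and the use of a direct DFT instead of an FFT are all assumptions made to keep the example short.

```python
import cmath


def estimate_offset_qpsk(symbols):
    """Classical 4th-power frequency offset estimate, in cycles per symbol.

    Raising QPSK symbols to the 4th power removes the data modulation and
    leaves a tone at four times the offset; the DFT peak locates that tone.
    A direct O(n^2) DFT is used here for brevity; an FFT gives the same bins.
    """
    n = len(symbols)
    fourth = [s ** 4 for s in symbols]
    best_bin, best_mag = 0, -1.0
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += fourth[t] * cmath.exp(-2j * cmath.pi * k * t / n)
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    if best_bin >= n // 2:  # map upper bins to negative frequencies
        best_bin -= n
    return best_bin / (4 * n)


def make_qpsk(n, offset):
    """Hypothetical noise-free QPSK test signal with a given offset."""
    out = []
    for t in range(n):
        phase = cmath.pi / 4 + (cmath.pi / 2) * ((7 * t + 3) % 4)
        out.append(cmath.exp(1j * phase) * cmath.exp(2j * cmath.pi * offset * t))
    return out


if __name__ == "__main__":
    n = 128
    print(estimate_offset_qpsk(make_qpsk(n, 3 / (4 * n))))
```

The resolution of this estimator is 1/(4n) cycles per symbol and its unambiguous range is limited to offsets smaller than one eighth of the symbol rate in magnitude, which is why the FFT size matters in the abstract's results.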
Extension of the frequency-domain pFFT method for wave structure interaction in finite depth

NASA Astrophysics Data System (ADS)

Teng, Bin; Song, Zhi-jie

2017-06-01

To analyze wave interaction with a large scale body in the frequency domain, a precorrected Fast Fourier Transform (pFFT) method has been proposed for infinite depth problems with the deep water Green function, as it can form a matrix with Toeplitz and Hankel properties. In this paper, a method is proposed to decompose the finite depth Green function into two terms, which can form matrices with the Toeplitz and Hankel properties, respectively. Then, a pFFT method for finite depth problems is developed. Based on the pFFT method, a numerical code pFFT-HOBEM is developed with the discretization of high order elements. The model is validated, and the computing efficiency and memory requirement of the new method have also been examined. The results show that the new method has the same advantages as that for infinite depth.
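The reason Toeplitz structure matters for FFT-based methods can be shown with a small sketch: a Toeplitz matrix embeds in a larger circulant matrix, and a circulant matrix-vector product is a circular convolution that the FFT evaluates in O(m log m). The code below is a generic illustration of that embedding for real-valued data, not the pFFT-HOBEM code; the function names and the simple recursive radix-2 FFT are assumptions.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def toeplitz_matvec(first_col, first_row, v):
    """Return T @ v for real data, where T[i][j] is first_col[i - j] when
    i >= j and first_row[j - i] otherwise."""
    n = len(v)
    m = 1
    while m < 2 * n:  # circulant size: a power of two, at least 2n
        m *= 2
    # First column of the m x m circulant whose top-left block equals T.
    c = [0j] * m
    for i in range(n):
        c[i] = first_col[i]
    for j in range(1, n):
        c[m - j] = first_row[j]
    padded = list(v) + [0] * (m - n)
    spectrum = [a * b for a, b in zip(fft(c), fft(padded))]
    full = fft(spectrum, inverse=True)
    return [(value / m).real for value in full[:n]]


if __name__ == "__main__":
    # T = [[1, 4, 5], [2, 1, 4], [3, 2, 1]]
    print(toeplitz_matvec([1, 2, 3], [1, 4, 5], [1, 0, -1]))
```

A Hankel matrix H with entries h[i + j] becomes Toeplitz once its columns are reversed, so H v can be computed with the same routine applied to the reversed vector, which is why a Toeplitz-plus-Hankel decomposition of the Green function is enough for an FFT-accelerated product.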
Computational mechanics analysis tools for parallel-vector supercomputers

NASA Technical Reports Server (NTRS)

Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.; Qin, J.

1993-01-01

Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigen-solution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization algorithm and domain decomposition. The source code for many of these algorithms is available from NASA Langley.
Concurrent computation of attribute filters on shared memory parallel machines.

PubMed

Wilkinson, Michael H F; Gao, Hui; Hesselink, Wim H; Jonker, Jan-Eppo; Meijster, Arnold

2008-10-01

Morphological attribute filters have not previously been parallelized, mainly because they are both global and non-separable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings and thickenings, based on Salembier's Max-Trees and Min-trees. The image or volume is first partitioned in multiple slices. We then compute the Max-trees of each slice using any sequential Max-Tree algorithm. Subsequently, the Max-trees of the slices can be merged to obtain the Max-tree of the image. A C-implementation yielded good speed-ups on both a 16-processor MIPS 14000 parallel machine, and a dual-core Opteron-based machine. It is shown that the speed-up of the parallel algorithm is a direct measure of the gain with respect to the sequential algorithm used. Furthermore, the concurrent algorithm shows a speed gain of up to 72 percent on a single-core processor, due to reduced cache thrashing.
Custom instruction set NIOS-based OFDM processor for FPGAs

NASA Astrophysics Data System (ADS)

Meyer-Bäse, Uwe; Sunkara, Divya; Castillo, Encarnacion; Garcia, Antonio

2006-05-01

The orthogonal frequency division multiplexing (OFDM) spread spectrum technique, sometimes also called multi-carrier or discrete multi-tone modulation, is used in bandwidth-efficient communication systems in the presence of channel distortion. The benefits of OFDM are high spectral efficiency, resiliency to RF interference, and lower multi-path distortion. OFDM is the basis for the European digital audio broadcasting (DAB) standard, the global asymmetric digital subscriber line (ADSL) standard, and the IEEE 802.11 5.8 GHz band standard, and is under ongoing development in wireless local area networks. The modulator and demodulator in an OFDM system can be implemented by use of a parallel bank of filters based on the discrete Fourier transform (DFT); when the number of subchannels is large (e.g. K > 25), the OFDM system is efficiently implemented by use of the fast Fourier transform (FFT) to compute the DFT. We have developed a custom FPGA-based Altera NIOS system to improve performance, programmability, and power consumption in mobile wireless systems. The overall gain observed for a 1024-point FFT ranges between a factor of 3 and 16, depending on the multiplier used by the NIOS processor. A careful optimization described in the appendix yields a performance gain of up to 77% when compared with our preliminary results.
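The DFT-based modulator and demodulator mentioned in the abstract can be sketched in a few lines: the transmitter maps one complex symbol per subchannel to time samples with an inverse DFT, and the receiver recovers them with a forward DFT. This is a minimal sketch over an ideal channel, with no cyclic prefix, no noise, and a direct DFT instead of an FFT; the function names are hypothetical and none of this reflects the NIOS implementation.

```python
import cmath


def idft(spectrum):
    """Inverse DFT with 1/n normalization (direct O(n^2) form)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def dft(samples):
    """Forward DFT (direct O(n^2) form)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def ofdm_modulate(subchannel_symbols):
    """Map one complex symbol per subchannel to one block of time samples."""
    return idft(subchannel_symbols)


def ofdm_demodulate(time_samples):
    """Recover the per-subchannel symbols from one block of time samples."""
    return dft(time_samples)


if __name__ == "__main__":
    symbols = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]
    recovered = ofdm_demodulate(ofdm_modulate(symbols))
    print([complex(round(s.real), round(s.imag)) for s in recovered])
```

Replacing the direct transforms with an FFT lowers the cost per block from O(K^2) to O(K log K), which is the motivation the abstract gives for using the FFT once the number of subchannels K is large.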
Regional-scale calculation of the LS factor using parallel processing

NASA Astrophysics Data System (ADS)

Liu, Kai; Tang, Guoan; Jiang, Ling; Zhu, A.-Xing; Yang, Jianyi; Song, Xiaodong

2015-05-01

With the increase of data resolution and the increasing application of USLE over large areas, the existing serial implementation of algorithms for computing the LS factor is becoming a bottleneck. In this paper, a parallel processing model based on the message passing interface (MPI) is presented for the calculation of the LS factor, so that massive datasets at a regional scale can be processed efficiently. The parallel model contains algorithms for calculating flow direction, flow accumulation, drainage network, slope, slope length and the LS factor. According to the existence of data dependence, the algorithms are divided into local algorithms and global algorithms. Parallel strategies are designed according to the characteristics of the algorithms, including a decomposition method for maintaining the integrity of the results, an optimized workflow for reducing the time taken for exporting unnecessary intermediate data, and a buffer-communication-computation strategy for improving the communication efficiency. Experiments on a multi-node system show that the proposed parallel model allows efficient calculation of the LS factor at a regional scale with a massive dataset.
A new scheduling algorithm for parallel sparse LU factorization with static pivoting

DOE Office of Scientific and Technical Information (OSTI.GOV)

Grigori, Laura; Li, Xiaoye S.

2002-08-20

In this paper we present a static scheduling algorithm for parallel sparse LU factorization with static pivoting. The algorithm is divided into mapping and scheduling phases, using the symmetric pruned graphs of L' and U to represent dependencies. The scheduling algorithm is designed for driving the parallel execution of the factorization on a distributed-memory architecture. Experimental results and comparisons with SuperLU_DIST are reported after applying this algorithm on real world application matrices on an IBM SP RS/6000 distributed memory machine.
Parallel conjugate gradient algorithms for manipulator dynamic simulation

NASA Technical Reports Server (NTRS)

Fijany, Amir; Scheld, Robert E.

1989-01-01

Parallel conjugate gradient algorithms for the computation of multibody dynamics are developed for the specialized case of a robot manipulator. For an n-dimensional positive-definite linear system, the Classical Conjugate Gradient (CCG) algorithms are guaranteed to converge in n iterations, each with a computation cost of O(n); this leads to a total computational cost of O(n^2) on a serial processor. Conjugate gradient algorithms are presented that provide greater efficiency by using a preconditioner, which reduces the number of iterations required, and by exploiting parallelism, which reduces the cost of each iteration. Two Preconditioned Conjugate Gradient (PCG) algorithms are proposed which respectively use a diagonal and a tridiagonal matrix, composed of the diagonal and tridiagonal elements of the mass matrix, as preconditioners. Parallel algorithms are developed to compute the preconditioners and their inversions in O(log_2 n) steps using n processors. A parallel algorithm is also presented which, on the same architecture, achieves the computational time of O(log_2 n) for each iteration. Simulation results for a seven degree-of-freedom manipulator are presented. Variants of the proposed algorithms are also developed which can be efficiently implemented on the Robot Mathematics Processor (RMP).
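The diagonally preconditioned variant described above can be illustrated with a short serial sketch. It uses the inverse of the matrix diagonal as the preconditioner on a small dense symmetric positive-definite system; the function name, tolerance, and dense list-of-lists storage are assumptions, and the parallel O(log_2 n) formulation and the tridiagonal preconditioner from the abstract are not reproduced here.

```python
def pcg_diagonal(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A with conjugate
    gradients preconditioned by diag(A). Returns (x, iterations_used)."""
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # preconditioner M^-1
    x = [0.0] * n
    r = list(b)                                   # residual b - A x with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)                                   # search direction
    rz = sum(r[i] * z[i] for i in range(n))
    for it in range(max_iter):
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            return x, it
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [6.0, 10.0, 8.0]  # equals A applied to [1, 2, 3]
    solution, iterations = pcg_diagonal(A, b)
    print(solution, iterations)
```

Each iteration is dominated by one matrix-vector product and a few inner products; those are the operations a parallel implementation would distribute across processors.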
GPU-accelerated non-uniform fast Fourier transform-based compressive sensing spectral domain optical coherence tomography.

PubMed

Xu, Daguang; Huang, Yong; Kang, Jin U

2014-06-16

We implemented the graphics processing unit (GPU) accelerated compressive sensing (CS) non-uniform in k-space spectral domain optical coherence tomography (SD OCT). Kaiser-Bessel (KB) function and Gaussian function are used independently as the convolution kernel in the gridding-based non-uniform fast Fourier transform (NUFFT) algorithm with different oversampling ratios and kernel widths. Our implementation is compared with the GPU-accelerated modified non-uniform discrete Fourier transform (MNUDFT) matrix-based CS SD OCT and the GPU-accelerated fast Fourier transform (FFT)-based CS SD OCT. It was found that our implementation has comparable performance to the GPU-accelerated MNUDFT-based CS SD OCT in terms of image quality while providing more than 5 times speed enhancement. When compared to the GPU-accelerated FFT-based CS SD OCT, it shows smaller background noise and fewer side lobes while eliminating the need for the cumbersome k-space grid filling and the k-linear calibration procedure. Finally, we demonstrated that by using a conventional desktop computer architecture having three GPUs, real-time B-mode imaging can be obtained in excess of 30 fps for the GPU-accelerated NUFFT-based CS SD OCT with frame size 2048(axial) × 1,000(lateral).
On the period determination of ASAS eclipsing binaries

NASA Astrophysics Data System (ADS)

Mayangsari, L.; Priyatikanto, R.; Putra, M.

2014-03-01

Variable stars, and eclipsing binaries in particular, are essential astronomical objects. Surveys are the backbone of astronomy, and many discoveries of variable stars are the results of surveys. The All-Sky Automated Survey (ASAS) is one of the observing projects whose ultimate goal is photometric monitoring of variable stars. Since its first light in 1997, ASAS has collected 50,099 variable stars, with 11,076 eclipsing binaries among them. In the present work we focus on the period determination of the eclipsing binaries. Since the data points in each ASAS eclipsing binary light curve are sparse, period determination of any system is not a straightforward process. For 30 samples of such systems we compare the implementation of the Lomb-Scargle algorithm, which is Fast Fourier Transform (FFT) based, and the Phase Dispersion Minimization (PDM) method, which is not FFT based, to determine their periods. It is demonstrated that PDM gives better performance at handling eclipsing detached (ED) systems, whose variability is non-sinusoidal. Moreover, using semi-automatic recipes, we obtain better period solutions and satisfactorily improve the light curves of 53% of the selected objects, but fail for another 7% of the selected objects. In addition, we also highlight 4 interesting objects for further investigation.
Design of Passive Power Filter for Hybrid Series Active Power Filter using Estimation, Detection and Classification Method

NASA Astrophysics Data System (ADS)

Swain, Sushree Diptimayee; Ray, Pravat Kumar; Mohanty, K. B.

2016-06-01

This research paper presents the design of a shunt Passive Power Filter (PPF) in a Hybrid Series Active Power Filter (HSAPF) that employs a novel analytic methodology which is superior to FFT analysis. This novel approach consists of the estimation, detection and classification of the signals. The proposed method is applied to estimate, detect and classify power quality (PQ) disturbances such as harmonics. The proposed work deals with three methods: harmonic detection through the wavelet transform method, harmonic estimation by the Kalman Filter algorithm, and harmonic classification by the decision tree method. From the different types of mother wavelets in the wavelet transform method, db8 is selected as a suitable mother wavelet because of its potency on transient response and crouched oscillation at frequency domain. In the harmonic compensation process, the detected harmonic is compensated through the Hybrid Series Active Power Filter (HSAPF) based on Instantaneous Reactive Power Theory (IRPT). The efficacy of the proposed method is verified in the MATLAB/SIMULINK environment as well as with an experimental setup. The obtained results confirm the superiority of the proposed methodology over FFT analysis. This newly proposed PPF is used to make the conventional HSAPF more robust and stable.
GPU-completeness: theory and implications

NASA Astrophysics Data System (ADS)

Lin, I.-Jong

2011-01-01

This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a similar method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We will define this class of algorithm as "GPU-Complete" and will postulate the necessary properties of the algorithms for admission into this class. We will also formally relate this algorithmic space and the imaging algorithms space. This concept is based upon our experience in the print production area where GPUs (Graphic Processing Units) have shown a substantial cost/performance advantage within the context of HP-delivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for CPU: language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for GPU: video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs are establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation informs the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrates the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe that the selection of architecture can be defined in terms of properties of GPU-Completeness. For a well-defined subset of algorithms, GPU-Completeness is intended to connect the parallelism, algorithms and efficient architectures into a unified framework to show that multiple layers of parallel implementation are guided by the same underlying trade-off.
Crashworthiness simulations with DYNA3D

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schauer, D.A.; Hoover, C.G.; Kay, G.J.

1996-04-01

Current progress in parallel algorithm research and applications in vehicle crash simulation is described for the explicit, finite element algorithms in DYNA3D. Problem partitioning methods and parallel algorithms for contact at material interfaces are the two challenging algorithm research problems that are addressed. Two prototype parallel contact algorithms have been developed for treating the cases of local and arbitrary contact. Demonstration problems for local contact are crashworthiness simulations with 222 locally defined contact surfaces and a vehicle/barrier collision modeled with arbitrary contact. A simulation of crash tests conducted for a vehicle impacting a U-channel small sign post embedded in soil has been run on both the serial and parallel versions of DYNA3D. A significant reduction in computational time has been observed when running these problems on the parallel version. However, to achieve maximum efficiency, complex problems must be appropriately partitioned, especially when contact dominates the computation.
Implementation of real-time digital signal processing systems

NASA Technical Reports Server (NTRS)

Narasimha, M.; Peterson, A.; Narayan, S.

1978-01-01

Special purpose hardware implementation of DFT computers and digital filters is considered in the light of newly introduced algorithms and IC devices. Recent work by Winograd on high-speed convolution techniques for computing short length DFTs has motivated the development of more efficient algorithms, compared to the FFT, for evaluating the transform of longer sequences. Among these, prime factor algorithms appear suitable for special purpose hardware implementations. Architectural considerations in designing DFT computers based on these algorithms are discussed. With the availability of monolithic multiplier-accumulators, a direct implementation of IIR and FIR filters, using random access memories in place of shift registers, appears attractive. The memory addressing scheme involved in such implementations is discussed. A simple counter set-up to address the data memory in the realization of FIR filters is also described. The combination of a set of simple filters (weighting network) and a DFT computer is shown to realize a bank of uniform bandpass filters. The usefulness of this concept in arriving at a modular design for a million channel spectrum analyzer, based on microprocessors, is discussed.
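As a rough illustration of the prime factor idea mentioned in the abstract above, the following sketch uses the standard Good-Thomas index mapping to build a length-N DFT from two short DFTs with coprime lengths and no twiddle factors between the stages. It is a minimal software sketch, not the hardware architecture of the paper; the function names dft and prime_factor_dft are assumptions made for this example, and the short DFTs are computed directly rather than with Winograd's short-length modules.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used here as the short-length building block."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def prime_factor_dft(x, n1, n2):
    """DFT of length n1*n2 (n1 and n2 coprime) via Good-Thomas index maps."""
    n = n1 * n2
    assert len(x) == n
    # Input map: index (n2*a + n1*b) mod N arranges x as an n1-by-n2 array.
    grid = [[x[(n2 * a + n1 * b) % n] for b in range(n2)] for a in range(n1)]
    # Length-n2 DFTs along rows, then length-n1 DFTs along columns.
    rows = [dft(row) for row in grid]
    cols = [dft([rows[a][k2] for a in range(n1)]) for k2 in range(n2)]
    # Output map (Chinese remainder theorem): k = k1 (mod n1), k = k2 (mod n2).
    u1 = n2 * pow(n2, -1, n1)  # modular inverse via pow needs Python 3.8+
    u2 = n1 * pow(n1, -1, n2)
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[(k1 * u1 + k2 * u2) % n] = cols[k2][k1]
    return out
```

For example, prime_factor_dft(x, 3, 5) on a 15-point sequence should agree with dft(x) up to rounding error. The absence of twiddle multiplications between the two stages is what makes this family attractive for fixed-function hardware.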
Line-drawing algorithms for parallel machines

NASA Technical Reports Server (NTRS)

Pang, Alex T.

1990-01-01

The fact that conventional line-drawing algorithms, when applied directly on parallel machines, can lead to very inefficient codes is addressed. It is suggested that instead of modifying an existing algorithm for a parallel machine, a more efficient implementation can be produced by going back to the invariants in the definition. Popular line-drawing algorithms are compared with two alternatives: distance to a line (a point is on the line if sufficiently close to it) and intersection with a line (a point is on the line if it is an intersection point). For massively parallel single-instruction-multiple-data (SIMD) machines (with thousands of processors and up), the alternatives provide viable line-drawing algorithms. Because of the pixel-per-processor mapping, their performance is independent of the line length and orientation.
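The "distance to a line" alternative described above can be sketched as a per-pixel test that does not depend on any other pixel, which is why it maps naturally onto one processor per pixel. The sketch below is an assumption-laden illustration: the function name, the tolerance parameter tol, and the use of pixel centers at integer coordinates are choices made for this example, and the serial double loop merely stands in for the SIMD processor array.

```python
def line_pixels_by_distance(x0, y0, x1, y1, width, height, tol=0.5):
    """Return the pixels whose distance to segment (x0,y0)-(x1,y1) is <= tol."""
    dx, dy = x1 - x0, y1 - y0
    length_sq = dx * dx + dy * dy
    pixels = set()
    # On a SIMD machine each (x, y) test would run on its own processor.
    for y in range(height):
        for x in range(width):
            if length_sq == 0:
                # Degenerate segment: distance to the single endpoint.
                dist_sq = (x - x0) ** 2 + (y - y0) ** 2
            else:
                # Project the pixel onto the segment and clamp to its ends.
                t = ((x - x0) * dx + (y - y0) * dy) / length_sq
                t = max(0.0, min(1.0, t))
                px, py = x0 + t * dx, y0 + t * dy
                dist_sq = (x - px) ** 2 + (y - py) ** 2
            if dist_sq <= tol * tol:
                pixels.add((x, y))
    return pixels
```

Every pixel performs the same constant amount of work, so in the one-processor-per-pixel setting the running time does not vary with line length or orientation, consistent with the claim in the abstract.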
1-FFT amino acids involved in high DP inulin accumulation in Viguiera discolor

PubMed Central

De Sadeleer, Emerik; Vergauwen, Rudy; Struyf, Tom; Le Roy, Katrien; Van den Ende, Wim

2015-01-01

Fructans are important vacuolar reserve carbohydrates with drought, cold, ROS and general abiotic stress mediating properties. They occur in 15% of all flowering plants and are believed to display health benefits as a prebiotic and dietary fiber. Fructans are synthesized by specific fructosyltransferases and classified based on the linkage type between fructosyl units. Inulins, one of these fructan types with β(2-1) linkages, are elongated by fructan:fructan 1-fructosyltransferases (1-FFT) using a fructosyl unit from a donor inulin to elongate the acceptor inulin molecule. The sequence identity of the 1-FFT of Viguiera discolor (Vd) and Helianthus tuberosus (Ht) is 91% although these enzymes produce distinct fructans. The Vd 1-FFT produces high degree of polymerization (DP) inulins by preferring the elongation of long chain inulins, in contrast to the Ht 1-FFT which prefers small molecules (DP3 or 4) as acceptor. Since higher DP inulins have interesting properties for industrial, food and medical applications, we report here on the influence of two amino acids on the high DP inulin production capacity of the Vd 1-FFT. Introducing the M19F and H308T mutations in the active site of the Vd 1-FFT greatly reduces its capacity to produce high DP inulin molecules. Both amino acids can be considered important to this capacity, although the double mutation had a much higher impact than the single mutations. PMID:26322058
Multiprocessing the Sieve of Eratosthenes

NASA Technical Reports Server (NTRS)

Bokhari, S.

1986-01-01

The Sieve of Eratosthenes for finding prime numbers in recent years has seen much use as a benchmark algorithm for serial computers while its intrinsically parallel nature has gone largely unnoticed. The implementation of a parallel version of this algorithm for a real parallel computer, the Flex/32, is described and its performance discussed. It is shown that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of the algorithm, it is impossible to get any speedup beyond 4 or 5 processors unless some form of dynamic load balancing is employed. We describe the performance of our algorithm with and without load balancing and compare it with theoretical lower bounds and simulated results. It is straightforward to understand this algorithm and to check the final results. However, its efficient implementation on a real parallel machine requires thoughtful design, especially if dynamic load balancing is desired. The fundamental operations required by the algorithm are very simple: this means that the slightest overhead appears prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor but is also an interesting challenge for the programmer.
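One common way to expose the parallelism mentioned above is a segmented sieve: after the small "base" primes are found, each segment of the number range can be sieved independently. The sketch below is a generic illustration under that assumption and is not the Flex/32 implementation; the function names and the segment_size parameter are hypothetical. The segments are processed in a plain loop here, but each call to sieve_segment could be handed to a separate worker.

```python
import math

def base_primes(limit):
    """Primes up to limit by the classical sieve."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(limit) + 1):
        if sieve[p]:
            count = len(range(p * p, limit + 1, p))
            sieve[p * p :: p] = bytearray(count)
    return [i for i, flag in enumerate(sieve) if flag]

def sieve_segment(lo, hi, primes):
    """Primes in [lo, hi), using only the shared list of base primes."""
    flags = bytearray([1]) * (hi - lo)
    for p in primes:
        if p * p >= hi:
            break
        # First multiple of p in the segment, never below p*p.
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            flags[m - lo] = 0
    return [lo + i for i, f in enumerate(flags) if f and lo + i >= 2]

def primes_up_to(n, segment_size=16):
    """Collect primes <= n segment by segment; segments are independent."""
    primes = base_primes(math.isqrt(n))
    result = []
    for lo in range(0, n + 1, segment_size):
        result.extend(sieve_segment(lo, min(lo + segment_size, n + 1), primes))
    return result
```

Because the inner operations are so cheap, the cost of starting and coordinating workers can easily dominate, which is the sensitivity to overhead that the abstract emphasizes.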
A Parallel Rendering Algorithm for MIMD Architectures

NASA Technical Reports Server (NTRS)

Crockett, Thomas W.; Orloff, Tobias

1991-01-01

Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.

A highly efficient multi-core algorithm for clustering extremely large datasets

PubMed Central

2010-01-01

Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
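The general shape of a shared-memory k-means iteration can be sketched as follows: the data set is split into chunks, each chunk computes partial sums and counts per cluster, and the partial results are merged into new centroids. This is a plain data-parallel sketch, not the transactional-memory design or the Java code of the paper; the function names and the n_chunks parameter are assumptions, and the chunks are processed in a loop where a real implementation would assign them to cores.

```python
def assign_chunk(points, centroids):
    """Partial per-cluster coordinate sums and counts for one chunk of points."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Nearest centroid by squared Euclidean distance.
        best = min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]
    return sums, counts

def kmeans_step(points, centroids, n_chunks=2):
    """One k-means update; each chunk could be processed by a separate core."""
    size = max(1, (len(points) + n_chunks - 1) // n_chunks)
    partials = [assign_chunk(points[i:i + size], centroids)
                for i in range(0, len(points), size)]
    k, dim = len(centroids), len(centroids[0])
    new_centroids = []
    for c in range(k):
        count = sum(part[1][c] for part in partials)
        if count == 0:
            # Keep an empty cluster's centroid unchanged.
            new_centroids.append(list(centroids[c]))
        else:
            new_centroids.append(
                [sum(part[0][c][d] for part in partials) / count for d in range(dim)])
    return new_centroids
```

Only the small per-cluster sums and counts are shared between chunks, so the merge step is cheap compared with the distance computations.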
A sweep algorithm for massively parallel simulation of circuit-switched networks

NASA Technical Reports Server (NTRS)

Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.

1992-01-01

A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel iPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.
Constraint treatment techniques and parallel algorithms for multibody dynamic analysis. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Chiou, Jin-Chern

1990-01-01

Computational procedures for kinematic and dynamic analysis of three-dimensional multibody dynamic (MBD) systems are developed from the differential-algebraic equations (DAEs) viewpoint. Constraint violations during the time integration process are minimized and penalty constraint stabilization techniques and partitioning schemes are developed. The governing equations of motion are treated with a two-stage staggered explicit-implicit numerical algorithm that takes advantage of a partitioned solution procedure. A robust and parallelizable integration algorithm is developed. This algorithm uses a two-stage staggered central difference algorithm to integrate the translational coordinates and the angular velocities. The angular orientations of bodies in MBD systems are then obtained by using an implicit algorithm via the kinematic relationship between Euler parameters and angular velocities. It is shown that the combination of the present solution procedures yields a computationally more accurate solution. To speed up the computational procedures, parallel implementation of the present constraint treatment techniques and the two-stage staggered explicit-implicit numerical algorithm was efficiently carried out. The DAEs and the constraint treatment techniques were transformed into arrowhead matrices from which a Schur complement form was derived. By fully exploiting the sparse matrix structural analysis techniques, a parallel preconditioned conjugate gradient numerical algorithm is used to solve the system equations written in Schur complement form. A software testbed was designed and implemented on both sequential and parallel computers. This testbed was used to demonstrate the robustness and efficiency of the constraint treatment techniques, the accuracy of the two-stage staggered explicit-implicit numerical algorithm, and the speedup of the Schur-complement-based parallel preconditioned conjugate gradient algorithm on a parallel computer.
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

DOE Office of Scientific and Technical Information (OSTI.GOV)

Azad, Ariful; Buluç, Aydın; Pothen, Alex

It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
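For readers unfamiliar with augmenting paths, the sketch below shows the classical single-source approach that the abstract contrasts with its multiple-source method: each unmatched left vertex starts a depth-first search for an augmenting path, and its search state is discarded afterwards. This is the textbook algorithm, not the tree-grafting algorithm of the paper; the function name and the adjacency-list input format are assumptions for this example.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum cardinality matching by single-source augmenting-path search.

    adj[u] lists the right-side vertices adjacent to left vertex u.
    Returns (matching size, match_right), where match_right[v] is the left
    vertex matched to right vertex v, or -1 if v is unmatched.
    """
    match_right = [-1] * n_right

    def try_augment(u, visited):
        # Look for an augmenting path that starts at left vertex u.
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere.
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adj)):
        # A fresh visited set per source: the search tree is thrown away.
        if try_augment(u, set()):
            size += 1
    return size, match_right
```

The searches run one after another and each follows a single path at a time, which illustrates why long augmenting paths leave little concurrency to exploit.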
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

DOE PAGES

Azad, Ariful; Buluç, Aydın; Pothen, Alex

2016-03-24

It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
Fast parallel approach for 2-D DHT-based real-valued discrete Gabor transform.

PubMed

Tao, Liang; Kwan, Hon Keung

2009-12-01

Two-dimensional fast Gabor transform algorithms are useful for real-time applications due to the high computational complexity of the traditional 2-D complex-valued discrete Gabor transform (CDGT). This paper presents two block time-recursive algorithms for 2-D DHT-based real-valued discrete Gabor transform (RDGT) and its inverse transform and develops a fast parallel approach for the implementation of the two algorithms. The computational complexity of the proposed parallel approach is analyzed and compared with that of the existing 2-D CDGT algorithms. The results indicate that the proposed parallel approach is attractive for real-time image processing.
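The real-valued transform in this record builds on the discrete Hartley transform (DHT), which replaces the complex exponential of the DFT with the real kernel cos + sin. The 1-D sketch below illustrates only that underlying transform and two of its standard properties; it is not the 2-D Gabor transform or the block time-recursive algorithms of the paper, and the function names dht and dft are chosen for this example.

```python
import cmath
import math

def dht(x):
    """Discrete Hartley transform: H[k] = sum_t x[t] * (cos + sin)(2*pi*k*t/N)."""
    n = len(x)
    return [sum(x[t] * (math.cos(2 * math.pi * k * t / n)
                        + math.sin(2 * math.pi * k * t / n)) for t in range(n))
            for k in range(n)]

def dft(x):
    """Direct DFT for comparison: F[k] = sum_t x[t] * exp(-2*pi*i*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

For real input, H[k] equals Re(F[k]) - Im(F[k]), so a DHT can be obtained from an FFT, and applying dht twice returns the input scaled by N, so the same routine serves as its own inverse up to the factor 1/N. Real-valued arithmetic throughout is the source of the savings over the complex-valued transform.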
Communications oriented programming of parallel iterative solutions of sparse linear systems

NASA Technical Reports Server (NTRS)

Patrick, M. L.; Pratt, T. W.

1986-01-01

Parallel algorithms are developed for a class of scientific computational problems by partitioning the problems into smaller problems which may be solved concurrently. The effectiveness of the resulting parallel solutions is determined by the amount and frequency of communication and synchronization and the extent to which communication can be overlapped with computation. Three different parallel algorithms for solving the same class of problems are presented, and their effectiveness is analyzed from this point of view. The algorithms are programmed using a new programming environment. Run-time statistics and experience obtained from the execution of these programs assist in measuring the effectiveness of these algorithms.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.

PubMed

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

PubMed Central

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
Research on parallel algorithm for sequential pattern mining

NASA Astrophysics Data System (ADS)

Zhou, Lijuan; Qin, Bai; Wang, Yu; Hao, Zhongxiao

2008-03-01

Sequential pattern mining is the mining of frequent sequences related to time or other orders from the sequence database. Its initial motivation is to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field is no longer confined to business databases but has extended to new data sources such as the Web and to advanced science fields such as DNA analysis. The data used in sequential pattern mining has two characteristics: massive volume and distributed storage. Most existing sequential pattern mining algorithms have not considered these characteristics together. Based on these traits and on parallel computing theory, this paper puts forward a new distributed parallel algorithm, SPP (Sequential Pattern Parallel). The algorithm abides by the principle of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent item sets by applying the frequent concept and search space partition theory, and the second task is to construct frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and does not generate candidate sequences, which reduces the access time and improves the mining efficiency. Based on a random data generation procedure and the different information structures designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll algorithm. The experiments demonstrate that, compared with AprioriAll, the SPP algorithm had an excellent speedup factor and efficiency.
Logarithmic Laplacian Prior Based Bayesian Inverse Synthetic Aperture Radar Imaging.

PubMed

Zhang, Shuanghui; Liu, Yongxiang; Li, Xiang; Bi, Guoan

2016-04-28

This paper presents a novel Inverse Synthetic Aperture Radar (ISAR) imaging algorithm based on a new sparse prior, known as the logarithmic Laplacian prior. The newly proposed logarithmic Laplacian prior has a narrower main lobe with higher tail values than the Laplacian prior, which helps to achieve performance improvement on sparse representation. The logarithmic Laplacian prior is used for ISAR imaging within the Bayesian framework to achieve a better focused radar image. In the proposed method of ISAR imaging, the phase errors are jointly estimated based on the minimum entropy criterion to accomplish autofocusing. The maximum a posteriori (MAP) estimation and the maximum likelihood estimation (MLE) are utilized to estimate the model parameters to avoid a manual tuning process. Additionally, the fast Fourier transform (FFT) and Hadamard product are used to reduce the required computation. Experimental results based on both simulated and measured data validate that the proposed algorithm outperforms the traditional sparse ISAR imaging algorithms in terms of resolution improvement and noise suppression.
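A generic example of how an FFT combined with a Hadamard (element-wise) product saves work is the convolution theorem: a circular convolution that costs O(N^2) when computed directly becomes two forward FFTs, one element-wise product, and one inverse FFT. The sketch below shows only this general identity and is not the ISAR algorithm of the paper, whose exact use of the FFT is not described in the abstract; the radix-2 routine and function names are assumptions, and the input length must be a power of two.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. Unnormalized."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def circular_convolve(a, b):
    """Circular convolution of two real sequences via FFT and Hadamard product."""
    n = len(a)
    fa, fb = fft(a), fft(b)
    product = [u * v for u, v in zip(fa, fb)]  # Hadamard product
    # The inverse transform is unnormalized, so divide by n here.
    return [v.real / n for v in fft(product, inverse=True)]
```

For instance, convolving a sequence with a unit impulse delayed by one sample should return the sequence rotated by one position, up to rounding error.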
icoshift: A versatile tool for the rapid alignment of 1D NMR spectra

NASA Astrophysics Data System (ADS)

Savorani, F.; Tomasi, G.; Engelsen, S. B.

2010-02-01

The increasing scientific and industrial interest towards metabonomics takes advantage of the high qualitative and quantitative information level of nuclear magnetic resonance (NMR) spectroscopy. However, several chemical and physical factors can affect the absolute and the relative position of an NMR signal and it is not always possible or desirable to eliminate these effects a priori. To remove misalignment of NMR signals a posteriori, several algorithms have been proposed in the literature. The icoshift program presented here is an open source and highly efficient program designed for solving signal alignment problems in metabonomic NMR data analysis. The icoshift algorithm is based on correlation shifting of spectral intervals and employs an FFT engine that aligns all spectra simultaneously. The algorithm is demonstrated to be faster than similar methods found in the literature making full-resolution alignment of large datasets feasible and thus avoiding down-sampling steps such as binning. The algorithm uses missing values as a filling alternative in order to avoid spectral artifacts at the segment boundaries. The algorithm is made open source and the Matlab code including documentation can be downloaded from www.models.life.ku.dk.
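The idea of correlation shifting can be sketched as a search for the integer shift that maximizes the cross-correlation between a spectrum interval and a reference interval. The example below is a simplified illustration and not the icoshift code: it evaluates the correlation directly instead of with an FFT, wraps the signal circularly instead of filling with missing values, and the names best_shift, apply_shift, and max_shift are assumptions for this sketch.

```python
def apply_shift(signal, shift):
    """Shift a signal by an integer number of points (circular wrap-around)."""
    n = len(signal)
    return [signal[(i - shift) % n] for i in range(n)]

def best_shift(reference, signal, max_shift):
    """Shift in [-max_shift, max_shift] that maximizes correlation with reference."""
    n = len(reference)

    def score(shift):
        # Inner product of the reference with the shifted signal.
        return sum(reference[i] * signal[(i - shift) % n] for i in range(n))

    return max(range(-max_shift, max_shift + 1), key=score)
```

A typical use would be apply_shift(signal, best_shift(reference, signal, max_shift)) for each interval of each spectrum. Computing all the correlation scores with an FFT instead of the direct sum is what makes the full-resolution approach practical for large data sets.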
Parallel/distributed direct method for solving linear systems

NASA Technical Reports Server (NTRS)

Lin, Avi

1990-01-01

A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit a near optimal performance and enjoy several important features: (1) For large enough linear systems, the design of the appropriate parallel algorithm is insensitive to the number of processors as its performance grows monotonically with them; (2) It is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) It can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) This set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmic changes.
SIAM Conference on Parallel Processing for Scientific Computing, 4th, Chicago, IL, Dec. 11-13, 1989, Proceedings

NASA Technical Reports Server (NTRS)

Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)

1990-01-01

Attention is given to such topics as an evaluation of block algorithm variants in LAPACK, a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
3D near-to-surface conductivity reconstruction by inversion of VETEM data using the distorted Born iterative method

USGS Publications Warehouse

Wang, G.L.; Chew, W.C.; Cui, T.J.; Aydiner, A.A.; Wright, D.L.; Smith, D.V.

2004-01-01

Three-dimensional (3D) subsurface imaging by inversion of data obtained from the very early time electromagnetic system (VETEM) was discussed. The study was carried out by using the distorted Born iterative method to match the internal nonlinear property of the 3D inversion problem. The forward solver was based on the total-current formulation bi-conjugate gradient-fast Fourier transform (BCCG-FFT). It was found that the selection of the regularization parameter follows a heuristic rule, as used in the Levenberg-Marquardt algorithm, so that the iteration is stable.
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms.

PubMed

Chen, Chunlei; He, Li; Zhang, Huixiang; Zheng, Hao; Wang, Lei

2017-01-01

Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering place high demands on the computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, the General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, incremental clustering algorithms face a dilemma between clustering accuracy and parallelism when they are powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering like evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experimental results verified the theoretical conclusions.
Application of integration algorithms in a parallel processing environment for the simulation of jet engines

NASA Technical Reports Server (NTRS)

Krosel, S. M.; Milner, E. J.

1982-01-01

The application of predictor-corrector integration algorithms developed for the digital parallel processing environment is investigated. The algorithms are implemented and evaluated through the use of a software simulator which provides an approximate representation of the parallel processing hardware. Test cases which focus on the use of the algorithms are presented and a specific application using a linear model of a turbofan engine is considered. Results are presented showing the effects of integration step size and the number of processors on simulation accuracy. Real time performance, interprocessor communication, and algorithm startup are also discussed.
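As a reminder of what a predictor-corrector scheme looks like, the sketch below implements Heun's method: an explicit Euler step predicts the next state, and a trapezoidal step corrects it using the derivative at the predicted point. This is one of the simplest members of the family and is not necessarily one of the algorithms studied in the report; the function names and the scalar test equation are assumptions chosen for illustration.

```python
def heun_step(f, t, y, h):
    """One predictor-corrector step for dy/dt = f(t, y) with step size h."""
    slope_start = f(t, y)
    predictor = y + h * slope_start          # explicit Euler predictor
    slope_end = f(t + h, predictor)          # derivative at the predicted state
    return y + 0.5 * h * (slope_start + slope_end)  # trapezoidal corrector

def integrate(f, y0, t0, t1, steps):
    """Integrate from t0 to t1 with a fixed number of equal steps."""
    h = (t1 - t0) / steps
    y = y0
    for i in range(steps):
        y = heun_step(f, t0 + i * h, y, h)
    return y
```

Integrating dy/dt = -y from y(0) = 1 to t = 1 with more steps should bring the result closer to exp(-1), which is the kind of step-size effect on accuracy that the report examines for its engine model.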
Efficient Parallel Algorithm For Direct Numerical Simulation of Turbulent Flows

NASA Technical Reports Server (NTRS)

Moitra, Stuti; Gatski, Thomas B.

1997-01-01

A distributed algorithm for a high-order-accurate finite-difference approach to the direct numerical simulation (DNS) of transition and turbulence in compressible flows is described. This work has two major objectives. The first objective is to demonstrate that parallel and distributed-memory machines can be successfully and efficiently used to solve computationally intensive and input/output intensive algorithms of the DNS class. The second objective is to show that the computational complexity involved in solving the tridiagonal systems inherent in the DNS algorithm can be reduced by algorithm innovations that obviate the need to use a parallelized tridiagonal solver.
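To see why tridiagonal systems are awkward on parallel machines, consider the standard serial solver (the Thomas algorithm): both the forward elimination and the back substitution carry a dependence from one row to the next. The sketch below is that textbook algorithm, shown only as background and not as the method of the paper, which avoids a parallelized tridiagonal solver altogether; the function name and argument layout are assumptions for this example.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with at least two unknowns.

    Row i reads lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i];
    lower[0] and upper[-1] are ignored. No pivoting is performed, so the
    matrix is assumed to be well conditioned (e.g. diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: row i depends on the result of row i-1.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution: x[i] depends on x[i+1].
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Each loop is inherently sequential along the line being solved, so distributing a single line across processors requires either a different algorithm or a reorganization of the data, which is the cost the abstract says its innovations avoid.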
Efficiency Analysis of the Parallel Implementation of the SIMPLE Algorithm on Multiprocessor Computers

NASA Astrophysics Data System (ADS)

Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.

2017-12-01

This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
MULTIOBJECTIVE PARALLEL GENETIC ALGORITHM FOR WASTE MINIMIZATION

EPA Science Inventory

In this research we have developed an efficient multiobjective parallel genetic algorithm (MOPGA) for waste minimization problems. This MOPGA integrates PGAPack (Levine, 1996) and NSGA-II (Deb, 2000) with novel modifications. PGAPack is a master-slave parallel implementation of a...

A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.

PubMed

Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang

2018-01-01

The strategy of Divide-and-Conquer (D&C) is one of the frequently used programming patterns to design efficient algorithms in computer science, which has been parallelized on shared memory systems and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm, QuickHull, to give a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers to develop their own efficient GPU implementations with fewer efforts.
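For orientation, here is a minimal serial sketch of the QuickHull divide-and-conquer recursion in two dimensions. It is not the authors' GPU implementation and does not follow the Tzeng and Owens paradigm; the function names are hypothetical and the code only shows the recursive split that a parallel version would distribute.

```python
def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); positive if b is left of o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def _hull_side(points, a, b):
    """Hull vertices strictly to the left of the directed line a->b, ordered from a to b."""
    left = [p for p in points if cross(a, b, p) > 0]
    if not left:
        return []
    # Divide: the farthest point from the line is guaranteed to be on the hull.
    far = max(left, key=lambda p: cross(a, b, p))
    # Conquer: recurse on the two sub-problems outside the triangle (a, far, b).
    return _hull_side(left, a, far) + [far] + _hull_side(left, far, b)


def quickhull(points):
    """Return the convex hull vertices in clockwise order, starting at the leftmost point."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    a, b = pts[0], pts[-1]          # leftmost and rightmost points
    upper = _hull_side(pts, a, b)   # points above the line a->b
    lower = _hull_side(pts, b, a)   # points below the line a->b
    return [a] + upper + [b] + lower
```

For example, `quickhull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)])` keeps only the four corners of the square; the interior point and the collinear edge point are discarded.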
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Davis, Kristan D.; Faraj, Daniel A.

2014-07-22

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
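The selection step described above, in which each algorithm is associated with a separate range of message sizes and one is chosen according to the size of the message, can be sketched as a sorted lookup table. The algorithm names and byte thresholds below are hypothetical placeholders, not values from the patent or from any PAMI library.

```python
from bisect import bisect_right

# Hypothetical table: each entry is (lower bound in bytes, algorithm name).
# An algorithm owns the sizes from its lower bound up to, but not including,
# the next entry's lower bound, so the ranges never overlap.
RANGES = [
    (0, "immediate"),
    (128, "eager"),
    (4096, "rendezvous"),
]
LOWER_BOUNDS = [lower for lower, _ in RANGES]


def select_algorithm(message_size):
    """Return the name of the algorithm whose size range contains message_size."""
    if message_size < 0:
        raise ValueError("message size must be non-negative")
    index = bisect_right(LOWER_BOUNDS, message_size) - 1
    return RANGES[index][1]
```

With this table, `select_algorithm(127)` falls in the first range and `select_algorithm(128)` in the second; the binary search keeps the lookup cheap even if many ranges are registered.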
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Davis, Kristan D; Faraj, Daniel A

2013-07-09

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
FFT-enhanced IHS transform method for fusing high-resolution satellite images

USGS Publications Warehouse

Ling, Y.; Ehlers, M.; Usery, E.L.; Madden, M.

2007-01-01

Existing image fusion techniques such as the intensity-hue-saturation (IHS) transform and principal components analysis (PCA) methods may not be optimal for fusing the new generation commercial high-resolution satellite images such as Ikonos and QuickBird. One problem is color distortion in the fused image, which causes visual changes as well as spectral differences between the original and fused images. In this paper, a fast Fourier transform (FFT)-enhanced IHS method is developed for fusing new generation high-resolution satellite images. This method combines a standard IHS transform with FFT filtering of both the panchromatic image and the intensity component of the original multispectral image. Ikonos and QuickBird data are used to assess the FFT-enhanced IHS transform method. Experimental results indicate that the FFT-enhanced IHS transform method may improve upon the standard IHS transform and the PCA methods in preserving spectral and spatial information. © 2006 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS).
Graphical Representation of Parallel Algorithmic Processes

DTIC Science & Technology

1990-12-01

interface with the AAARF main process. The source code for the AAARF class-common library is in the common subdirectory and consists of the following files... for public release; distribution unlimited AFIT/GCE/ENG/90D-07 Graphical Representation of Parallel Algorithmic Processes THESIS Presented to the...goal of this study is to develop an algorithm animation facility for parallel processes executing on different architectures, from multiprocessor
The openGL visualization of the 2D parallel FDTD algorithm

NASA Astrophysics Data System (ADS)

Walendziuk, Wojciech

2005-02-01

This paper presents a way of visualization of a two-dimensional version of a parallel algorithm of the FDTD method. The visualization module was created on the basis of the OpenGL graphic standard with the use of the GLUT interface. In addition, the work includes the results of the efficiency of the parallel algorithm in the form of speedup charts.
Implementing a Parallel Image Edge Detection Algorithm Based on the Otsu-Canny Operator on the Hadoop Platform.

PubMed

Cao, Jianfang; Chen, Lichao; Wang, Min; Tian, Yun

2018-01-01

The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach speeds up processing by approximately 3.4 times on large-scale datasets, which demonstrates the superiority of our method. The proposed algorithm demonstrates both better edge detection performance and improved time performance.
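As background for the Otsu step mentioned above, the following minimal sketch computes an Otsu threshold from a list of integer gray levels by maximizing the between-class variance. It is a generic textbook formulation, not the authors' MapReduce code; the function name and the flat-list input format are assumptions made here.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that best separates pixels <= t from pixels > t."""
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    weight_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0.0       # sum of their gray levels
    best_t, best_var = 0, -1.0
    for t in range(levels):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue   # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break      # every pixel is background; no further split possible
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t
```

How the resulting value is turned into Canny's two thresholds is not stated in the abstract. One common heuristic, given here only as an assumption, is to use the Otsu value as the high threshold and half of it as the low threshold.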
A parallel algorithm for switch-level timing simulation on a hypercube multiprocessor

NASA Technical Reports Server (NTRS)

Rao, Hariprasad Nannapaneni

1989-01-01

The parallel approach to speeding up simulation is studied, specifically the simulation of digital LSI MOS circuitry on the Intel iPSC/2 hypercube. The simulation algorithm is based on RSIM, an event driven switch-level simulator that incorporates a linear transistor model for simulating digital MOS circuits. Parallel processing techniques based on the concepts of Virtual Time and rollback are utilized so that portions of the circuit may be simulated on separate processors, in parallel for as large an increase in speed as possible. A partitioning algorithm is also developed in order to subdivide the circuit for parallel processing.
Rapid prototyping of update algorithm of discrete Fourier transform for real-time signal processing

NASA Astrophysics Data System (ADS)

Kakad, Yogendra P.; Sherlock, Barry G.; Chatapuram, Krishnan V.; Bishop, Stephen

2001-10-01

An algorithm is developed in the companion paper to update the existing DFT to represent the new data series that results when a new signal point is received. Updating the DFT in this way uses less computation than directly evaluating the DFT using the FFT algorithm. This reduces the computational order by a factor of log2 N. The algorithm is able to work in the presence of a data window function, for use with the rectangular, split triangular, Hanning, Hamming, and Blackman windows. In this paper, a hardware implementation of this algorithm, using FPGA technology, is outlined. Unlike traditional fully customized VLSI circuits, FPGAs represent a technical breakthrough in the corresponding industry. The FPGA implements thousands of gates of logic in a single IC chip and it can be programmed by users at their site in a few seconds or less depending on the type of device used. The risk is low and the development time is short. These advantages have made FPGAs very popular for rapid prototyping of algorithms in the area of digital communication, digital signal processing, and image processing. Our paper addresses the related issues of implementation using a hardware description language in the development of the design and the subsequent downloading onto the programmable hardware chip.
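The idea of updating a DFT when one new sample arrives can be illustrated with the standard sliding-DFT recurrence for a rectangular window. This is a sketch of the general technique, not the authors' algorithm or their FPGA design, and it does not cover the other window functions listed in the abstract. The function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used here only as a reference for checking the update."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def update_dft(spectrum, oldest, newest):
    """Slide the window by one sample: drop `oldest`, append `newest`.

    Each of the N bins costs a constant number of operations, so one update
    is O(N) instead of the O(N log N) of recomputing an FFT.
    """
    n = len(spectrum)
    return [
        (spectrum[k] - oldest + newest) * cmath.exp(2j * cmath.pi * k / n)
        for k in range(n)
    ]
```

The O(N) cost per update versus O(N log N) for a fresh FFT matches the factor of log2 N saving stated in the abstract.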
Study of subband electronic structure of Si δ-doped GaAs using magnetotransport measurements in tilted magnetic fields

NASA Astrophysics Data System (ADS)

Li, G.; Hauser, N.; Jagadish, C.; Antoszewski, J.; Xu, W.

1996-06-01

Si δ-doped GaAs grown by metal organic vapor phase epitaxy (MOVPE) is characterized using magnetotransport measurements in tilted magnetic fields. Angular dependence of the longitudinal magnetoresistance (Rxx) vs the magnetic field (B) traces in tilted magnetic fields is used to examine the existence of a quasi-two-dimensional electron gas. The subband electron densities (ni) are obtained applying fast Fourier transform (FFT) analysis to the Rxx vs B trace and using mobility spectrum (MS) analysis of the magnetic field dependent Hall data. Our results show that (1) the subband electron densities remain roughly constant when the tilted magnetic field with an angle <30° measured from the Si δ-doped plane normal is ramped up to 13 T; (2) FFT analysis of the Rxx vs B trace and MS analysis of the magnetic field dependent Hall data both give the comparable results on subband electron densities of Si δ-doped GaAs with low δ-doping concentration, however, for Si δ-doped GaAs with very high δ-doping concentration, the occupation of the lowest subbands cannot be well resolved in the MS analysis; (3) the highest subband electron mobility reported to date of 45 282 cm2/s V is observed in Si δ-doped GaAs at 77 K in the dark; and (4) the subband electron densities of Si δ-doped GaAs grown by MOVPE at 700 °C are comparable to those grown by MBE at temperatures below 600 °C. A detailed study of magnetotransport properties of Si δ-doped GaAs in the parallel magnetic fields is then carried out to further confirm the subband electronic structures revealed by FFT and MS analysis. Our results are compared to theoretical calculation previously reported in literature. In addition, influence of different cap layer structures on subband electronic structures of Si δ-doped GaAs is observed and also discussed.
FFT swept filtering: a bias-free method for processing fringe signals in absolute gravimeters

NASA Astrophysics Data System (ADS)

Křen, Petr; Pálinkáš, Vojtech; Mašika, Pavel; Val'ko, Miloš

2018-05-01

Absolute gravimeters, based on laser interferometry, are widely used for many applications in geoscience and metrology. Although currently the most accurate FG5 and FG5X gravimeters declare standard uncertainties at the level of 2-3 μGal, their inherent systematic errors affect the gravity reference determined by international key comparisons based predominately on the use of FG5-type instruments. The measurement results for FG5-215 and FG5X-251 clearly showed that the measured g-values depend on the size of the fringe signal and that this effect might be approximated by a linear regression with a slope of up to 0.030 μGal/mV. However, these empirical results do not enable one to identify the source of the effect or to determine a reasonable reference fringe level for correcting g-values in an absolute sense. Therefore, both gravimeters were equipped with new measuring systems (according to Křen et al. in Metrologia 53:27-40, 2016. https://doi.org/10.1088/0026-1394/53/1/27 applied for FG5), running in parallel with the original systems. The new systems use an analogue-to-digital converter HS5 to digitize the fringe signal and a new method of fringe signal analysis based on FFT swept bandpass filtering. We demonstrate that the source of the fringe size effect is connected to a distortion of the fringe signal due to the electronic components used in the FG5(X) gravimeters. To obtain a bias-free g-value, the FFT swept method should be applied for the determination of zero-crossings. A comparison of g-values obtained from the new and the original systems clearly shows that the original system might be biased by approximately 3-5 μGal due to improperly distorted fringe signal processing.
Commodity cluster and hardware-based massively parallel implementations of hyperspectral imaging algorithms

NASA Astrophysics Data System (ADS)

Plaza, Antonio; Chang, Chein-I.; Plaza, Javier; Valencia, David

2006-05-01

The incorporation of hyperspectral sensors aboard airborne/satellite platforms is currently producing a nearly continual stream of multidimensional image data, and this high data volume has quickly introduced new processing challenges. The price paid for the wealth of spatial and spectral information available from hyperspectral sensors is the enormous amount of data that they generate. Several applications exist, however, where having the desired information calculated quickly enough for practical use is highly desirable. High computing performance of algorithm analysis is particularly important in homeland defense and security applications, in which swift decisions often involve detection of (sub-pixel) military targets (including hostile weaponry, camouflage, concealment, and decoys) or chemical/biological agents. In order to speed up computational performance of hyperspectral imaging algorithms, this paper develops several fast parallel data processing techniques. Techniques include four classes of algorithms: (1) unsupervised classification, (2) spectral unmixing, (3) automatic target recognition, and (4) onboard data compression. A massively parallel Beowulf cluster (Thunderhead) at NASA's Goddard Space Flight Center in Maryland is used to measure parallel performance of the proposed algorithms. In order to explore the viability of developing onboard, real-time hyperspectral data compression algorithms, a Xilinx Virtex-II field programmable gate array (FPGA) is also used in experiments. Our quantitative and comparative assessment of parallel techniques and strategies may help image analysts in selection of parallel hyperspectral algorithms for specific applications.
Parallelizing serial code for a distributed processing environment with an application to high frequency electromagnetic scattering

NASA Astrophysics Data System (ADS)

Work, Paul R.

1991-12-01

This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high level and the detailed design specified in a high level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that it is possible to achieve an efficient parallel application for a serial computational electromagnetic program where the characteristics of the algorithm and the target architecture critically influence the development of such an implementation.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Chrisochoides, N.; Sukup, F.

In this paper we present a parallel implementation of the Bowyer-Watson (BW) algorithm using the task-parallel programming model. The BW algorithm constitutes an ideal mesh refinement strategy for implementing a large class of unstructured mesh generation techniques on both sequential and parallel computers, by preventing the need for global mesh refinement. Its implementation on distributed memory multicomputers using the traditional data-parallel model has proven very inefficient due to excessive synchronization needed among processors. In this paper we demonstrate that with the task-parallel model we can tolerate synchronization costs inherent to data-parallel methods by exploring concurrency in the processor level. Our preliminary performance data indicate that the task-parallel approach: (i) is almost four times faster than the existing data-parallel methods, (ii) scales linearly, and (iii) introduces minimum overheads compared to the "best" sequential implementation of the BW algorithm.
Laplace Transform Based Radiative Transfer Studies

NASA Astrophysics Data System (ADS)

Hu, Y.; Lin, B.; Ng, T.; Yang, P.; Wiscombe, W.; Herath, J.; Duffy, D.

2006-12-01

Multiple scattering is the major uncertainty for data analysis of space-based lidar measurements. Until now, accurate quantitative lidar data analysis has been limited to very thin objects that are dominated by single scattering, where photons from the laser beam scatter only a single time with particles in the atmosphere before reaching the receiver, and a simple linear relationship between physical property and lidar signal exists. In reality, multiple scattering is always a factor in space-based lidar measurement and it dominates space-based lidar returns from clouds, dust aerosols, vegetation canopy and phytoplankton. While multiple scattering returns are clear signals, the lack of a fast-enough lidar multiple scattering computation tool forces us to treat the signal as unwanted "noise" and use simple multiple scattering correction schemes to remove it. Such multiple scattering treatments waste the multiple scattering signals and may cause orders of magnitude errors in retrieved physical properties. Thus the lack of fast and accurate time-dependent radiative transfer tools significantly limits lidar remote sensing capabilities. Analyzing lidar multiple scattering signals requires fast and accurate time-dependent radiative transfer computations. Currently, multiple scattering is computed with Monte Carlo simulations. Monte Carlo simulations take minutes to hours and are too slow for interactive satellite data analysis processes and can only be used to help system / algorithm design and error assessment. We present an innovative physics approach to solve the time-dependent radiative transfer problem. The technique utilizes FPGA based reconfigurable computing hardware. The approach is as follows. 1. Physics solution: Perform Laplace transform on the time and spatial dimensions and Fourier transform on the viewing azimuth dimension, and convert the radiative transfer differential equation solving into a fast matrix inversion problem. The majority of the radiative transfer computation goes to matrix inversion processes, FFT and inverse Laplace transforms. 2. Hardware solutions: Perform the well-defined matrix inversion, FFT and Laplace transforms on highly parallel, reconfigurable computing hardware. This physics-based computational tool leads to accurate quantitative analysis of space-based lidar signals and improves data quality of current lidar missions such as CALIPSO. This presentation will introduce the basic idea of this approach, preliminary results based on SRC's FPGA-based Mapstation, and how we may apply it to CALIPSO data analysis.
Computational mechanics analysis tools for parallel-vector supercomputers

NASA Technical Reports Server (NTRS)

Storaasli, Olaf O.; Nguyen, Duc T.; Baddourah, Majdi; Qin, Jiangning

1993-01-01

Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigensolution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization search analysis and domain decomposition. The source code for many of these algorithms is available.
Radar Measurements of Ocean Surface Waves using Proper Orthogonal Decomposition

DTIC Science & Technology

2017-03-30

rely on use of Fourier transforms (FFT) and filtering spectra on the linear dispersion relationship for ocean surface waves. This report discusses...the measured signal (e.g., Young et al., 1985). In addition, the methods often rely on filtering the FFT of radar backscatter or Doppler velocities...to those obtained with conventional FFT and dispersion curve filtering techniques (iv) Compare both results of (iii) to ground truth sensors (i.e
Fluid density and concentration measurement using noninvasive in situ ultrasonic resonance interferometry

DOEpatents

Pope, Noah G.; Veirs, Douglas K.; Claytor, Thomas N.

1994-01-01

The specific gravity or solute concentration of a process fluid solution located in a selected structure is determined by obtaining a resonance response spectrum of the fluid/structure over a range of frequencies that are outside the response of the structure itself. A fast Fourier transform (FFT) of the resonance response spectrum is performed to form a set of FFT values. A peak value for the FFT values is determined, e.g., by curve fitting, to output a process parameter that is functionally related to the specific gravity and solute concentration of the process fluid solution. Calibration curves are required to correlate the peak FFT value over the range of expected specific gravities and solute concentrations in the selected structure.
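The abstract mentions determining the peak of the FFT values by curve fitting but does not name the curve. As one possible reading, the sketch below fits a parabola through the largest sample and its two neighbors to estimate the peak location between bins. The parabola and the function name are assumptions, and the calibration step that maps the peak to a specific gravity is not shown.

```python
def refine_peak(values):
    """Estimate the fractional index of the maximum with a three-point parabola fit."""
    i = max(range(len(values)), key=values.__getitem__)
    if i == 0 or i == len(values) - 1:
        return float(i)          # no neighbor on one side; cannot interpolate
    a, b, c = values[i - 1], values[i], values[i + 1]
    curvature = a - 2.0 * b + c
    if curvature == 0.0:
        return float(i)          # flat top; keep the integer bin
    # Vertex of the parabola through (i-1, a), (i, b), (i+1, c).
    return i + 0.5 * (a - c) / curvature
```

For samples drawn from an exact parabola the fit recovers the true peak position; for real FFT magnitudes it is only an approximation whose quality depends on the peak shape.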
Fluid density and concentration measurement using noninvasive in situ ultrasonic resonance interferometry

DOEpatents

Pope, N.G.; Veirs, D.K.; Claytor, T.N.

1994-10-25

The specific gravity or solute concentration of a process fluid solution located in a selected structure is determined by obtaining a resonance response spectrum of the fluid/structure over a range of frequencies that are outside the response of the structure itself. A fast Fourier transform (FFT) of the resonance response spectrum is performed to form a set of FFT values. A peak value for the FFT values is determined, e.g., by curve fitting, to output a process parameter that is functionally related to the specific gravity and solute concentration of the process fluid solution. Calibration curves are required to correlate the peak FFT value over the range of expected specific gravities and solute concentrations in the selected structure. 7 figs.
A unifying framework for rigid multibody dynamics and serial and parallel computational issues

NASA Technical Reports Server (NTRS)

Fijany, Amir; Jain, Abhinandan

1989-01-01

A unifying framework for various formulations of the dynamics of open-chain rigid multibody systems is discussed. Their suitability for serial and parallel processing is assessed. The framework is based on the derivation of intrinsic, i.e., coordinate-free, equations of the algorithms, which provides a suitable abstraction and permits a distinction to be made between the computational redundancy in the intrinsic and extrinsic equations. A set of spatial notation is used which allows the derivation of the various algorithms in a common setting and thus clarifies the relationships among them. The three classes of algorithms, viz., O(n), O(n^2), and O(n^3), for the solution of the dynamics problem are investigated. Researchers begin with the derivation of O(n^3) algorithms based on the explicit computation of the mass matrix, which provides insight into the underlying basis of the O(n) algorithms. From a computational perspective, the optimal choice of a coordinate frame for the projection of the intrinsic equations is discussed and the serial computational complexity of the different algorithms is evaluated. The three classes of algorithms are also analyzed for suitability for parallel processing. It is shown that the problem belongs to the class NC and that the time and processor bounds are O(log^2 n) and O(n^4), respectively. However, the algorithm that achieves the above bounds is not stable. Researchers show that the fastest stable parallel algorithm achieves a computational complexity of O(n) with O(n^2) processors, and results from the parallelization of the O(n^3) serial algorithm.

Fast forward kinematics algorithm for real-time and high-precision control of the 3-RPS parallel mechanism

NASA Astrophysics Data System (ADS)

Wang, Yue; Yu, Jingjun; Pei, Xu

2018-06-01

A new forward kinematics algorithm for the mechanism of 3-RPS (R: Revolute; P: Prismatic; S: Spherical) parallel manipulators is proposed in this study. This algorithm is primarily based on the special geometric conditions of the 3-RPS parallel mechanism, and it eliminates the errors produced by parasitic motions to improve and ensure accuracy. Specifically, the errors can be less than 10^-6. In this method, only the group of solutions that is consistent with the actual situation of the platform is obtained rapidly. This algorithm substantially improves calculation efficiency because the selected initial values are reasonable, and all the formulas in the calculation are analytical. This novel forward kinematics algorithm is well suited for real-time and high-precision control of the 3-RPS parallel mechanism.
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.

PubMed

Kundeti, Vamsi K; Rajasekaran, Sanguthevar; Dinh, Hieu; Vaughn, Matthew; Thapar, Vishal

2010-11-15

Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet). In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ((n log(n/B)) / (B log(M/B))) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster--both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. 
The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bi-directed de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.
Real-time implementations of image segmentation algorithms on shared memory multicore architecture: a survey (Conference Presentation)

NASA Astrophysics Data System (ADS)

Akil, Mohamed

2017-05-01

Real-time processing is becoming more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks in image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool and a very data-intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper surveys the approaches for parallel implementation of sequential watershed algorithms on multicore general purpose CPUs: homogeneous multicore processors with shared memory. To achieve an efficient parallel implementation, it is necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we give a comparison of various parallelizations of sequential watershed algorithms on shared memory multicore architecture. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare OpenMP (an application programming interface for multiprocessing) with Pthreads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
Parallel solution of sparse one-dimensional dynamic programming problems

NASA Technical Reports Server (NTRS)

Nicol, David M.

1989-01-01

Parallel computation offers the potential for quickly solving large computational problems. However, it is often a non-trivial task to effectively use parallel computers. Solution methods must sometimes be reformulated to exploit parallelism; the reformulations are often more complex than their slower serial counterparts. We illustrate these points by studying the parallelization of sparse one-dimensional dynamic programming problems, those which do not obviously admit substantial parallelization. We propose a new method for parallelizing such problems, develop analytic models which help us to identify problems which parallelize well, and compare the performance of our algorithm with existing algorithms on a multiprocessor.
Soft-output decoding algorithms in iterative decoding of turbo codes

NASA Technical Reports Server (NTRS)

Benedetto, S.; Montorsi, G.; Divsalar, D.; Pollara, F.

1996-01-01

In this article, we present two versions of a simplified maximum a posteriori decoding algorithm. The algorithms work in a sliding window form, like the Viterbi algorithm, and can thus be used to decode continuously transmitted sequences obtained by parallel concatenated codes, without requiring code trellis termination. A heuristic explanation is also given of how to embed the maximum a posteriori algorithms into the iterative decoding of parallel concatenated codes (turbo codes). The performances of the two algorithms are compared on the basis of a powerful rate 1/3 parallel concatenated code. Basic circuits to implement the simplified a posteriori decoding algorithm using lookup tables, and two further approximations (linear and threshold), with a very small penalty, to eliminate the need for lookup tables are proposed.
A high-performance spatial database based approach for pathology imaging algorithm evaluation

PubMed Central

Wang, Fusheng; Kong, Jun; Gao, Jingjing; Cooper, Lee A.D.; Kurc, Tahsin; Zhou, Zhengwen; Adler, David; Vergara-Niedermayr, Cristobal; Katigbak, Bryan; Brat, Daniel J.; Saltz, Joel H.

2013-01-01

Background: Algorithm evaluation provides a means to characterize variability across image analysis algorithms, validate algorithms by comparison with human annotations, combine results from multiple algorithms for performance improvement, and facilitate algorithm sensitivity studies. The sizes of images and image analysis results in pathology image analysis pose significant challenges in algorithm evaluation. We present an efficient parallel spatial database approach to model, normalize, manage, and query large volumes of analytical image result data. This provides an efficient platform for algorithm evaluation. Our experiments with a set of brain tumor images demonstrate the application, scalability, and effectiveness of the platform. Context: The paper describes an approach and platform for evaluation of pathology image analysis algorithms. The platform facilitates algorithm evaluation through a high-performance database built on the Pathology Analytic Imaging Standards (PAIS) data model. Aims: (1) Develop a framework to support algorithm evaluation by modeling and managing analytical results and human annotations from pathology images; (2) Create a robust data normalization tool for converting, validating, and fixing spatial data from algorithm or human annotations; (3) Develop a set of queries to support data sampling and result comparisons; (4) Achieve high performance computation capacity via a parallel data management infrastructure, parallel data loading and spatial indexing optimizations in this infrastructure. Materials and Methods: We have considered two scenarios for algorithm evaluation: (1) algorithm comparison where multiple result sets from different methods are compared and consolidated; and (2) algorithm validation where algorithm results are compared with human annotations. We have developed a spatial normalization toolkit to validate and normalize spatial boundaries produced by image analysis algorithms or human annotations. 
The validated data were formatted based on the PAIS data model and loaded into a spatial database. To support efficient data loading, we have implemented a parallel data loading tool that takes advantage of multi-core CPUs to accelerate data injection. The spatial database manages both geometric shapes and image features or classifications, and enables spatial sampling, result comparison, and result aggregation through expressive structured query language (SQL) queries with spatial extensions. To provide scalable and efficient query support, we have employed a shared nothing parallel database architecture, which distributes data homogenously across multiple database partitions to take advantage of parallel computation power and implements spatial indexing to achieve high I/O throughput. Results: Our work proposes a high performance, parallel spatial database platform for algorithm validation and comparison. This platform was evaluated by storing, managing, and comparing analysis results from a set of brain tumor whole slide images. The tools we develop are open source and available to download. Conclusions: Pathology image algorithm validation and comparison are essential to iterative algorithm development and refinement. One critical component is the support for queries involving spatial predicates and comparisons. In our work, we develop an efficient data model and parallel database approach to model, normalize, manage and query large volumes of analytical image result data. Our experiments demonstrate that the data partitioning strategy and the grid-based indexing result in good data distribution across database nodes and reduce I/O overhead in spatial join queries through parallel retrieval of relevant data and quick subsetting of datasets. The set of tools in the framework provide a full pipeline to normalize, load, manage and query analytical results for algorithm evaluation. PMID:23599905
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms

PubMed Central

He, Li; Zheng, Hao; Wang, Lei

2017-01-01

Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering place high demands on the computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, the General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, incremental clustering algorithms face a dilemma between clustering accuracy and parallelism when they are powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering like evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that the accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experiment results verified theoretical conclusions. PMID:29123546
Parallelization strategies for continuum-generalized method of moments on the multi-thread systems

NASA Astrophysics Data System (ADS)

Bustamam, A.; Handhika, T.; Ernastuti, Kerami, D.

2017-07-01

Continuum-Generalized Method of Moments (C-GMM) addresses the shortfall of the Generalized Method of Moments (GMM), which is not as efficient as the Maximum Likelihood estimator, by using a continuum set of moment conditions in a GMM framework. However, this computation takes a very long time because of the optimization of the regularization parameter. Unfortunately, these calculations are processed sequentially, whereas all modern computers are now supported by hierarchical memory systems and hyperthreading technology, which allow for parallel computing. This paper aims to speed up the calculation process of C-GMM by designing a parallel algorithm for C-GMM on multi-thread systems. First, parallel regions are detected in the original C-GMM algorithm. There are two parallel regions in the original C-GMM algorithm that contribute significantly to the reduction of computational time: the outer loop and the inner loop. Furthermore, this parallel algorithm is implemented with a standard shared-memory application programming interface, i.e. Open Multi-Processing (OpenMP). The experiment shows that the outer-loop parallelization is the best strategy for any number of observations.
Spiral waves characterization: Implications for an automated cardiodynamic tissue characterization.

PubMed

Alagoz, Celal; Cohen, Andrew R; Frisch, Daniel R; Tunç, Birkan; Phatharodom, Saran; Guez, Allon

2018-07-01

Spiral waves are phenomena observed in cardiac tissue, especially during fibrillatory activities. Spiral waves are revealed through in-vivo and in-vitro studies using high density mapping that requires a special experimental setup. Also, in-silico spiral wave analysis and classification is performed using membrane potentials from the entire tissue. In this study, we report a characterization approach that identifies spiral wave behaviors using intracardiac electrogram (EGM) readings obtained with commonly used multipolar diagnostic catheters that perform localized but high-resolution readings. Specifically, the algorithm is designed to distinguish between stationary, meandering, and break-up rotors. The clustering and classification algorithms are tested on simulated data produced using a phenomenological 2D model of cardiac propagation. For EGM measurements, unipolar-bipolar EGM readings from various locations on tissue using two catheter types are modeled. The distance between spiral behaviors is assessed using normalized compression distance (NCD), an information theoretical distance. NCD is a universal metric in the sense that it is based solely on the compressibility of the dataset and does not require feature extraction. We also introduce normalized FFT distance (NFFTD), where compressibility is replaced with an FFT parameter. Overall, outstanding clustering performance was achieved across varying EGM reading configurations. We found that NCD was more effective than NFFTD at distinguishing spiral behaviors. We demonstrated that distinct spiral activity identification on a behaviorally heterogeneous tissue is also possible. This report demonstrates a theoretical validation of clustering and classification approaches that provide an automated mapping from EGM signals to assessment of spiral wave behaviors and hence offers a potential mapping and analysis framework for cardiac tissue wavefront propagation patterns. Copyright © 2018 Elsevier B.V. All rights reserved.
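For readers unfamiliar with NCD, the following is a minimal sketch of the commonly used definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed size. The choice of zlib as the compressor and the function names are assumptions made for illustration; the abstract does not say which compressor the authors used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Two recordings that share structure compress well together and give a small distance, while unrelated recordings give a value near 1. No features need to be extracted from the signals beforehand.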
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains.

PubMed

Jha, Ashwani; Flurchick, K M; Bikdash, Marwan; Kc, Dukka B

2016-01-01

Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10-15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors.
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains

PubMed Central

Jha, Ashwani; Flurchick, K. M.; Bikdash, Marwan

2016-01-01

Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10–15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors. PMID:27747230
A Fast parallel tridiagonal algorithm for a class of CFD applications

NASA Technical Reports Server (NTRS)

Moitra, Stuti; Sun, Xian-He

1996-01-01

The parallel diagonal dominant (PDD) algorithm is an efficient tridiagonal solver. This paper presents for study a variation of the PDD algorithm, the reduced PDD algorithm. The new algorithm maintains the minimum communication provided by the PDD algorithm, but has a reduced operation count. The PDD algorithm also has a smaller operation count than the conventional sequential algorithm for many applications. Accuracy analysis is provided for the reduced PDD algorithm for symmetric Toeplitz tridiagonal (STT) systems. Implementation results on Langley's Intel Paragon and IBM SP2 show that both the PDD and reduced PDD algorithms are efficient and scalable.
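As a point of reference, the sketch below shows a sequential tridiagonal solve by forward elimination and back substitution (the Thomas algorithm). It is an assumption that this is the "conventional sequential algorithm" the abstract refers to, and the code does not implement PDD or reduced PDD. No pivoting is performed, which is acceptable for the diagonally dominant systems discussed here.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with sub-diagonal `lower`, main diagonal
    `diag`, and super-diagonal `upper`, all of length n.
    lower[0] and upper[n - 1] are ignored."""
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: each row depends on the previous one,
    # which is the sequential dependency parallel solvers must break.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

A symmetric Toeplitz tridiagonal system, such as one with 4 on the diagonal and 1 on both off-diagonals, is a convenient test case because it is strictly diagonally dominant.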
Parallel language constructs for tensor product computations on loosely coupled architectures

NASA Technical Reports Server (NTRS)

Mehrotra, Piyush; Vanrosendale, John

1989-01-01

Distributed memory architectures offer high levels of performance and flexibility, but have proven awkward to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. Tensor product array computations are focused on along with a simple but important class of numerical algorithms. The problem of programming 1-D kernel routines, such as parallel tridiagonal solvers, is focused on first, and then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.
Parallel processing in finite element structural analysis

NASA Technical Reports Server (NTRS)

Noor, Ahmed K.

1987-01-01

A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on parallel numerical algorithms, performance evaluation of machines and algorithms, and parallelism in finite element computations. A computational strategy is proposed for maximizing the degree of parallelism at different levels of the finite element analysis process including: 1) formulation level (through the use of mixed finite element models); 2) analysis level (through additive decomposition of the different arrays in the governing equations into the contributions to a symmetrized response plus correction terms); 3) numerical algorithm level (through the use of operator splitting techniques and application of iterative processes); and 4) implementation level (through the effective combination of vectorization, multitasking and microtasking, whenever available).
Parallel Algorithms for the Exascale Era

DOE Office of Scientific and Technical Information (OSTI.GOV)

Robey, Robert W.

New parallel algorithms are needed to reach the Exascale level of parallelism with millions of cores. We look at some of the research developed by students in projects at LANL. The research blends ideas from the early days of computing while weaving in the fresh approach brought by students new to the field of high performance computing. We look at reproducibility of global sums and why it is important to parallel computing. Next we look at how the concept of hashing has led to the development of more scalable algorithms suitable for next-generation parallel computers. Nearly all of this work has been done by undergraduates and published in leading scientific journals.
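The reproducibility problem for global sums can be demonstrated in a few lines. The sketch below is illustrative only and is not the technique from the LANL work, which the abstract does not describe. It assumes a reduction that splits the data into chunks (standing in for processors), and it uses exact rational arithmetic as a slow but simple way to make the result independent of the chunking. The function names are hypothetical.

```python
import math
from fractions import Fraction


def naive_sum(values):
    """Left-to-right floating-point accumulation; the result depends on order."""
    total = 0.0
    for v in values:
        total += v
    return total


def naive_chunked_sum(values, num_chunks):
    """Per-chunk partial sums combined afterwards, as in a parallel reduction."""
    size = math.ceil(len(values) / num_chunks)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


def exact_chunked_sum(values, num_chunks):
    """Same reduction shape, but partial sums are exact rationals and
    rounding happens once at the end, so chunking cannot change the result."""
    size = math.ceil(len(values) / num_chunks)
    partials = [
        sum((Fraction(v) for v in values[i:i + size]), Fraction(0))
        for i in range(0, len(values), size)
    ]
    return float(sum(partials, Fraction(0)))
```

With inputs that mix very large and very small magnitudes, `naive_chunked_sum` can return different values for different chunk counts, whereas `exact_chunked_sum` always returns the correctly rounded sum. Production codes use cheaper techniques than rational arithmetic; this version is chosen only because its correctness is easy to see.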
Bioinformatics algorithm based on a parallel implementation of a machine learning approach using transducers

NASA Astrophysics Data System (ADS)

Roche-Lima, Abiel; Thulasiram, Ruppa K.

2012-02-01

Finite automata, in which each transition is augmented with an output label in addition to the familiar input label, are considered finite-state transducers. Transducers have been used to analyze some fundamental issues in bioinformatics. Weighted finite-state transducers have been proposed for pairwise alignments of DNA and protein sequences, as well as for developing kernels for computational biology. Machine learning algorithms for conditional transducers have been implemented and used for DNA sequence analysis. Transducer learning algorithms are based on conditional probability computation, which is carried out using techniques such as pair-database creation, normalization (with Maximum-Likelihood normalization) and parameter optimization (with Expectation-Maximization, EM). These techniques are intrinsically costly to compute, and even more so when applied to bioinformatics, because the database sizes are large. In this work, we describe a parallel implementation of an algorithm to learn conditional transducers using these techniques. The algorithm is oriented to bioinformatics applications, such as alignments, phylogenetic trees, and other genome evolution studies. Several experiments were carried out using the parallel and sequential algorithms on Westgrid (specifically, on the Breeze cluster). The results show that our parallel algorithm is scalable, because execution times are reduced considerably when the data size parameter is increased. Another experiment varies the precision parameter. In this case, we obtain smaller execution times using the parallel algorithm. Finally, the number of threads used to execute the parallel algorithm on the Breezy cluster is varied. In this last experiment, the speedup increases considerably when more threads are used; however, it levels off for 16 or more threads.
Parallel image compression

NASA Technical Reports Server (NTRS)

Reif, John H.

1987-01-01

A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless text compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.
The remote sensing image segmentation mean shift algorithm parallel processing based on MapReduce

NASA Astrophysics Data System (ADS)

Chen, Xi; Zhou, Liqing

2015-12-01

With the development of satellite remote sensing technology and the growth of remote sensing image data, traditional remote sensing image segmentation technology cannot meet the processing and storage requirements of massive remote sensing images. This article applies cloud computing and parallel computing technology to the remote sensing image segmentation process and builds a cheap and efficient computer cluster system that uses parallel processing to implement the MeanShift algorithm for remote sensing image segmentation based on the MapReduce model. The approach not only ensures the quality of remote sensing image segmentation but also improves segmentation speed and better meets real-time requirements. The MapReduce-based parallel MeanShift algorithm for remote sensing image segmentation is of practical significance and value.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Chen, Chao; Pouransari, Hadi; Rajamanickam, Sivasankaran

We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by every processor. We also provide various numerical results to demonstrate the versatility and scalability of the parallel algorithm.
Performance Comparison of a Set of Periodic and Non-Periodic Tridiagonal Solvers on SP2 and Paragon Parallel Computers

NASA Technical Reports Server (NTRS)

Sun, Xian-He; Moitra, Stuti

1996-01-01

Various tridiagonal solvers have been proposed in recent years for different parallel platforms. In this paper, the performance of three tridiagonal solvers, namely, the parallel partition LU algorithm, the parallel diagonal dominant algorithm, and the reduced diagonal dominant algorithm, is studied. These algorithms are designed for distributed-memory machines and are tested on an Intel Paragon and an IBM SP2. Measured results are reported in terms of execution time and speedup. Analytical studies are conducted for different communication topologies and for different tridiagonal systems. The measured results match the analytical results closely. In addition to addressing implementation issues, performance considerations such as problem sizes and models of speedup are also discussed.

Parallelization and implementation of approximate root isolation for nonlinear system by Monte Carlo

NASA Astrophysics Data System (ADS)

Khosravi, Ebrahim

1998-12-01

This dissertation solves a fundamental problem of isolating the real roots of nonlinear systems of equations by a Monte Carlo algorithm published by Bush Jones. This algorithm requires only function values and can be applied readily to complicated systems of transcendental functions. The implementation of this sequential algorithm provides scientists with the means to utilize function analysis in mathematics or other fields of science. The algorithm, however, is so computationally intensive that the system is limited to a very small set of variables, which makes it infeasible for large systems of equations. A computational technique was also needed for investigating a methodology for preventing the algorithm from converging to the same root along different paths of computation. The research provides techniques for improving the efficiency and correctness of the algorithm. The sequential algorithm for this technique was corrected and a parallel algorithm is presented. This parallel method has been formally analyzed and is compared with other known methods of root isolation. The effectiveness, efficiency, and enhanced overall performance of the parallel processing of the program in comparison to sequential processing are discussed. The message passing model was used for this parallel processing, and it is presented and implemented on Intel/860 MIMD architecture. The parallel processing proposed in this research has been implemented in an ongoing high energy physics experiment: this algorithm has been used to track neutrinos in a super K detector. This experiment is located in Japan, and data can be processed on-line or off-line, locally or remotely.
Implementing a Parallel Image Edge Detection Algorithm Based on the Otsu-Canny Operator on the Hadoop Platform

PubMed Central

Wang, Min; Tian, Yun

2018-01-01

The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the clear advantage of our method. The algorithm proposed in this study demonstrates both better edge detection performance and improved time performance. PMID:29861711
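The Otsu step mentioned in this abstract selects a threshold by maximizing the between-class variance of a gray-level histogram. The sketch below shows that selection in plain Python. How the authors derive Canny's two thresholds from the Otsu value is not stated in the abstract, so the `canny_thresholds` helper and its `ratio` parameter are hypothetical and follow a common heuristic (low threshold as a fixed fraction of the high one).

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximizes between-class variance for
    the classes [0..t] and [t+1..]. `histogram[g]` is the pixel count at level g."""
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_t, best_score = 0, -1.0
    w0 = 0    # pixel count of the lower class
    sum0 = 0  # sum of gray levels in the lower class
    for t, count in enumerate(histogram):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty; variance is undefined
        mean0 = sum0 / w0
        mean1 = (weighted_total - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def canny_thresholds(histogram, ratio=0.5):
    """Hypothetical helper: use the Otsu level as the high threshold and a
    fixed fraction of it as the low threshold (an assumed heuristic)."""
    high = otsu_threshold(histogram)
    return ratio * high, high
```

In a MapReduce setting each image can be handled by an independent map task, since the histogram and threshold of one image do not depend on any other image.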
Mapped Landmark Algorithm for Precision Landing

NASA Technical Reports Server (NTRS)

Johnson, Andrew; Ansar, Adnan; Matthies, Larry

2007-01-01

A report discusses a computer vision algorithm for position estimation to enable precision landing during planetary descent. The Descent Image Motion Estimation System for the Mars Exploration Rovers has been used as a starting point for creating code for precision, terrain-relative navigation during planetary landing. The algorithm is designed to be general because it handles images taken at different scales and resolutions relative to the map, and can produce mapped landmark matches for any planetary terrain of sufficient texture. These matches provide a measurement of horizontal position relative to a known landing site specified on the surface map. Multiple mapped landmarks generated per image allow for automatic detection and elimination of bad matches. Attitude and position can be generated from each image; this image-based attitude measurement can be used by the onboard navigation filter to improve the attitude estimate, which will improve the position estimates. The algorithm uses normalized correlation of grayscale images, producing precise, sub-pixel images. The algorithm has been broken into two sub-algorithms: (1) FFT Map Matching (see figure), which matches a single large template by correlation in the frequency domain, and (2) Mapped Landmark Refinement, which matches many small templates by correlation in the spatial domain. Each relies on feature selection, the homography transform, and 3D image correlation. The algorithm is implemented in C++ and is rated at Technology Readiness Level (TRL) 4.
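Correlation in the frequency domain, as used by the FFT Map Matching step, can be sketched in one dimension. The code below is a simplified illustration, not the flight algorithm: it assumes real-valued signals of equal power-of-two length, uses plain (unnormalized) circular correlation rather than the normalized correlation described in the report, and includes a small recursive radix-2 FFT so that it has no dependencies. The function names are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two.
    The inverse transform is left unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def best_circular_shift(signal, template):
    """Return the circular shift s that maximizes
    sum_m signal[(m + s) % n] * template[m]."""
    n = len(signal)
    f_signal = fft([complex(v) for v in signal])
    f_template = fft([complex(v) for v in template])
    # Correlation theorem: multiply by the conjugate spectrum of the template.
    product = [a * b.conjugate() for a, b in zip(f_signal, f_template)]
    correlation = [v.real / n for v in fft(product, inverse=True)]
    return max(range(n), key=lambda s: correlation[s])
```

The frequency-domain route evaluates all n candidate shifts in O(n log n) operations instead of O(n^2), which is why it suits matching a single large template.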
A parallel adaptive mesh refinement algorithm

NASA Technical Reports Server (NTRS)

Quirk, James J.; Hanebutte, Ulf R.

1993-01-01

Over recent years, Adaptive Mesh Refinement (AMR) algorithms which dynamically match the local resolution of the computational grid to the numerical solution being sought have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. This method is built upon just a few simple message passing routines and so it may be implemented across a broad class of MIMD machines. Moreover, the method of parallelization is such that the original serial code is left virtually intact, and so we are left with just a single product to support. The importance of this fact should not be underestimated given the size and complexity of the original algorithm.
A New Augmentation Based Algorithm for Extracting Maximal Chordal Subgraphs.

PubMed

Bhowmick, Sanjukta; Chen, Tzu-Yi; Halappanavar, Mahantesh

2015-02-01

A graph is chordal if every cycle of length greater than three contains an edge between non-adjacent vertices. Chordal graphs are of interest both theoretically, since they admit polynomial time solutions to a range of NP-hard graph problems, and practically, since they arise in many applications including sparse linear algebra, computer vision, and computational biology. A maximal chordal subgraph is a chordal subgraph that is not a proper subgraph of any other chordal subgraph. Existing algorithms for computing maximal chordal subgraphs depend on dynamically ordering the vertices, which is an inherently sequential process and therefore limits the algorithms' parallelizability. In this paper we explore techniques to develop a scalable parallel algorithm for extracting a maximal chordal subgraph. We demonstrate that an earlier attempt at developing a parallel algorithm may induce a non-optimal vertex ordering and is therefore not guaranteed to terminate with a maximal chordal subgraph. We then give a new algorithm that first computes and then repeatedly augments a spanning chordal subgraph. After proving that the algorithm terminates with a maximal chordal subgraph, we then demonstrate that this algorithm is more amenable to parallelization and that the parallel version also terminates with a maximal chordal subgraph. That said, the complexity of the new algorithm is higher than that of the previous parallel algorithm, although the earlier algorithm computes a chordal subgraph which is not guaranteed to be maximal. We experimented with our augmentation-based algorithm on both synthetic and real-world graphs. We provide scalability results and also explore the effect of different choices for the initial spanning chordal subgraph on both the running time and on the number of edges in the maximal chordal subgraph.
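The definition of chordality at the start of this abstract can be checked with a classical sequential test: run a maximum cardinality search and verify that the reverse of the visit order is a perfect elimination ordering. The sketch below implements that test, not the paper's augmentation-based extraction algorithm, and its reliance on a vertex ordering illustrates the kind of sequential dependency the abstract describes. The adjacency representation and function name are assumptions.

```python
def is_chordal(adj):
    """adj maps each vertex to the set of its neighbours (simple undirected graph)."""
    # Maximum cardinality search: repeatedly visit the unvisited vertex
    # with the most already-visited neighbours.
    weight = {v: 0 for v in adj}
    unvisited = set(adj)
    order = []
    while unvisited:
        v = max(unvisited, key=lambda u: weight[u])
        order.append(v)
        unvisited.remove(v)
        for w in adj[v]:
            if w in unvisited:
                weight[w] += 1
    position = {v: i for i, v in enumerate(order)}
    # The reverse visit order is a perfect elimination ordering iff the graph
    # is chordal. For each vertex, its earlier-visited neighbours other than
    # the most recent one must all be adjacent to that most recent one.
    for v in order:
        earlier = [w for w in adj[v] if position[w] < position[v]]
        if not earlier:
            continue
        parent = max(earlier, key=lambda w: position[w])
        for w in earlier:
            if w != parent and w not in adj[parent]:
                return False
    return True
```

A 4-cycle fails the test, and adding one diagonal edge makes it pass, which matches the definition given in the abstract.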
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods

NASA Astrophysics Data System (ADS)

Xie, Lang; Luo, Yi-han; Bao, Qi-liang

2013-08-01

GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image with hardly any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and the GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Synchronization Of Parallel Discrete Event Simulations

NASA Technical Reports Server (NTRS)

Steinman, Jeffrey S.

1992-01-01

Adaptive, parallel, discrete-event-simulation-synchronization algorithm, Breathing Time Buckets, developed in Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) operating system. Algorithm allows parallel simulations to process events optimistically in fluctuating time cycles that naturally adapt while simulation in progress. Combines best of optimistic and conservative synchronization strategies while avoiding major disadvantages. Algorithm processes events optimistically in time cycles adapting while simulation in progress. Well suited for modeling communication networks, for large-scale war games, for simulated flights of aircraft, for simulations of computer equipment, for mathematical modeling, for interactive engineering simulations, and for depictions of flows of information.
Characterization of robotics parallel algorithms and mapping onto a reconfigurable SIMD machine

NASA Technical Reports Server (NTRS)

Lee, C. S. G.; Lin, C. T.

1989-01-01

The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and analyzed. Their characteristics are identified and a scheme on the mapping of these algorithms to a reconfigurable parallel architecture is presented. Based on the characteristics including type of parallelism, degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirement, it is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. Moreover, they are well-suited to be implemented on a single-instruction-stream multiple-data-stream (SIMD) computer with reconfigurable interconnection network. The model of a reconfigurable dual network SIMD machine with internal direct feedback is introduced. A systematic procedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are illustrated and discussed.
Efficient sequential and parallel algorithms for record linkage.

PubMed

Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar

2014-01-01

Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Our sequential and parallel algorithms have been tested on a real dataset of 1,083,878 records and synthetic datasets ranging in size from 50,000 to 9,000,000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
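The pipeline outlined in this abstract (remove identical records, link similar records in a graph, then take connected components) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain `sorted(set(...))` stands in for the radix sort, an edit-distance threshold of 1 stands in for their similarity test, all pairs are compared by brute force, and the function names and sample records are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def link_records(records, threshold=1):
    # Step 1: eliminate identical records before any further processing.
    unique = sorted(set(records))
    parent = list(range(len(unique)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 2: add an edge between every pair of similar records.
    for i, j in combinations(range(len(unique)), 2):
        if edit_distance(unique[i], unique[j]) <= threshold:
            parent[find(i)] = find(j)

    # Step 3: each connected component is one linked entity.
    groups = {}
    for i, rec in enumerate(unique):
        groups.setdefault(find(i), []).append(rec)
    return sorted(groups.values())


clusters = link_records([
    "john smith", "jon smith", "john smith",
    "mary jones", "mary jonas", "alan turing",
])
```

The all-pairs comparison is quadratic and is only meant to make the graph step concrete; a real system would restrict the candidate pairs first.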
New Parallel Algorithms for Landscape Evolution Model

NASA Astrophysics Data System (ADS)

Jin, Y.; Zhang, H.; Shi, Y.

2017-12-01

Most landscape evolution models (LEM) developed in the last two decades solve the diffusion equation to simulate the transportation of surface sediments. This numerical approach is difficult to parallelize due to the computation of drainage area for each node, which needs a huge amount of communication if run in parallel. In order to overcome this difficulty, we developed two parallel algorithms for LEM with a stream net. One algorithm handles the partition of the grid with traditional methods and applies an efficient global reduction algorithm to do the computation of drainage areas and transport rates for the stream net; the other algorithm is based on a new partition algorithm, which partitions the nodes in catchments between processes first, and then partitions the cells according to the partition of nodes. Both methods focus on decreasing communication between processes and take advantage of massive computing techniques, and numerical experiments show that they are both adequate to handle large scale problems with millions of cells. We implemented the two algorithms in our program based on the widely used finite element library deal.II, so that it can be easily coupled with ASPECT.
Parallel digital forensics infrastructure.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Liebrock, Lorie M.; Duggan, David Patrick

2009-10-01

This report documents the architecture and implementation of a Parallel Digital Forensics infrastructure. This infrastructure is necessary for supporting the design, implementation, and testing of new classes of parallel digital forensics tools. Digital Forensics has become extremely difficult with data sets of one terabyte and larger. The only way to overcome the processing time of these large sets is to identify and develop new parallel algorithms for performing the analysis. To support algorithm research, a flexible base infrastructure is required. A candidate architecture for this base infrastructure was designed, instantiated, and tested by this project, in collaboration with New Mexico Tech. Previous infrastructures were not designed and built specifically for the development and testing of parallel algorithms. With the size of forensics data sets only expected to increase significantly, this type of infrastructure support is necessary for continued research in parallel digital forensics. This report documents the implementation of the parallel digital forensics (PDF) infrastructure architecture and implementation.
Enhancing PC Cluster-Based Parallel Branch-and-Bound Algorithms for the Graph Coloring Problem

NASA Astrophysics Data System (ADS)

Taoka, Satoshi; Takafuji, Daisuke; Watanabe, Toshimasa

A branch-and-bound algorithm (BB for short) is the most general technique to deal with various combinatorial optimization problems. Even if it is used, computation time is likely to increase exponentially. So we consider its parallelization to reduce it. It has been reported that the computation time of a parallel BB heavily depends upon node-variable selection strategies. And, in case of a parallel BB, it is also necessary to prevent increase in communication time. So, it is important to pay attention to how many and what kind of nodes are to be transferred (called sending-node selection strategy). In this paper, for the graph coloring problem, we propose some sending-node selection strategies for a parallel BB algorithm by adopting MPI for parallelization and experimentally evaluate how these strategies affect computation time of a parallel BB on a PC cluster network.
Data communications for a collective operation in a parallel active messaging interface of a parallel computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Faraj, Daniel A.

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Data communications for a collective operation in a parallel active messaging interface of a parallel computer

DOEpatents

Faraj, Daniel A

2013-07-16

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
A simplified focusing and astigmatism correction method for a scanning electron microscope

NASA Astrophysics Data System (ADS)

Lu, Yihua; Zhang, Xianmin; Li, Hai

2018-01-01

Defocus and astigmatism can lead to blurred images and poor resolution. This paper presents a simplified method for focusing and astigmatism correction of a scanning electron microscope (SEM). The method consists of two steps. In the first step, the fast Fourier transform (FFT) of the SEM image is performed and the FFT is subsequently processed with a threshold to achieve a suitable result. In the second step, the threshold FFT is used for ellipse fitting to determine the presence of defocus and astigmatism. The proposed method clearly provides the relationships between the defocus, the astigmatism and the direction of stretching of the FFT, and it can determine the astigmatism in a single image. Experimental studies are conducted to demonstrate the validity of the proposed method.
Using Hadoop MapReduce for Parallel Genetic Algorithms: A Comparison of the Global, Grid and Island Models.

PubMed

Ferrucci, Filomena; Salza, Pasquale; Sarro, Federica

2017-06-29

The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies to develop parallel algorithms. Based on the fact that parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, the parallel GAs solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time. Moreover, we are interested in understanding which PGA model can be most effective among the global, grid, and island models. We empirically assessed the performance of these three parallel models with respect to a sequential GA on a software engineering problem, evaluating the execution time and the achieved speedup. We also analysed the behaviour of the parallel models in relation to the overhead produced by the use of Hadoop MapReduce and the GAs' computational effort, which gives a more machine-independent measure of these algorithms. We exploited three problem instances to differentiate the computation load and three cluster configurations based on 2, 4, and 8 parallel nodes. Moreover, we estimated the costs of the execution of the experimentation on a potential cloud infrastructure, based on the pricing of the major commercial cloud providers. The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters. Using 2, 4, and 8 nodes, the island model achieves an average speedup over the three datasets of 1.8, 3.4, and 7.0 times, respectively. Hadoop MapReduce has a set of different constraints that need to be considered during the design and the implementation of parallel algorithms. The overhead of data store (i.e., HDFS) accesses, communication, and latency requires solutions that reduce data store operations. For this reason, the island model is more suitable for PGAs than the global and grid model, also in terms of costs when executed on a commercial cloud provider.
An Ensemble Deep Convolutional Neural Network Model with Improved D-S Evidence Fusion for Bearing Fault Diagnosis.

PubMed

Li, Shaobo; Liu, Guokai; Tang, Xianghong; Lu, Jianguang; Hu, Jianjun

2017-07-28

Intelligent machine health monitoring and fault diagnosis are becoming increasingly important for modern manufacturing industries. Current fault diagnosis approaches mostly depend on expert-designed features for building prediction models. In this paper, we proposed IDSCNN, a novel bearing fault diagnosis algorithm based on ensemble deep convolutional neural networks and an improved Dempster-Shafer theory based evidence fusion. The convolutional neural networks take the root mean square (RMS) maps from the FFT (Fast Fourier Transformation) features of the vibration signals from two sensors as inputs. The improved D-S evidence theory is implemented via distance matrix from evidences and modified Gini Index. Extensive evaluations of the IDSCNN on the Case Western Reserve Dataset showed that our IDSCNN algorithm can achieve better fault diagnosis performance than existing machine learning methods by fusing complementary or conflicting evidences from different models and sensors and adapting to different load conditions.
An Ensemble Deep Convolutional Neural Network Model with Improved D-S Evidence Fusion for Bearing Fault Diagnosis

PubMed Central

Li, Shaobo; Liu, Guokai; Tang, Xianghong; Lu, Jianguang

2017-01-01

Intelligent machine health monitoring and fault diagnosis are becoming increasingly important for modern manufacturing industries. Current fault diagnosis approaches mostly depend on expert-designed features for building prediction models. In this paper, we proposed IDSCNN, a novel bearing fault diagnosis algorithm based on ensemble deep convolutional neural networks and an improved Dempster–Shafer theory based evidence fusion. The convolutional neural networks take the root mean square (RMS) maps from the FFT (Fast Fourier Transformation) features of the vibration signals from two sensors as inputs. The improved D-S evidence theory is implemented via distance matrix from evidences and modified Gini Index. Extensive evaluations of the IDSCNN on the Case Western Reserve Dataset showed that our IDSCNN algorithm can achieve better fault diagnosis performance than existing machine learning methods by fusing complementary or conflicting evidences from different models and sensors and adapting to different load conditions. PMID:28788099
Update on Development of Mesh Generation Algorithms in MeshKit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jain, Rajeev; Vanderzee, Evan; Mahadevan, Vijay

2015-09-30

MeshKit uses a graph-based design for coding all its meshing algorithms, which includes the Reactor Geometry (and mesh) Generation (RGG) algorithms. This report highlights the developmental updates of all the algorithms, results and future work. Parallel versions of algorithms, documentation and performance results are reported. RGG GUI design was updated to incorporate new features requested by the users; boundary layer generation and parallel RGG support were added to the GUI. Key contributions to the release, upgrade and maintenance of other SIGMA1 libraries (CGM and MOAB) were made. Several fundamental meshing algorithms for creating a robust parallel meshing pipeline in MeshKit are under development. Results and current status of automated, open-source and high quality nuclear reactor assembly mesh generation algorithms such as trimesher, quadmesher, interval matching and multi-sweeper are reported.
Parallel algorithms for computation of the manipulator inertia matrix

NASA Technical Reports Server (NTRS)

Amin-Javaheri, Masoud; Orin, David E.

1989-01-01

The development of an O(log_2 N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log_2 N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying size, and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log_2 N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.

On the impact of communication complexity in the design of parallel numerical algorithms

NASA Technical Reports Server (NTRS)

Gannon, D.; Vanrosendale, J.

1984-01-01

This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
An efficient parallel algorithm for the solution of a tridiagonal linear system of equations

NASA Technical Reports Server (NTRS)

Stone, H. S.

1971-01-01

Tridiagonal linear systems of equations are solved on conventional serial machines in a time proportional to N, where N is the number of equations. The conventional algorithms do not lend themselves directly to parallel computations on computers of the ILLIAC IV class, in the sense that they appear to be inherently serial. An efficient parallel algorithm is presented in which computation time grows as log_2 N. The algorithm is based on recursive doubling solutions of linear recurrence relations, and can be used to solve recurrence relations of all orders.
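Recursive doubling, the building block named in this abstract, can be illustrated on the simplest case. The sketch below is not the tridiagonal solver itself: it assumes a first-order recurrence x_i = a_i * x_(i-1) + b_i with an initial value of 0, treats each step as an affine map, and combines maps at strides 1, 2, 4, and so on. The inner loop is written serially, but every iteration of it is independent, so it stands in for one parallel step. Function names and the sample coefficients are hypothetical.

```python
def serial_recurrence(a, b):
    """Reference solution: N dependent steps."""
    x, out = 0, []
    for ai, bi in zip(a, b):
        x = ai * x + bi
        out.append(x)
    return out


def recursive_doubling(a, b):
    """Solve the same recurrence in ceil(log2 N) combining rounds."""
    n = len(a)
    A, B = list(a), list(b)  # element i holds the map x -> A[i] * x + B[i]
    stride, rounds = 1, 0
    while stride < n:
        new_a, new_b = A[:], B[:]
        for i in range(stride, n):  # iterations are mutually independent
            # Compose map i (applied last) with the map `stride` places back.
            new_a[i] = A[i] * A[i - stride]
            new_b[i] = A[i] * B[i - stride] + B[i]
        A, B = new_a, new_b
        stride *= 2
        rounds += 1
    return B, rounds  # with a zero initial value, x_i is the offset B[i]


a = [2, 3, 1, 2, 5, 1, 2, 3]
b = [1, 1, 1, 1, 1, 1, 1, 1]
x_parallel, rounds = recursive_doubling(a, b)
```

For N = 8 the loop runs three rounds, compared with eight dependent steps in the serial version, at the price of more total arithmetic.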
Massively Parallel Solution of Poisson Equation on Coarse Grain MIMD Architectures

NASA Technical Reports Server (NTRS)

Fijany, A.; Weinberger, D.; Roosta, R.; Gulati, S.

1998-01-01

In this paper a new algorithm, designated as Fast Invariant Imbedding algorithm, for solution of Poisson equation on vector and massively parallel MIMD architectures is presented. This algorithm achieves the same optimal computational efficiency as other Fast Poisson solvers while offering a much better structure for vector and parallel implementation. Our implementation on the Intel Delta and Paragon shows that a speedup of over two orders of magnitude can be achieved even for moderate size problems.
A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.

PubMed

Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael H F

2018-03-01

Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about better than the fastest sequential algorithm and speed-up goes up to on 64 threads.
Empirical study of parallel LRU simulation algorithms

NASA Technical Reports Server (NTRS)

Carr, Eric; Nicol, David M.

1994-01-01

This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD, and are implemented on the MasPar MP-2 architecture. Two other algorithms are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The two other SIMD algorithms are more complex, but have costs that are independent of the cache size. Both the second and third SIMD algorithms compute all stack distances; the second SIMD algorithm is completely general, whereas the third SIMD algorithm presumes and takes advantage of bounds on the range of reference tags. Both MIMD algorithms implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from execution of three SPEC benchmark programs.
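The stack distances mentioned in this abstract can be made concrete with a small serial sketch. It is not one of the five parallel algorithms in the paper; it is the textbook list-based LRU stack simulation, and the function names and the sample trace are hypothetical. A reference hits in a fully associative LRU cache of size C exactly when its stack distance is at most C, so one pass over the trace gives hit counts for every cache size.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each reference (None = first use)."""
    stack = []  # index 0 is the most recently used tag
    distances = []
    for tag in trace:
        if tag in stack:
            depth = stack.index(tag) + 1  # 1-based position in the stack
            stack.remove(tag)
        else:
            depth = None  # cold miss: the distance is infinite
        stack.insert(0, tag)
        distances.append(depth)
    return distances


def hit_count(trace, cache_size):
    """Hits in a fully associative LRU cache holding cache_size tags."""
    return sum(1 for d in stack_distances(trace)
               if d is not None and d <= cache_size)


trace = ["a", "b", "a", "c", "b", "b", "a"]
dists = stack_distances(trace)
```

Each reference costs time linear in the stack depth here, which is the kind of cost the more elaborate algorithms in the abstract aim to avoid.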
Fast Spatial Resolution Analysis of Quadratic Penalized Least-Squares Image Reconstruction With Separate Real and Imaginary Roughness Penalty: Application to fMRI.

PubMed

Olafsson, Valur T; Noll, Douglas C; Fessler, Jeffrey A

2018-02-01

Penalized least-squares iterative image reconstruction algorithms used for spatial resolution-limited imaging, such as functional magnetic resonance imaging (fMRI), commonly use a quadratic roughness penalty to regularize the reconstructed images. When used for complex-valued images, the conventional roughness penalty regularizes the real and imaginary parts equally. However, these imaging methods sometimes benefit from separate penalties for each part. The spatial smoothness from the roughness penalty on the reconstructed image is dictated by the regularization parameter(s). One method to set the parameter to a desired smoothness level is to evaluate the full width at half maximum of the reconstruction method's local impulse response. Previous work has shown that when using the conventional quadratic roughness penalty, one can approximate the local impulse response using an FFT-based calculation. However, that acceleration method cannot be applied directly for separate real and imaginary regularization. This paper proposes a fast and stable calculation for this case that also uses FFT-based calculations to approximate the local impulse responses of the real and imaginary parts. This approach is demonstrated with a quadratic image reconstruction of fMRI data that uses separate roughness penalties for the real and imaginary parts.
Near real-time analysis of extrinsic Fabry-Perot interferometric sensors under damped vibration using artificial neural networks

NASA Astrophysics Data System (ADS)

Dua, Rohit; Watkins, Steve E.

2009-03-01

Strain analysis due to vibration can provide insight into structural health. An Extrinsic Fabry-Perot Interferometric (EFPI) sensor under vibrational strain generates a non-linear modulated output. Advanced signal processing techniques, to extract important information such as absolute strain, are required to demodulate this non-linear output. Past research has employed Artificial Neural Networks (ANN) and Fast Fourier Transforms (FFT) to demodulate the EFPI sensor for limited conditions. These demodulation systems could only handle variations in absolute value of strain and frequency of actuation during a vibration event. This project uses an ANN approach to extend the demodulation system to include the variation in the damping coefficient of the actuating vibration, in a near real-time vibration scenario. A computer simulation provides training and testing data for the theoretical output of the EFPI sensor to demonstrate the approaches. FFT needed to be performed on a window of the EFPI output data. A small window of observation is obtained, while maintaining low absolute-strain prediction errors, heuristically. Results are obtained and compared from employing different ANN architectures including multi-layered feedforward ANN trained using Backpropagation Neural Network (BPNN), and Generalized Regression Neural Networks (GRNN). A two-layered algorithm fusion system is developed and tested that yields better results.
A new augmentation based algorithm for extracting maximal chordal subgraphs

DOE PAGES

Bhowmick, Sanjukta; Chen, Tzu-Yi; Halappanavar, Mahantesh

2014-10-18

A graph is chordal if every cycle of length greater than three contains an edge between non-adjacent vertices. Chordal graphs are of interest both theoretically, since they admit polynomial time solutions to a range of NP-hard graph problems, and practically, since they arise in many applications including sparse linear algebra, computer vision, and computational biology. A maximal chordal subgraph is a chordal subgraph that is not a proper subgraph of any other chordal subgraph. Existing algorithms for computing maximal chordal subgraphs depend on dynamically ordering the vertices, which is an inherently sequential process and therefore limits the algorithms’ parallelizability. In our paper we explore techniques to develop a scalable parallel algorithm for extracting a maximal chordal subgraph. We demonstrate that an earlier attempt at developing a parallel algorithm may induce a non-optimal vertex ordering and is therefore not guaranteed to terminate with a maximal chordal subgraph. We then give a new algorithm that first computes and then repeatedly augments a spanning chordal subgraph. After proving that the algorithm terminates with a maximal chordal subgraph, we then demonstrate that this algorithm is more amenable to parallelization and that the parallel version also terminates with a maximal chordal subgraph. That said, the complexity of the new algorithm is higher than that of the previous parallel algorithm, although the earlier algorithm computes a chordal subgraph which is not guaranteed to be maximal. Finally, we experimented with our augmentation-based algorithm on both synthetic and real-world graphs. We provide scalability results and also explore the effect of different choices for the initial spanning chordal subgraph on both the running time and on the number of edges in the maximal chordal subgraph.
Forest fuel treatment detection using multi-temporal airborne Lidar data and high resolution aerial imagery ---- A case study at Sierra Nevada, California

NASA Astrophysics Data System (ADS)

Su, Y.; Guo, Q.; Collins, B.; Fry, D.; Kelly, M.

2014-12-01

Forest fuel treatments (FFT) are often employed in Sierra Nevada forest (located in California, US) to enhance forest health, regulate stand density, and reduce wildfire risk. However, there have been concerns that FFTs may have negative impacts on certain protected wildlife species. Due to the constraints and protection of resources (e.g., perennial streams, cultural resources, wildlife habitat, etc.), the actual FFT extents are usually different from planned extents. Identifying the actual extent of treated areas is of primary importance to understand the environmental influence of FFTs. Light detection and ranging (Lidar) is a powerful remote sensing technique that can provide accurate forest structure measurements, which provides great potential to monitor forest changes. This study used canopy height model (CHM) and canopy cover (CC) products derived from multi-temporal airborne Lidar data to detect FFTs by an approach combining a pixel-wise thresholding method and an object-of-interest segmentation method. We also investigated forest change following the implementation of landscape-scale FFT projects through the use of normalized difference vegetation index (NDVI) and standardized principal component analysis (PCA) from multi-temporal high resolution aerial imagery. The same FFT detection routine was applied on the Lidar data and aerial imagery for the purpose of comparing the capability of Lidar data and aerial imagery on FFT detection. Our results demonstrated that the FFT detection using Lidar derived CC products produced both the highest total accuracy and kappa coefficient, and was more robust at identifying areas with light FFTs. The accuracy using Lidar derived CHM products was significantly lower than that of the result using Lidar derived CC, but was still slightly higher than using aerial imagery. FFT detection results using NDVI and standardized PCA using multi-temporal aerial imagery produced almost identical total accuracy and kappa coefficient. Both methods showed relatively limited capacity to detect light FFT areas, and had a higher false detection rate (recognized untreated areas as treated areas) compared to the methods using Lidar derived parameters.
A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations

PubMed Central

Ho, ThienLuan; Oh, Seung-Rohk

2017-01-01

Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPU warp share data using warp-shuffle operations instead of accessing the shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experimental results for real DNA packages revealed that the proposed algorithm and its implementation achieved speedups of up to 122.64 and 1.53 times over the sequential algorithm on a CPU and a previous parallel approximate string matching algorithm on GPUs, respectively. PMID:29016700
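For readers unfamiliar with the k-differences problem named in this record, the following is a minimal sequential sketch of the classic dynamic-programming formulation (one column of the edit-distance table per text character). It is not the warp-shuffle GPU algorithm of the paper; the function name `k_difference_matches` is hypothetical and chosen for illustration only.

```python
def k_difference_matches(pattern, text, k):
    """Return 1-based end positions in `text` where some substring ending
    there is within Levenshtein distance k of `pattern`.

    Sequential illustration only; the record above describes a GPU method.
    """
    m = len(pattern)
    # col[i] holds the smallest edit distance between pattern[:i] and any
    # substring of the text ending at the current text position.
    col = list(range(m + 1))
    matches = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]   # value of the cell diagonally up-left
        col[0] = 0           # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in the text
                      col[i - 1] + 1)    # missing character in the text
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            matches.append(j)
    return matches
```

Each column depends only on the previous one, which is the data dependency a parallel implementation has to exchange between threads.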
A matrix-algebraic formulation of distributed-memory maximal cardinality matching algorithms in bipartite graphs

DOE PAGES

Azad, Ariful; Buluç, Aydın

2016-05-16

We describe parallel algorithms for computing maximal cardinality matching in a bipartite graph on distributed-memory systems. Unlike traditional algorithms that match one vertex at a time, our algorithms process many unmatched vertices simultaneously using a matrix-algebraic formulation of maximal matching. This generic matrix-algebraic framework is used to develop three efficient maximal matching algorithms with minimal changes. The newly developed algorithms have two benefits over existing graph-based algorithms. First, unlike existing parallel algorithms, the cardinality of the matching obtained by the new algorithms stays constant with increasing processor counts, which is important for predictable and reproducible performance. Second, relying on bulk-synchronous matrix operations, these algorithms expose a higher degree of parallelism on distributed-memory platforms than existing graph-based algorithms. We report high-performance implementations of three maximal matching algorithms using hybrid OpenMP-MPI and evaluate the performance of these algorithms using more than 35 real and randomly generated graphs. On real instances, our algorithms achieve up to 200× speedup on 2048 cores of a Cray XC30 supercomputer. Even higher speedups are obtained on larger synthetically generated graphs where our algorithms show good scaling on up to 16,384 cores.
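The idea of handling many unmatched vertices per step, rather than one at a time, can be illustrated with a small round-based sketch. This is an assumption-laden toy, not the matrix-algebraic algorithms of the record: in each round every unmatched row vertex proposes to its first free column neighbour, and each column accepts a single proposal. The name `maximal_matching_rounds` and the adjacency-dict input format are hypothetical.

```python
def maximal_matching_rounds(adj):
    """Round-based maximal matching in a bipartite graph.

    adj maps each row vertex to a list of column vertices.
    Returns a dict row -> matched column. The result is maximal
    (no edge joins two unmatched vertices), not necessarily maximum.
    """
    match = {}
    used_cols = set()
    while True:
        proposals = {}  # column -> row whose proposal it accepts
        for row in sorted(adj):
            if row in match:
                continue
            for col in adj[row]:
                if col not in used_cols:
                    # the first proposer (lowest row id) wins the column
                    proposals.setdefault(col, row)
                    break
        if not proposals:
            return match
        # all accepted proposals of this round are applied together
        for col, row in proposals.items():
            match[row] = col
            used_cols.add(col)
```

Because ties are broken deterministically, the matching does not depend on how the proposal loop would be split across workers.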
Parallel grid generation algorithm for distributed memory computers

NASA Technical Reports Server (NTRS)

Moitra, Stuti; Moitra, Anutosh

1994-01-01

A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levels of parallelism on multiple instruction multiple data machines is indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations based on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
A low-cost vector processor boosting compute-intensive image processing operations

NASA Technical Reports Server (NTRS)

Adorf, Hans-Martin

1992-01-01

Low-cost vector processing (VP) is within reach of everyone seriously engaged in scientific computing. The advent of affordable add-on VP-boards for standard workstations complemented by mathematical/statistical libraries is beginning to impact compute-intensive tasks such as image processing. A case in point is the restoration of distorted images from the Hubble Space Telescope. A low-cost implementation is presented of the standard Tarasko-Richardson-Lucy restoration algorithm on an Intel i860-based VP-board which is seamlessly interfaced to a commercial, interactive image processing system. First experience is reported (including some benchmarks for standalone FFTs) and some conclusions are drawn.
An Accurate and Stable FFT-based Method for Pricing Options under Exp-Lévy Processes

NASA Astrophysics Data System (ADS)

Ding, Deng; Chong U, Sio

2010-05-01

An accurate and stable method for pricing European options in exp-Lévy models is presented. The main idea of this new method is to combine the quadrature technique and the Carr-Madan Fast Fourier Transform methods. The theoretical analysis shows that the overall complexity of this new method is still O(N log N) for N grid points, the same as the fast Fourier transform methods. Numerical experiments for different exp-Lévy processes also show that the numerical algorithm proposed by this new method is accurate and stable for small strike prices K, which develops and improves the Carr-Madan method.
A real-time KLT implementation for radio-SETI applications

NASA Astrophysics Data System (ADS)

Melis, Andrea; Concu, Raimondo; Pari, Pierpaolo; Maccone, Claudio; Montebugnoli, Stelio; Possenti, Andrea; Valente, Giuseppe; Antonietti, Nicoló; Perrodin, Delphine; Migoni, Carlo; Murgia, Matteo; Trois, Alessio; Barbaro, Massimo; Bocchinu, Alessandro; Casu, Silvia; Lunesu, Maria Ilaria; Monari, Jader; Navarrini, Alessandro; Pisanu, Tonino; Schilliró, Francesco; Vacca, Valentina

2016-07-01

SETI, the Search for ExtraTerrestrial Intelligence, is the search for radio signals emitted by alien civilizations living in the Galaxy. Narrow-band FFT-based approaches have been preferred in SETI, since their computation time only grows like N*lnN, where N is the number of time samples. On the contrary, a wide-band approach based on the Karhunen-Loève Transform (KLT) algorithm would be preferable, but it would scale like N*N. In this paper, we describe a hardware-software infrastructure based on FPGA boards and GPU-based PCs that circumvents this computation-time problem, allowing for a real-time KLT.
Parallel processing considerations for image recognition tasks

NASA Astrophysics Data System (ADS)

Simske, Steven J.

2011-01-01

Many image recognition tasks are well-suited to parallel processing. The most obvious example is that many imaging tasks require the analysis of multiple images. From this standpoint, then, parallel processing need be no more complicated than assigning individual images to individual processors. However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. Parallel processing by task allows the assignment of multiple workflows (as diverse as optical character recognition [OCR], document classification and barcode reading) to parallel pipelines. This can substantially decrease time to completion for the document tasks. For this approach, each parallel pipeline is generally performing a different task. Parallel processing by image region allows a larger imaging task to be sub-divided into a set of parallel pipelines, each performing the same task but on a different data set. This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. Finally, parallel processing by meta-algorithm allows different algorithms to be deployed on the same image simultaneously. This approach may result in improved accuracy.
A privacy-preserving parallel and homomorphic encryption scheme

NASA Astrophysics Data System (ADS)

Min, Zhaoe; Yang, Geng; Shi, Jingqi

2017-04-01

In order to protect data privacy whilst allowing efficient access to data in multi-node cloud environments, a parallel homomorphic encryption (PHE) scheme is proposed based on the additive homomorphism of the Paillier encryption algorithm. In this paper we propose a PHE algorithm, in which plaintext is divided into several blocks and the blocks are encrypted in parallel. Experimental results demonstrate that the encryption algorithm can reach a speed-up ratio of about 7.1 in the MapReduce environment with 16 cores and 4 nodes.
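The additive homomorphism this record relies on can be shown with a toy Paillier sketch in which plaintext blocks are encrypted independently. Everything here is an illustrative assumption: the tiny primes are insecure, the function names (`keygen`, `encrypt_blocks`, `add_encrypted`) are hypothetical, and a thread pool stands in for the MapReduce environment of the paper, so no speed-up should be expected from it.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def keygen(p, q):
    """Toy Paillier key pair from two primes, using generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Encrypt one block m with 0 <= m < n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def encrypt_blocks(n, blocks, workers=4):
    """Blocks are independent, so they can be encrypted concurrently."""
    rng = random.SystemRandom()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda m: encrypt(n, m, rng), blocks))

def add_encrypted(n, ciphertexts):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    total = 1
    for c in ciphertexts:
        total = (total * c) % (n * n)
    return total
```

With `n, priv = keygen(61, 53)`, decrypting `add_encrypted(n, encrypt_blocks(n, [12, 34, 56]))` recovers the sum of the blocks, provided that sum stays below `n`.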
A derivation and scalable implementation of the synchronous parallel kinetic Monte Carlo method for simulating long-time dynamics

NASA Astrophysics Data System (ADS)

Byun, Hye Suk; El-Naggar, Mohamed Y.; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya

2017-10-01

Kinetic Monte Carlo (KMC) simulations are used to study long-time dynamics of a wide variety of systems. Unfortunately, the conventional KMC algorithm is not scalable to larger systems, since its time scale is inversely proportional to the simulated system size. A promising approach to resolving this issue is the synchronous parallel KMC (SPKMC) algorithm, which makes the time scale size-independent. This paper introduces a formal derivation of the SPKMC algorithm based on local transition-state and time-dependent Hartree approximations, as well as its scalable parallel implementation based on a dual linked-list cell method. The resulting algorithm has achieved a weak-scaling parallel efficiency of 0.935 on 1024 Intel Xeon processors for simulating biological electron transfer dynamics in a 4.2 billion-heme system, as well as decent strong-scaling parallel efficiency. The parallel code has been used to simulate a lattice of cytochrome complexes on a bacterial-membrane nanowire, and it is broadly applicable to other problems such as computational synthesis of new materials.
A parallel row-based algorithm with error control for standard-cell placement on a hypercube multiprocessor

NASA Technical Reports Server (NTRS)

Sargent, Jeff Scott

1988-01-01

A new row-based parallel algorithm for standard-cell placement targeted for execution on a hypercube multiprocessor is presented. Key features of this implementation include a dynamic simulated-annealing schedule, row-partitioning of the VLSI chip image, and two novel approaches to controlling error in parallel cell-placement algorithms: Heuristic Cell-Coloring and Adaptive (Parallel Move) Sequence Control. Heuristic Cell-Coloring identifies sets of noninteracting cells that can be moved repeatedly, and in parallel, with no buildup of error in the placement cost. Adaptive Sequence Control allows multiple parallel cell moves to take place between global cell-position updates. This feedback mechanism is based on an error bound derived analytically from the traditional annealing move-acceptance profile. Placement results are presented for real industry circuits, and the performance of an implementation on the Intel iPSC/2 Hypercube is summarized. The runtime of this algorithm is 5 to 16 times faster than a previous program developed for the Hypercube, while producing equivalent quality placement. An integrated place and route program for the Intel iPSC/2 Hypercube is currently being developed.
Parallel algorithms for mapping pipelined and parallel computations

NASA Technical Reports Server (NTRS)

Nicol, David M.

1988-01-01

Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm^3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm^2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
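As a concrete picture of the mapping problem described in this record, the sketch below assigns a chain of m weighted modules to n processors as contiguous groups so that the heaviest processor load (the pipeline bottleneck) is as small as possible. It uses a simple binary search with a greedy feasibility probe and assumes non-negative integer weights; it is not the O(nm log m) algorithm of the paper, and the name `min_bottleneck_partition` is hypothetical.

```python
def min_bottleneck_partition(weights, n):
    """Smallest achievable maximum load when a chain of modules with the
    given integer weights is split into at most n contiguous groups."""

    def fits(limit):
        # Greedy probe: fill each processor until the next module would
        # exceed the limit, then move on to the next processor.
        used, load = 1, 0
        for w in weights:
            if w > limit:
                return False
            if load + w > limit:
                used += 1
                load = w
            else:
                load += w
        return used <= n

    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            hi = mid        # mid is feasible, try a smaller bottleneck
        else:
            lo = mid + 1    # mid is infeasible, the answer is larger
    return lo
```

For example, six modules with weights 4, 2, 7, 1, 3, 5 on three processors can be grouped as (4, 2), (7, 1), (3, 5), giving a bottleneck of 8.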

A method to perform a fast fourier transform with primitive image transformations.

PubMed

Sheridan, Phil

2007-05-01

The Fourier transform is one of the most important transformations in image processing. A major component of this influence comes from the ability to implement it efficiently on a digital computer. This paper describes a new methodology to perform a fast Fourier transform (FFT). This methodology emerges from considerations of the natural physical constraints imposed by image capture devices (camera/eye). The novel aspects of the specific FFT method described include: 1) a bit-wise reversal re-grouping operation of the conventional FFT is replaced by the use of lossless image rotation and scaling and 2) the usual arithmetic operations of complex multiplication are replaced with integer addition. The significance of the FFT presented in this paper is introduced by extending a discrete and finite image algebra, named Spiral Honeycomb Image Algebra (SHIA), to a continuous version, named SHIAC.
On the electromagnetic scattering from infinite rectangular grids with finite conductivity

NASA Technical Reports Server (NTRS)

Christodoulou, C. G.; Kauffman, J. F.

1986-01-01

A variety of methods can be used in constructing solutions to the problem of mesh scattering. However, each of these methods has certain drawbacks. The present paper is concerned with a new technique which is valid for all spacings. The new method involved, called the fast Fourier transform-conjugate gradient method (FFT-CGM), represents an iterative technique which employs the conjugate gradient method to improve upon each iterate, utilizing the fast Fourier transform. The FFT-CGM method provides a new accurate model which can be extended and applied to the more difficult problems of woven mesh surfaces. The formulation of the FFT-conjugate gradient method for aperture fields and current densities for a planar periodic structure is considered along with singular operators, the formulation of the FFT-CG method for thin wires with finite conductivity, and reflection coefficients.
Pipelined digital SAR azimuth correlator using hybrid FFT-transversal filter

NASA Technical Reports Server (NTRS)

Wu, C.; Liu, K. Y. (Inventor)

1984-01-01

A synthetic aperture radar system (SAR) having a range correlator is provided with a hybrid azimuth correlator which utilizes a block-pipelined fast Fourier transform (FFT). The correlator has a predetermined FFT transform size with delay elements for delaying SAR range correlated data so as to embed in the Fourier transform operation a corner-turning function as the range correlated SAR data is converted from the time domain to a frequency domain. The azimuth correlator is comprised of a transversal filter to receive the SAR data in the frequency domain, a generator for range migration compensation and azimuth reference functions, and an azimuth reference multiplier for correlation of the SAR data. Following the transversal filter is a block-pipelined inverse FFT used to restore azimuth correlated data in the frequency domain to the time domain for imaging.
Efficient sequential and parallel algorithms for record linkage

PubMed Central

Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar

2014-01-01

Background and objective: Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Methods: Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Results: Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). Conclusions: We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm. PMID:24154837
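The two ideas named in the Methods of this record (removing identical records first, then linking similar records in a graph and taking connected components) can be sketched as follows. This is a simplified illustration under stated assumptions: a set-based deduplication stands in for the radix sort, all pairs are compared naively, and both `link_records` and the similarity test `differ_by_at_most_one` are hypothetical names, not the authors' code.

```python
def differ_by_at_most_one(a, b):
    """Hypothetical similarity test: equal length, at most one mismatch."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def link_records(records, similar=differ_by_at_most_one):
    """Group records into clusters of transitively similar records."""
    # Step 1: eliminate identical records before any further processing.
    unique = sorted(set(records))

    # Step 2: union-find over the graph whose edges join similar records.
    parent = list(range(len(unique)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if similar(unique[i], unique[j]):
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Step 3: each connected component becomes one cluster.
    clusters = {}
    for i, rec in enumerate(unique):
        clusters.setdefault(find(i), []).append(rec)
    return sorted(clusters.values())
```

Note that components link records transitively: "smita" and "smyth" end up together through "smith" even though they differ in two positions.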
A Parallel Ghosting Algorithm for The Flexible Distributed Mesh Database

DOE PAGES

Mubarak, Misbah; Seol, Seegyoung; Lu, Qiukai; ...

2013-01-01

Critical to the scalability of parallel adaptive simulations are parallel control functions including load balancing, reduced inter-process communication and optimal data decomposition. In distributed meshes, many mesh-based applications frequently access neighborhood information for computational purposes which must be transmitted efficiently to avoid parallel performance degradation when the neighbors are on different processors. This article presents a parallel algorithm for creating and deleting data copies, referred to as ghost copies, which localize neighborhood data for computation purposes while minimizing inter-process communication. The key characteristics of the algorithm are: (1) It can create ghost copies of any permissible topological order in a 1D, 2D or 3D mesh based on selected adjacencies. (2) It exploits neighborhood communication patterns during the ghost creation process thus eliminating all-to-all communication. (3) For applications that need neighbors of neighbors, the algorithm can create n number of ghost layers up to a point where the whole partitioned mesh can be ghosted. Strong and weak scaling results are presented for the IBM BG/P and Cray XE6 architectures up to a core count of 32,768 processors. The algorithm also leads to scalable results when used in a parallel super-convergent patch recovery error estimator, an application that frequently accesses neighborhood data to carry out computation.
Fourier transform and particle swarm optimization based modified LQR algorithm for mitigation of vibrations using magnetorheological dampers

NASA Astrophysics Data System (ADS)

Kumar, Gaurav; Kumar, Ashok

2017-11-01

Structural control has gained significant attention in recent times. The standalone issue of power requirement during an earthquake has already been solved up to a large extent by designing semi-active control systems using conventional linear quadratic control theory, and many other intelligent control algorithms such as fuzzy controllers, artificial neural networks, etc. In conventional linear-quadratic regulator (LQR) theory, it is customary to note that the values of the design parameters are decided at the time of designing the controller and cannot be subsequently altered. During an earthquake event, the response of the structure may increase or decrease, depending on the quasi-resonance occurring between the structure and the earthquake. In this case, it is essential to modify the value of the design parameters of the conventional LQR controller to obtain optimum control force to mitigate the vibrations due to the earthquake. A few studies have been done to sort out this issue but in all these studies it was necessary to maintain a database of the earthquake. To solve this problem and to find the optimized design parameters of the LQR controller in real time, a fast Fourier transform and particle swarm optimization based modified linear quadratic regulator method is presented here. This method comprises four different algorithms: particle swarm optimization (PSO), the fast Fourier transform (FFT), clipped control algorithm and the LQR. The FFT helps to obtain the dominant frequency for every time window. PSO finds the optimum gain matrix through the real-time update of the weighting matrix R, thereby dispensing with the experimentation. The clipped control law is employed to match the magnetorheological (MR) damper force with the desired force given by the controller. The modified Bouc-Wen phenomenological model is taken to recognize the nonlinearities in the MR damper. The assessment of the proposed method is done by simulation of a three-story structure having an MR damper at the ground floor level subjected to three different near-fault historical earthquake time histories, and the outcomes are compared with those of simple conventional LQR. The results establish that the proposed methodology is more effective than conventional LQR controllers in reducing inter-storey drift, relative displacement, and acceleration response.
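The step in which the FFT yields the dominant frequency for every time window can be illustrated with a small, dependency-free sketch. It assumes non-overlapping windows whose length is a power of two and uses a textbook recursive radix-2 FFT; the function names and the windowing scheme are assumptions for illustration, not details taken from the paper.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dominant_frequencies(signal, fs, window):
    """Dominant frequency (Hz) of each non-overlapping window.

    fs is the sampling rate in Hz. The DC bin is skipped and only the
    positive-frequency half of the spectrum is searched.
    """
    result = []
    for start in range(0, len(signal) - window + 1, window):
        spectrum = fft(signal[start:start + window])
        k = max(range(1, window // 2), key=lambda i: abs(spectrum[i]))
        result.append(k * fs / window)
    return result
```

The frequency resolution is `fs / window`, so shorter windows track changes faster but locate the dominant frequency more coarsely.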
Parallelization of a blind deconvolution algorithm

NASA Astrophysics Data System (ADS)

Matson, Charles L.; Borelli, Kathy J.

2006-09-01

Often it is of interest to deblur imagery in order to obtain higher-resolution images. Deblurring requires knowledge of the blurring function - information that is often not available separately from the blurred imagery. Blind deconvolution algorithms overcome this problem by jointly estimating both the high-resolution image and the blurring function from the blurred imagery. Because blind deconvolution algorithms are iterative in nature, they can take minutes to days to deblur an image depending on how many frames of data are used for the deblurring and the platforms on which the algorithms are executed. Here we present our progress in parallelizing a blind deconvolution algorithm to increase its execution speed. This progress includes sub-frame parallelization and a code structure that is not specialized to a specific computer hardware architecture.
High Performance Implementation of 3D Convolutional Neural Networks on a GPU.

PubMed

Lan, Qiang; Wang, Zelong; Wen, Mei; Zhang, Chunyuan; Wang, Yijie

2017-01-01

Convolutional neural networks have proven to be highly successful in applications such as image classification, object tracking, and many other tasks based on 2D inputs. Recently, researchers have started to apply convolutional neural networks to video classification, which constitutes a 3D input and requires far larger amounts of memory and much more computation. FFT based methods can reduce the amount of computation, but this generally comes at the cost of an increased memory requirement. On the other hand, the Winograd Minimal Filtering Algorithm (WMFA) can reduce the number of operations required and thus can speed up the computation, without increasing the required memory. This strategy was shown to be successful for 2D neural networks. We implement the algorithm for 3D convolutional neural networks and apply it to a popular 3D convolutional neural network which is used to classify videos and compare it to cuDNN. For our highly optimized implementation of the algorithm, we observe a twofold speedup for most of the 3D convolution layers of our test network compared to the cuDNN version.
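How Winograd minimal filtering reduces the operation count can be seen in its smallest one-dimensional case, F(2,3): two outputs of a 3-tap filter are computed with 4 multiplications instead of 6. The sketch below shows only this 1D building block with hypothetical function names; the 3D GPU implementation described in the record is not reproduced here.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over the 4-sample tile d, using
    four multiplications (the filter-side sums can be precomputed)."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * ((g0 + g1 + g2) / 2)
    m3 = (d2 - d1) * ((g0 - g1 + g2) / 2)
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference computation with six multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]
```

In a network layer the filter-side terms are shared across all input tiles, which is why the saving applies to the data-dependent multiplications.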
High Performance Implementation of 3D Convolutional Neural Networks on a GPU

PubMed Central

Wang, Zelong; Wen, Mei; Zhang, Chunyuan; Wang, Yijie

2017-01-01

Convolutional neural networks have proven to be highly successful in applications such as image classification, object tracking, and many other tasks based on 2D inputs. Recently, researchers have started to apply convolutional neural networks to video classification, which constitutes a 3D input and requires far larger amounts of memory and much more computation. FFT based methods can reduce the amount of computation, but this generally comes at the cost of an increased memory requirement. On the other hand, the Winograd Minimal Filtering Algorithm (WMFA) can reduce the number of operations required and thus can speed up the computation, without increasing the required memory. This strategy was shown to be successful for 2D neural networks. We implement the algorithm for 3D convolutional neural networks and apply it to a popular 3D convolutional neural network which is used to classify videos and compare it to cuDNN. For our highly optimized implementation of the algorithm, we observe a twofold speedup for most of the 3D convolution layers of our test network compared to the cuDNN version. PMID:29250109
BeiDou Signal Acquisition with Neumann–Hoffman Code Modulation in a Degraded Channel

PubMed Central

Zhao, Lin; Liu, Aimeng; Ding, Jicheng; Wang, Jing

2017-01-01

With the modernization of global navigation satellite systems (GNSS), secondary codes, also known as the Neumann–Hoffman (NH) codes, are modulated on the satellite signal to obtain a better positioning performance. However, this leads to an attenuation of the acquisition sensitivity of classic integration algorithms because of the frequent bit transitions that refer to the NH codes. Taking weak BeiDou navigation satellite system (BDS) signals as objects, the present study analyzes the side effect of NH codes on acquisition in detail and derives a straightforward formula, which indicates that bit transitions decrease the frequency accuracy. To meet the requirement of carrier-tracking loop initialization, a frequency recalculation algorithm is proposed based on verified fast Fourier transform (FFT) to mitigate the effect; meanwhile, the starting point of the NH codes is found. Then, a differential correction is utilized to improve the acquisition accuracy of code phase. Monte Carlo simulations and real BDS data tests demonstrate that the new structure is superior to the conventional algorithms both in detection probability and frequency accuracy in a degraded channel. PMID:28208776
Algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations with the use of parallel computations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moryakov, A. V., E-mail: sailor@orc.ru

2016-12-15

An algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations is presented. The algorithm for systems of first-order differential equations is implemented in the EDELWEISS code with the possibility of parallel computations on supercomputers employing the MPI (Message Passing Interface) standard for the data exchange between parallel processes. The solution is represented by a series of orthogonal polynomials on the interval [0, 1]. The algorithm is characterized by simplicity and the possibility to solve nonlinear problems with a correction of the operator in accordance with the solution obtained in the previous iterative process.
Fructan:fructan 1-fructosyltransferase and inulin hydrolase activities relating to inulin and soluble sugars in Jerusalem artichoke (Helianthus tuberosus Linn.) tubers during storage.

PubMed

Maicaurkaew, Sukanya; Jogloy, Sanun; Hamaker, Bruce R; Ningsanond, Suwayd

2017-03-01

Influences of harvest time and storage conditions on activities of fructan:fructan 1-fructosyltransferase (1-FFT) and inulin hydrolase (InH) in relation to inulin and soluble sugars of Jerusalem artichoke (Helianthus tuberosus L.) tubers were investigated. Maturity affected 1-FFT activity, inulin contents, and inulin profiles of the tubers harvested between 30 and 70 days after flowering (DAF). Decreases in 1-FFT activity, high molecular weight inulin, and inulin content were observed in late-harvested tubers. The tubers harvested at 50 DAF had the highest inulin content (734.9 ± 20.5 g kg^-1 DW) with a high degree of polymerization (28% of DP >30). During storage of the tubers, increases in InH activity (which reached its peak at 15 days of storage) and gradual decreases in 1-FFT activity took place. These changes were associated with inulin depolymerization, causing decreases in inulin content and increases in soluble sugars. As well, decreasing storage temperatures would retain high inulin content and keep soluble sugars low; and freezing at -18 °C would best retard 1-FFT, InH, and inulin changes.
Parallel, stochastic measurement of molecular surface area.

PubMed

Juba, Derek; Varshney, Amitabh

2008-08-01

Biochemists often wish to compute surface areas of proteins. A variety of algorithms have been developed for this task, but they are designed for traditional single-processor architectures. The current trend in computer hardware is towards increasingly parallel architectures for which these algorithms are not well suited. We describe a parallel, stochastic algorithm for molecular surface area computation that maps well to the emerging multi-core architectures. Our algorithm is also progressive, providing a rough estimate of surface area immediately and refining this estimate as time goes on. Furthermore, the algorithm generates points on the molecular surface which can be used for point-based rendering. We demonstrate a GPU implementation of our algorithm and show that it compares favorably with several existing molecular surface computation programs, giving fast estimates of the molecular surface area with good accuracy.
Concurrent extensions to the FORTRAN language for parallel programming of computational fluid dynamics algorithms

NASA Technical Reports Server (NTRS)

Weeks, Cindy Lou

1986-01-01

Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
Parallel algorithm for determining motion vectors in ice floe images by matching edge features

NASA Technical Reports Server (NTRS)

Manohar, M.; Ramapriyan, H. K.; Strong, J. P.

1988-01-01

A parallel algorithm is described to determine motion vectors of ice floes using time sequences of images of the Arctic ocean obtained from the Synthetic Aperture Radar (SAR) instrument flown on-board the SEASAT spacecraft. Researchers describe a parallel algorithm which is implemented on the MPP for locating corresponding objects based on their translationally and rotationally invariant features. The algorithm first approximates the edges in the images by polygons or sets of connected straight-line segments. Each such edge structure is then reduced to a seed point. Associated with each seed point are the descriptions (lengths, orientations and sequence numbers) of the lines constituting the corresponding edge structure. A parallel matching algorithm is used to match packed arrays of such descriptions to identify corresponding seed points in the two images. The matching algorithm is designed such that fragmentation and merging of ice floes are taken into account by accepting partial matches. The technique has been demonstrated to work on synthetic test patterns and real image pairs from SEASAT in times ranging from 0.5 to 0.7 seconds for 128 x 128 images.
A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore Platform

PubMed Central

Yang, Lin; Gong, Leiguang; Zhang, Hong; Nosher, John L.; Foran, David J.

2013-01-01

Point matching is crucial for many computer vision applications. Establishing the correspondence between a large number of data points is a computationally intensive process. Some point matching related applications, such as medical image registration, require real time or near real time performance if applied to critical clinical applications like image assisted surgery. In this paper, we report a new multicore platform based parallel algorithm for fast point matching in the context of landmark based medical image registration. We introduced a non-regular data partition algorithm which utilizes the K-means clustering algorithm to group the landmarks based on the number of available processing cores, which optimizes the memory usage and data transfer. We have tested our method using the IBM Cell Broadband Engine (Cell/B.E.) platform. The results demonstrated a significant speedup over its sequential implementation. The proposed data partition and parallelization algorithm, though tested only on one multicore platform, is generic by design. Therefore the parallel algorithm can be extended to other computing platforms, as well as other point matching related applications. PMID:24308014
Implementation and analysis of a Navier-Stokes algorithm on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1988-01-01

The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Solving very large, sparse linear systems on mesh-connected parallel computers

NASA Technical Reports Server (NTRS)

Opsahl, Torstein; Reif, John

1987-01-01

The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard is given. Also, a detailed discussion of data mappings and performance issues is given.
Handling Big Data in Medical Imaging: Iterative Reconstruction with Large-Scale Automated Parallel Computation

PubMed Central

Lee, Jae H.; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T.; Seo, Youngho

2014-01-01

The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximization (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to-program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing the MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains, with the goal of eventually making it usable in a clinical setting. PMID:27081299
Handling Big Data in Medical Imaging: Iterative Reconstruction with Large-Scale Automated Parallel Computation.

PubMed

Lee, Jae H; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T; Seo, Youngho

2014-11-01

The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximization (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to-program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing the MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains, with the goal of eventually making it usable in a clinical setting.

A Parallel Saturation Algorithm on Shared Memory Architectures

NASA Technical Reports Server (NTRS)

Ezekiel, Jonathan; Siminiceanu

2007-01-01

Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
PEPSI-Dock: a detailed data-driven protein-protein interaction potential accelerated by polar Fourier correlation.

PubMed

Neveu, Emilie; Ritchie, David W; Popov, Petr; Grudinin, Sergei

2016-09-01

Docking prediction algorithms aim to find the native conformation of a complex of proteins from knowledge of their unbound structures. They rely on a combination of sampling and scoring methods, adapted to different scales. Polynomial Expansion of Protein Structures and Interactions for Docking (PEPSI-Dock) improves the accuracy of the first stage of the docking pipeline, which will sharpen up the final predictions. Indeed, PEPSI-Dock benefits from the precision of a very detailed data-driven model of the binding free energy used with a global and exhaustive rigid-body search space. As well as being accurate, our computations are among the fastest by virtue of the sparse representation of the pre-computed potentials and FFT-accelerated sampling techniques. Overall, this is the first demonstration of an FFT-accelerated docking method coupled with an arbitrary-shaped distance-dependent interaction potential. First, we present a novel learning process to compute data-driven distance-dependent pairwise potentials, adapted from our previous method used for rescoring of putative protein-protein binding poses. The potential coefficients are learned by combining machine-learning techniques with physically interpretable descriptors. Then, we describe the integration of the deduced potentials into an FFT-accelerated spherical sampling provided by the Hex library. Overall, on a training set of 163 heterodimers, PEPSI-Dock achieves a success rate of 91% mid-quality predictions in the top-10 solutions. On a subset of the protein docking benchmark v5, it achieves 44.4% mid-quality predictions in the top-10 solutions when starting from bound structures and 20.5% when starting from unbound structures. The method runs in 5-15 min on a modern laptop and can easily be extended to other types of interactions. Availability: https://team.inria.fr/nano-d/software/PEPSI-Dock. Contact: sergei.grudinin@inria.fr.
An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vydyanathan, Naga; Krishnamoorthy, Sriram; Sabin, Gerald M.

2009-08-01

Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application-tasks with dependences. These applications exhibit both task- and data-parallelism, and combining these two (also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task- and data-parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks and the inter-task data communication volumes. A locality conscious scheduling strategy is used to improve inter-task data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications as well as synthetic graphs shows that our algorithm consistently generates schedules with lower makespan as compared to CPR and CPA, two previously proposed scheduling algorithms. Our algorithm also produces schedules that have lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches.
A Parallel Compact Multi-Dimensional Numerical Algorithm with Aeroacoustics Applications

NASA Technical Reports Server (NTRS)

Povitsky, Alex; Morris, Philip J.

1999-01-01

In this study we propose a novel method to parallelize high-order compact numerical algorithms for the solution of three-dimensional PDEs (Partial Differential Equations) in a space-time domain. For this numerical integration most of the computer time is spent in computation of spatial derivatives at each stage of the Runge-Kutta temporal update. The most efficient direct method to compute spatial derivatives on a serial computer is a version of Gaussian elimination for narrow linear banded systems known as the Thomas algorithm. In a straightforward pipelined implementation of the Thomas algorithm processors are idle due to the forward and backward recurrences of the Thomas algorithm. To utilize processors during this time, we propose to use them for either non-local data independent computations, solving lines in the next spatial direction, or local data-dependent computations by the Runge-Kutta method. To achieve this goal, control of processor communication and computations by a static schedule is adopted. Thus, our parallel code is driven by a communication and computation schedule instead of the usual "creative programming" approach. The obtained parallelization speed-up of the novel algorithm is about twice as much as that for the standard pipelined algorithm and close to that for the explicit DRP algorithm.
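To make the forward and backward recurrences mentioned above concrete, here is a minimal serial sketch of the Thomas algorithm for a tridiagonal system. It is not the authors' pipelined or scheduled code; the function name `thomas_solve` and its argument layout are assumptions chosen for illustration, and no pivoting is done, so the sketch presumes a well-conditioned (for example diagonally dominant) system.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system (illustrative sketch, no pivoting).

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    diag[i] multiplies x[i],
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward recurrence: row i needs the result of row i-1,
    # which is why a pipelined version leaves processors waiting.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Backward recurrence: x[i] needs x[i+1].
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Both loops carry a dependence from one row to the next, which is the source of the idle time the abstract proposes to fill with other work.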
Relation of Parallel Discrete Event Simulation algorithms with physical models

NASA Astrophysics Data System (ADS)

Shchur, L. N.; Shchur, L. V.

2015-09-01

We extend the concept of local simulation times in parallel discrete event simulation (PDES) in order to take into account the architecture of current hardware and software in high-performance computing. We briefly review previous research on the mapping of PDES onto physical problems, and emphasise how physical results may help to predict the behaviour of parallel algorithms.
Highly parallel sparse Cholesky factorization

NASA Technical Reports Server (NTRS)

Gilbert, John R.; Schreiber, Robert

1990-01-01

Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.
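Since the abstract treats dense factorization as the key subroutine, a minimal serial sketch of a dense Cholesky factorization may help fix ideas. This is the textbook row-by-row variant, not the Connection Machine implementation; the function name `cholesky` is an assumption for illustration.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (illustrative sketch).

    `a` is a symmetric positive definite matrix given as a list of rows.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already computed parts of rows i and j.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(pivot)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L
```

In a sparse code of the kind described, many small dense blocks like this would be factored at once; the sketch shows only the arithmetic of a single block.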
Mining algorithm for association rules in big data based on Hadoop

NASA Astrophysics Data System (ADS)

Fu, Chunhua; Wang, Xiaojing; Zhang, Lijun; Qiao, Liying

2018-04-01

Traditional association rule mining algorithms can no longer meet the efficiency and scalability demands of mining large volumes of data. Taking FP-Growth as an example, the algorithm is parallelized on the Hadoop framework using the MapReduce model. On this basis, it is further improved with the transaction reduction method to enhance mining efficiency. The experiments, run on a Hadoop cluster, cover verification of the parallel mining results, an efficiency comparison between serial and parallel execution, and the relationships between mining time and node count and between mining time and data volume. They show that the parallelized FP-Growth algorithm mines frequent item sets accurately, with better performance and scalability. It is therefore better suited to big data mining and can efficiently mine frequent item sets and association rules from large datasets.
A heuristic for suffix solutions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bilgory, A.; Gajski, D.D.

1986-01-01

The suffix problem has appeared in solutions of recurrence systems for parallel and pipelined machines and more recently in the design of gate and silicon compilers. In this paper the authors present two algorithms. The first algorithm generates parallel suffix solutions with minimum cost for a given length, time delay, availability of initial values, and fanout. This algorithm generates a minimal solution for any length $N$ and depth range $\log_2 N$ to $N$. The second algorithm reduces the size of the solutions generated by the first algorithm.
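As a reference point for the depth range quoted above, the following sketch computes all suffixes of a sequence under an associative operator in a logarithmic number of levels. It is the standard stride-doubling scheme, not the authors' minimum-cost construction; `suffix_scan` is a hypothetical name, and the list comprehension stands in for operations that a parallel machine would perform simultaneously.

```python
import operator


def suffix_scan(values, op=operator.add):
    """Return (suffixes, depth) where suffixes[i] = values[i] op ... op values[-1].

    Illustrative log-depth scheme: at each level every position combines
    its partial result with the one `stride` places to its right.
    """
    s = list(values)
    n = len(s)
    stride = 1
    depth = 0
    while stride < n:
        # Every entry reads only the previous level, so all updates in a
        # level are independent and could run concurrently.
        s = [op(s[i], s[i + stride]) if i + stride < n else s[i]
             for i in range(n)]
        stride *= 2
        depth += 1
    return s, depth
```

This scheme reaches the minimum depth at the price of more operations than a serial chain of depth N - 1, which is the kind of cost versus delay trade-off the abstract's first algorithm is said to optimize.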
On the relationship between competitive flow and FFT analysis of the flow waves in the left internal mammary artery graft in the process of CABG.

PubMed

Mao, Boyan; Wang, Wenxin; Zhao, Zhou; Zhao, Xi; Li, Lanlan; Zhang, Huixia; Liu, Youjun

2016-12-28

During coronary artery bypass grafting (CABG), the ratio of the powers of the fundamental frequency and its first harmonic (F0/H1) in fast Fourier transform (FFT) analysis of the graft's flow waves has been used to evaluate anastomosis patency. However, there has been no report of using the FFT method to evaluate the magnitude of competitive flow. This study explores the relationship between competitive flow and FFT analysis of the flow waves in the left internal mammary artery (LIMA) graft, and seeks a new method to evaluate the magnitude of competitive flow. First, CABG multiscale models with different degrees of stenosis in the left anterior descending artery (LAD) were established to obtain different magnitudes of competitive flow. Then, the models were computed with ANSYS-CFX to obtain the flow waves in the LIMA. Finally, the flow waves were analyzed by the FFT method and the FFT results were compared with the magnitude of competitive flow. No relationship was found between competitive flow and F0/H1. F0/H2 and F0/H3 both increase as the stenosis in the LAD is reduced. However, the increase in F0/H3 is not pronounced enough to identify significant competitive flow clearly, so it cannot be used as an evaluation index. F0/H2 increases markedly with increasing competitive flow and can identify significant competitive flow. The FFT method can therefore be used to evaluate competitive flow, with F0/H2 as the ideal index: a high F0/H2 indicates significant competitive flow. This method can be used during CABG to avoid the risk of competitive flow.
GPU based cloud system for high-performance arrhythmia detection with parallel k-NN algorithm.

PubMed

Tae Joon Jun; Hyun Ji Park; Hyuk Yoo; Young-Hak Kim; Daeyoung Kim

2016-08-01

In this paper, we propose a GPU-based Cloud system for high-performance arrhythmia detection. The Pan-Tompkins algorithm is used for QRS detection, and we optimized the beat classification algorithm with K-Nearest Neighbor (K-NN). To support high-performance beat classification on the system, we parallelized the beat classification algorithm with CUDA to execute it on virtualized GPU devices on the Cloud system. The MIT-BIH Arrhythmia database is used for validation of the algorithm. The system achieved a detection rate of about 93.5%, which is comparable to previous research, while our algorithm executes 2.5 times faster than a CPU-only detection algorithm.
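For readers unfamiliar with K-NN, the classification step can be sketched in a few lines of serial Python. This is a generic majority-vote K-NN on toy two-dimensional features, not the authors' CUDA kernel or their beat features; `knn_classify` and the labels "N" and "V" are illustrative assumptions.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Each distance is independent of the others, which makes this step
    # a natural candidate for data-parallel execution.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well separated clusters with hypothetical beat labels.
train = [((0.0, 0.0), "N"), ((0.0, 1.0), "N"), ((1.0, 0.0), "N"),
         ((5.0, 5.0), "V"), ((5.0, 6.0), "V"), ((6.0, 5.0), "V")]
```

Calling `knn_classify(train, (0.2, 0.1))` returns the label of the cluster near the origin.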
Evolution of Functional Family Therapy as an Evidence-Based Practice for Adolescents with Disruptive Behavior Problems.

PubMed

Robbins, Michael S; Alexander, James F; Turner, Charles W; Hollimon, Amy

2016-09-01

This article summarizes the evolution of functional family therapy (FFT) based upon four decades of clinical practice and scientific scrutiny through research evidence. FFT research has evolved from an initial focus upon clinical process research, which examined sequential exchanges between therapists and family members. A key element of this research has been an examination of the way in which clinicians acquire, consolidate, and maintain the skills needed to implement FFT effectively with youth and families. Many randomized efficacy and effectiveness studies have evaluated the impact of FFT across diverse clinical populations. Subsequent research investigated factors that influence the effectiveness of implementation across more than 300 clinical settings in which more than 2,500 trained clinicians have provided service to nearly 400,000 families. Another important set of investigations concerned the cost-effectiveness of the interventions.
The Fourier analysis of biological transients.

PubMed

Harris, C M

1998-08-31

With modern computing technology the digital implementation of the Fourier transform is widely available, mostly in the form of the fast Fourier transform (FFT). Although the FFT has become almost synonymous with the Fourier transform, it is a fast numerical technique for computing the discrete Fourier transform (DFT) of a finite sequence of sampled data. The DFT is not directly equivalent to the continuous Fourier transform of the underlying biological signal, which becomes important when analyzing biological transients. Although this distinction is well known by some, for many it leads to confusion in how to interpret the FFT of biological data, and in how to precondition data so as to yield a more accurate Fourier transform using the FFT. We review here the fundamentals of Fourier analysis with emphasis on the analysis of transient signals. As an example of a transient, we consider the human saccade to illustrate the pitfalls and advantages of various Fourier analyses.
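The statement that the FFT is only a fast way of computing the DFT of a finite sampled sequence can be checked directly. The sketch below, with hypothetical function names, evaluates the DFT from its definition and with a recursive radix-2 FFT; the two agree up to rounding error.

```python
import cmath


def dft(x):
    """Discrete Fourier transform straight from the definition, O(N^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft(x):
    """Recursive radix-2 FFT, O(N log N); length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of even-indexed samples
    odd = fft(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```

Both functions operate on a finite list of samples, so neither returns the continuous Fourier transform of the underlying signal; that gap is the subject of the review.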
Parallel Clustering Algorithm for Large-Scale Biological Data Sets

PubMed Central

Wang, Minchao; Zhang, Wu; Ding, Wang; Dai, Dongbo; Zhang, Huiran; Xie, Hao; Chen, Luonan; Guo, Yike; Xie, Jiang

2014-01-01

Background: The recent explosion of biological data poses a great challenge for traditional clustering algorithms. With the increasing scale of data sets, much larger memory and longer runtime are required for cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, the similarity matrix, whose construction takes a long time, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Methods: Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix construction and the affinity propagation algorithm. The shared-memory architecture is used to construct the similarity matrix, and the distributed system is used for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate method of data partition and reduction is designed in order to minimize the global communication cost among processes. Results: A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. PMID:24705246
Applications of New Surrogate Global Optimization Algorithms including Efficient Synchronous and Asynchronous Parallelism for Calibration of Expensive Nonlinear Geophysical Simulation Models.

NASA Astrophysics Data System (ADS)

Shoemaker, C. A.; Pang, M.; Akhtar, T.; Bindel, D.

2016-12-01

New parallel surrogate global optimization algorithms are developed and applied to objective functions that are expensive simulations (possibly with multiple local minima). The algorithms can be applied to most geophysical simulations, including those with nonlinear partial differential equations. The optimization does not require that the simulations be parallelized. Asynchronous (and synchronous) parallel execution is available in the optimization toolbox "pySOT". The parallel algorithms are modified from serial to eliminate fine-grained parallelism. The optimization is computed with the open source software pySOT, a Surrogate Global Optimization Toolbox that allows the user to pick the type of surrogate (or ensembles), the search procedure on the surrogate, and the type of parallelism (synchronous or asynchronous). pySOT also allows the user to develop new algorithms by modifying parts of the code. In the applications here, the objective function takes up to 30 minutes for one simulation, and serial optimization can take over 200 hours. Results from Yellowstone (NSF) and NCSS (Singapore) supercomputers are given for groundwater contaminant hydrology simulations with applications to model parameter estimation and decontamination management. All results are compared with alternatives. The first results are for optimization of pumping at many wells to reduce cost for decontamination of groundwater at a superfund site. The optimization runs with up to 128 processors. Superlinear speedup is obtained for up to 16 processors, and efficiency with 64 processors is over 80%. Each evaluation of the objective function requires the solution of nonlinear partial differential equations to describe the impact of spatially distributed pumping and model parameters on model predictions for the spatial and temporal distribution of groundwater contaminants. The second application uses an asynchronous parallel global optimization for groundwater quality model calibration.
The time for a single objective function evaluation varies unpredictably, so efficiency is improved with asynchronous parallel calculations to improve load balancing. The third application (done at NCSS) incorporates new global surrogate multi-objective parallel search algorithms into pySOT and applies it to a large watershed calibration problem.
Parallel-Processing Test Bed For Simulation Software

NASA Technical Reports Server (NTRS)

Blech, Richard; Cole, Gary; Townsend, Scott

1996-01-01

Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).
Customizing FP-growth algorithm to parallel mining with Charm++ library

NASA Astrophysics Data System (ADS)

Puścian, Marek

2017-08-01

This paper presents a frequent item mining algorithm customized to handle growing data repositories. The proposed solution applies a master-slave scheme to the frequent pattern growth technique. Efficient utilization of available computation units is achieved by dynamic reallocation of tasks. Conditional frequent trees are assigned to parallel workers based on their workload. The proposed enhancements have been successfully implemented using the Charm++ library. This paper discusses the performance of the parallelized FP-growth algorithm on different datasets. The approach is illustrated with many experiments and measurements performed on a multiprocessor, multithreaded computer.
Simulated parallel annealing within a neighborhood for optimization of biomechanical systems.

PubMed

Higginson, J S; Neptune, R R; Anderson, F C

2005-09-01

Optimization problems for biomechanical systems have become extremely complex. Simulated annealing (SA) algorithms have performed well in a variety of test problems and biomechanical applications; however, despite advances in computer speed, convergence to optimal solutions for systems of even moderate complexity has remained prohibitive. The objective of this study was to develop a portable parallel version of a SA algorithm for solving optimization problems in biomechanics. The algorithm for simulated parallel annealing within a neighborhood (SPAN) was designed to minimize interprocessor communication time and closely retain the heuristics of the serial SA algorithm. The computational speed of the SPAN algorithm scaled linearly with the number of processors on different computer platforms for a simple quadratic test problem and for a more complex forward dynamic simulation of human pedaling.
A GPU-paralleled implementation of an enhanced face recognition algorithm

NASA Astrophysics Data System (ADS)

Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo

2013-03-01

Face recognition based on compressed sensing and sparse representation has been widely debated in recent years. This scheme increases the recognition rate as well as noise robustness. However, its computational cost is high and has become a main restricting factor for real-world applications. In this paper, we introduce a GPU-accelerated hybrid variant of the face recognition algorithm, named parallel face recognition algorithm (pFRA). We describe how to carry out a parallel optimization design that takes full advantage of the many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample sizes. Finally, our pFRA, implemented with an NVIDIA GPU and the Compute Unified Device Architecture (CUDA) programming model, achieves a significant speedup over traditional CPU implementations.
Conjugate-Gradient Algorithms For Dynamics Of Manipulators

NASA Technical Reports Server (NTRS)

Fijany, Amir; Scheid, Robert E.

1993-01-01

Algorithms for serial and parallel computation of forward dynamics of multiple-link robotic manipulators by conjugate-gradient method developed. Parallel algorithms have potential for speedup of computations on multiple linked, specialized processors implemented in very-large-scale integrated circuits. Such processors used to simulate dynamics, possibly faster than in real time, for purposes of planning and control.
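The conjugate-gradient method referred to above solves a symmetric positive definite linear system using only matrix-vector products. A generic serial sketch follows; it is not the manipulator-specific algorithm of the brief, `conjugate_gradient` and `matvec` are illustrative names, and the small matrix in the usage lines is only a stand-in for a joint-space inertia matrix.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A (illustrative sketch).

    `matvec(v)` must return the product A v as a list.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter or n):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        # New direction is conjugate to the previous ones.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Stand-in 2 x 2 system; the exact solution is [1/11, 7/11].
A = [[4.0, 1.0], [1.0, 3.0]]
solution = conjugate_gradient(
    lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A],
    [1.0, 2.0])
```

The inner products and vector updates are the parts that distribute naturally across processors.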
Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

NASA Astrophysics Data System (ADS)

Bylaska, Eric J.; Weare, Jonathan Q.; Weare, John H.

2013-08-01

Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, $f$ (e.g., Verlet algorithm), is available to propagate the system from time $t_i$ (trajectory positions and velocities $x_i = (r_i, v_i)$) to time $t_{i+1}$ ($x_{i+1}$) by $x_{i+1} = f_i(x_i)$, the dynamics problem spanning an interval from $t_0 \ldots t_M$ can be transformed into a root finding problem, $F(X) = [x_i - f(x_{i-1})]_{i=1,M} = 0$, for the trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl + 4H2O AIMD simulation at the MP2 level. The maximum speedup (serial execution time/parallel execution time) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations, the algorithms achieved speedups of up to 14.3.
The parallel in time algorithms can be implemented in a distributed computing environment using very slow transmission control protocol/Internet protocol networks. Scripts written in Python that make calls to a precompiled quantum chemistry package (NWChem) are demonstrated to provide an actual speedup of 8.2 for a 2.5 ps AIMD simulation of HCl + 4H2O at the MP2/6-31G* level. Implemented in this way these algorithms can be used for long time high-level AIMD simulations at a modest cost using machines connected by very slow networks such as WiFi, or in different time zones connected by the Internet. The algorithms can also be used with programs that are already parallel. Using these algorithms, we are able to reduce the cost of a MP2/6-311++G(2d,2p) simulation that had reached its maximum possible speedup in the parallelization of the electronic structure calculation from 32 s/time step to 6.9 s/time step.
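The root-finding formulation above can be illustrated with a minimal sketch. It assumes a unit harmonic oscillator propagated by velocity Verlet as the integrator f, and it uses plain fixed-point iteration (X_i <- f(x_{i-1})) rather than the quasi-Newton schemes of the paper; all function names are hypothetical. Within one sweep every application of f is independent, which is the part that could be given one processor per time step.

```python
import numpy as np

def verlet_step(x, dt=0.1):
    """One velocity-Verlet step for a unit harmonic oscillator, x = (r, v)."""
    r, v = x
    a = -r
    r_new = r + v * dt + 0.5 * a * dt * dt
    a_new = -r_new
    v_new = v + 0.5 * (a + a_new) * dt
    return np.array([r_new, v_new])

def serial_trajectory(x0, M):
    """Reference: propagate step by step, x_0 ... x_M."""
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(M):
        traj.append(verlet_step(traj[-1]))
    return np.array(traj)

def residual(X, x0):
    """F(X)_i = x_i - f(x_{i-1}) for i = 1..M, with x_0 held fixed."""
    prev = np.vstack([x0, X[:-1]])
    # Each row is an independent integrator call (parallelizable over i).
    return X - np.array([verlet_step(p) for p in prev])

def fixed_point_solve(x0, M, tol=1e-12):
    """Drive F(X) to zero by repeated sweeps; exact after at most M sweeps."""
    X = np.tile(np.asarray(x0, dtype=float), (M, 1))  # crude initial guess
    for sweep in range(1, M + 1):
        F = residual(X, x0)
        if np.max(np.abs(F)) < tol:
            return X, sweep - 1
        X = X - F  # equivalent to X_i <- f(x_{i-1})
    return X, M
```

With this naive update the number of sweeps can equal the number of time steps, so it gives no speedup by itself; the abstract attributes the reported speedups to quasi-Newton and preconditioned schemes that converge in fewer sweeps.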

A parallel algorithm for generation and assembly of finite element stiffness and mass matrices

NASA Technical Reports Server (NTRS)

Storaasli, O. O.; Carmona, E. A.; Nguyen, D. T.; Baddourah, M. A.

1991-01-01

A new algorithm is proposed for parallel generation and assembly of the finite element stiffness and mass matrices. The proposed assembly algorithm is based on a node-by-node approach rather than the more conventional element-by-element approach. The new algorithm's generality and computation speed-up when using multiple processors are demonstrated for several practical applications on multi-processor Cray Y-MP and Cray 2 supercomputers.
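The difference between the two assembly orders can be sketched on a toy problem. The example below assumes a chain of two-node spring (axial bar) elements and hypothetical function names; it is not the authors' implementation, only an illustration of why a node-by-node loop has independent iterations: each node gathers its own row, so no two iterations write to the same location.

```python
import numpy as np

def element_stiffness(k):
    """2x2 stiffness matrix of a two-node spring element with stiffness k."""
    return k * np.array([[1.0, -1.0], [-1.0, 1.0]])

def assemble_by_element(n_nodes, elements, ks):
    """Conventional assembly: each element scatters into shared rows."""
    K = np.zeros((n_nodes, n_nodes))
    for (a, b), k in zip(elements, ks):
        ke = element_stiffness(k)
        for i_loc, i in enumerate((a, b)):
            for j_loc, j in enumerate((a, b)):
                K[i, j] += ke[i_loc, j_loc]  # neighbouring elements share entries
    return K

def assemble_row(node, n_nodes, elements, ks):
    """Row owned by `node`: gather contributions of the elements touching it."""
    row = np.zeros(n_nodes)
    for (a, b), k in zip(elements, ks):
        if node in (a, b):
            ke = element_stiffness(k)
            i_loc = (a, b).index(node)
            for j_loc, j in enumerate((a, b)):
                row[j] += ke[i_loc, j_loc]
    return row

def assemble_by_node(n_nodes, elements, ks):
    """Node-by-node assembly: rows are independent and could be distributed."""
    return np.vstack([assemble_row(n, n_nodes, elements, ks)
                      for n in range(n_nodes)])
```

The price of the gather form in this sketch is that an element's stiffness is recomputed once per node it touches.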
Parallel Implementation of the Wideband DOA Algorithm on the IBM Cell BE Processor

DTIC Science & Technology

2010-05-01

Abstract: The Multiple Signal Classification (MUSIC) algorithm is a powerful technique for determining the Direction of Arrival (DOA) of signals...Broadband Engine Processor (Cell BE). The process of adapting the serial based MUSIC algorithm to the Cell BE will be analyzed in terms of parallelism and...using Multiple Signal Classification MUSIC algorithm [4] • Computation of Focus matrix • Computation of number of sources • Separation of Signal
On Parallel Push-Relabel based Algorithms for Bipartite Maximum Matching

DOE Office of Scientific and Technical Information (OSTI.GOV)

Langguth, Johannes; Azad, Md Ariful; Halappanavar, Mahantesh

2014-07-01

We study multithreaded push-relabel based algorithms for computing maximum cardinality matching in bipartite graphs. Matching is a fundamental combinatorial (graph) problem with applications in a wide variety of problems in science and engineering. We are motivated by its use in the context of sparse linear solvers for computing maximum transversal of a matrix. We implement and test our algorithms on several multi-socket multicore systems and compare their performance to state-of-the-art augmenting path-based serial and parallel algorithms using a test set comprising a wide range of real-world instances. Building on several heuristics for enhancing performance, we demonstrate good scaling for the parallel push-relabel algorithm. We show that it is comparable to the best augmenting path-based algorithms for bipartite matching. To the best of our knowledge, this is the first extensive study of multithreaded push-relabel based algorithms. In addition to a direct impact on the applications using matching, the proposed algorithmic techniques can be extended to preflow-push based algorithms for computing maximum flow in graphs.
Implementation of a parallel protein structure alignment service on cloud.

PubMed

Hung, Che-Lun; Lin, Yaw-Ling

2013-01-01

Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform.
Implementation of a Parallel Protein Structure Alignment Service on Cloud

PubMed Central

Hung, Che-Lun; Lin, Yaw-Ling

2013-01-01

Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform. PMID:23671842
3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran

2017-03-01

In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented for a state-of-the-art multicore CPU-based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand for compute time, memory, storage and I/O, along with the need to manage them effectively. The most resource-intensive modules of the algorithm are the traveltime calculations and the migration summation, which exhibit an inherent trade-off between compute time and other resources. The parallelization strategy of the algorithm largely depends on how the calculated traveltimes are stored and fed to the migration process. The presented work is an extension of our previous work, in which a 3D Kirchhoff depth migration application for a multicore CPU-based parallel system was developed. Recently, we have worked on improving the parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable of efficiently migrating both prestack and poststack 3D data. It can migrate a large number of traces within the available node memory and with minimal requirements for storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. The parallel performance of the algorithm is studied using different numerical experiments, and the scalability results show a striking improvement over the previous version. A 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and a 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm, with high scalability and efficiency on a multicore CPU cluster.
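The efficiency figures quoted in this abstract follow from the usual definition, efficiency = speedup / number of processing units, taking the 64 nodes as the unit count. A short check, with a hypothetical helper name:

```python
def parallel_efficiency(speedup, n_units):
    """Fraction of ideal linear speedup achieved on n_units."""
    return speedup / n_units

# Speedups reported in the abstract for 64 nodes.
for label, speedup in (("prestack", 49.05), ("poststack", 32.00)):
    print(f"{label}: {100 * parallel_efficiency(speedup, 64):.2f}%")
# prints "prestack: 76.64%" and "poststack: 50.00%"
```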
A new method for gravity field recovery based on frequency analysis of spherical harmonics

NASA Astrophysics Data System (ADS)

Cai, Lin; Zhou, Zebing

2017-04-01

Most existing methods for gravity field recovery are based on the space-wise and time-wise approaches, whose core processes are constructing the observation equations and solving them by the least squares method. It should be pointed out that the least squares method implies an approximation. On the other hand, we can directly and precisely obtain the coefficients of harmonics by computing the Fast Fourier Transform (FFT) when we do 1-D data (time series) analysis. So the question of whether we can directly and precisely obtain the coefficients of spherical harmonics by computing the 2-D FFT of measurements from a satellite gravity mission is of great significance, since this may guide us to a new understanding of the signal components of the gravity field and allow us to determine it quickly by taking advantage of the FFT. As in 1-D data analysis, the 2-D FFT of satellite measurements can be computed rapidly. If we can determine the relationship between spherical harmonics and 2-D Fourier frequencies and the transfer function from measurements to spherical coefficients, the question mentioned above can be solved. So the objective of this research project is to establish a new method based on frequency analysis of spherical harmonics, which directly computes the coefficients of the spherical harmonics of the gravity field and thus differs from recovery by least squares. There is a one-to-one correspondence between the frequency spectrum and the time series in the 1-D FFT. The 2-D FFT has a similar relationship to the 1-D FFT. However, any degree or order (higher than one) of a spherical function has multiple frequencies, and these frequencies may be aliased. Fortunately, the elements and ratios of these frequencies can be determined, so we can compute the coefficients of the spherical function from the 2-D FFT. This relationship can be written as a set of equations, equivalent to a matrix, which is fixed and can be derived in advance. The relationship has now been determined.
Some preliminary results, which only compute lower degree spherical harmonics, indicate that the difference between the input (EGM2008) and the output (recovered coefficients) is smaller than 5E-17, while the machine precision of the software used (MATLAB) is 2.2204E-16.
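The 1-D case that the abstract takes as its starting point can be sketched as follows: for a uniformly sampled series covering whole periods, the FFT returns the cosine and sine coefficients directly, with no least-squares fit. The coefficient values are made up for illustration, and the 2-D spherical-harmonic relationship of the abstract is not reproduced here.

```python
import numpy as np

N = 64
t = np.arange(N) / N                 # one full period, uniformly sampled

# Hypothetical harmonic coefficients, keyed by frequency index.
cos_in = {1: 0.5, 3: -1.25}
sin_in = {2: 2.0, 3: 0.75}

signal = (sum(c * np.cos(2 * np.pi * k * t) for k, c in cos_in.items())
          + sum(s * np.sin(2 * np.pi * k * t) for k, s in sin_in.items()))

spectrum = np.fft.rfft(signal)

# For 0 < k < N/2, bin k equals (N/2) * (a_k - 1j * b_k).
cos_out = 2.0 * spectrum.real / N
sin_out = -2.0 * spectrum.imag / N
```

The recovered values agree with the inputs up to floating-point rounding, which is the sense in which the FFT route is direct.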
Recordings of mucociliary activity in vivo: benefit of fast Fourier transformation of the photoelectric signal.

PubMed

Lindberg, S; Cervin, A; Runer, T; Thomasson, L

1996-09-01

Investigations of mucociliary activity in vivo are based on photoelectric recordings of light reflections from the mucosa. The alterations in light intensity produced by the beating cilia are picked up by a photodetector and converted to photoelectric signals. The optimal processing of these signals is not known, but in vitro recordings have been reported to benefit from fast Fourier transformation (FFT) of the signal. The aim of the investigation was to study the effect of FFT for frequency analysis of photoelectric signals originating from an artificial light source simulating mucociliary activity or from sinus or nasal mucosa in vivo, as compared to a conventional method of calculating mucociliary wave frequency, in which each peak in the signal is interpreted as a beat (old method). In the experiments with the artificial light source, the FFT system was superior to the conventional method by a factor of 50 in detecting weak signals. By using FFT signal processing, frequency could be correctly calculated in experiments with a compound signal. In experiments in the rabbit maxillary sinus, the spontaneous variations were greater when signals were processed by FFT. The correlation between the two methods was excellent: r = .92. The increase in mucociliary activity in response to the ciliary stimulant methacholine at a dosage of 0.5 microgram/kg was greater measured with the FFT than with the old method (55.3% +/- 8.3% versus 43.0% +/- 8.2%, p < .05, N = 8), and only with the FFT system could a significant effect of a threshold dose (0.05 microgram/kg) of methacholine be detected. In the human nose, recordings from aluminum foil placed on the nasal dorsum and from the nasal septa mucosa displayed some similarities in the lower frequency spectrum (< 5 Hz) attributable to artifacts. 
The predominant cause of these artifacts was the pulse beat, whereas in the frequency spectrum above 5 Hz, results differed for the two sources of reflected light, the mean frequency in seven healthy volunteers being 7.8 +/- 1.6 Hz for the human nasal mucosa. It is concluded that the FFT system has greater sensitivity in detecting photoelectric signals derived from the mucociliary system, and that it is also a useful tool for analyzing the contributions of artifacts to the signal.
NETRA: A parallel architecture for integrated vision systems. 1: Architecture and organization

NASA Technical Reports Server (NTRS)

Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra

1989-01-01

Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is considered to be a system that uses vision algorithms from all levels of processing for a high level application (such as object recognition). A model of computation is presented for parallel processing for an IVS. Using the model, desired features and capabilities of a parallel architecture suitable for IVSs are derived. Then a multiprocessor architecture (called NETRA) is presented. This architecture is highly flexible without the use of complex interconnection schemes. The topology of NETRA is recursively defined and hence is easily scalable from small to large systems. Homogeneity of NETRA permits fault tolerance and graceful degradation under faults. It is a recursively defined tree-type hierarchical architecture where each of the leaf nodes consists of a cluster of processors connected with a programmable crossbar with selective broadcast capability to provide for desired flexibility. A qualitative evaluation of NETRA is presented. Then general schemes are described to map parallel algorithms onto NETRA. Algorithms are classified according to their communication requirements for parallel processing. An extensive analysis of inter-cluster communication strategies in NETRA is presented, and parameters affecting performance of parallel algorithms when mapped on NETRA are discussed. Finally, a methodology to evaluate performance of algorithms on NETRA is described.
Applications of Parallel Computation in Micro-Mechanics and Finite Element Method

NASA Technical Reports Server (NTRS)

Tan, Hui-Qian

1996-01-01

This project discusses the application of parallel computation to material analyses. Briefly, we analyze a material by element-wise computations. We call an element a cell here. A cell is divided into a number of subelements called subcells, and all subcells in a cell have an identical structure. The detailed structure will be given later in this paper. The problem is clearly "well-structured", so a SIMD machine would be a better choice. In this paper we try to look into the potential of SIMD machines for finite element computation by developing appropriate algorithms on MasPar, a SIMD parallel machine. In section 2, the architecture of MasPar will be discussed. A brief review of the parallel programming language MPL is also given in that section. In section 3, some general parallel algorithms which might be useful to the project will be proposed, and, in combination with the algorithms, some features of MPL will be discussed in more detail. In section 4, the computational structure of the cell/subcell model will be given, and the idea behind the design of the parallel algorithm for the model will be demonstrated. Finally, in section 5, a summary will be given.
Eigensolution of finite element problems in a completely connected parallel architecture

NASA Technical Reports Server (NTRS)

Akl, F.; Morel, M.

1989-01-01

A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis. The algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm is successfully implemented on a tightly coupled MIMD parallel processor. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm is investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18, and 3.61 are achieved on two, four, six, and eight processors, respectively.
Data decomposition method for parallel polygon rasterization considering load balancing

NASA Astrophysics Data System (ADS)

Zhou, Chen; Chen, Zhenjie; Liu, Yongxue; Li, Feixue; Cheng, Liang; Zhu, A.-xing; Li, Manchun

2015-12-01

It is essential to adopt parallel computing technology to rapidly rasterize massive polygon data. In parallel rasterization, it is difficult to design an effective data decomposition method. Conventional methods ignore load balancing of polygon complexity in parallel rasterization and thus fail to achieve high parallel efficiency. In this paper, a novel data decomposition method based on polygon complexity (DMPC) is proposed. First, four factors that possibly affect the rasterization efficiency were investigated. Then, a metric represented by the boundary number and raster pixel number in the minimum bounding rectangle was developed to calculate the complexity of each polygon. Using this metric, polygons were rationally allocated according to the polygon complexity, and each process could achieve balanced loads of polygon complexity. To validate the efficiency of DMPC, it was used to parallelize different polygon rasterization algorithms and tested on different datasets. Experimental results showed that DMPC could effectively parallelize polygon rasterization algorithms. Furthermore, the implemented parallel algorithms with DMPC could achieve good speedup ratios of at least 15.69 and generally outperformed conventional decomposition methods in terms of parallel efficiency and load balancing. In addition, the results showed that DMPC exhibited consistently better performance for different spatial distributions of polygons.
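The allocation idea can be sketched as a greedy assignment of polygons to the currently least-loaded process, heaviest polygon first. The complexity formula below (a weighted sum of boundary point count and bounding-rectangle pixel count) and its weights are assumptions made for illustration, since the abstract names the two ingredients but not how they are combined; the function names are hypothetical.

```python
import heapq

def polygon_complexity(n_boundary_points, n_mbr_pixels,
                       w_boundary=1.0, w_pixels=0.01):
    """Assumed cost model: weighted sum of the two factors in the abstract."""
    return w_boundary * n_boundary_points + w_pixels * n_mbr_pixels

def allocate(polygons, n_procs):
    """Greedy balancing: heaviest polygon goes to the least-loaded process.

    polygons is a list of (n_boundary_points, n_mbr_pixels) tuples.
    Returns a list of (load, process_id, polygon_indices), one per process.
    """
    heap = [(0.0, p, []) for p in range(n_procs)]
    heapq.heapify(heap)
    scored = sorted(((polygon_complexity(*poly), idx)
                     for idx, poly in enumerate(polygons)), reverse=True)
    for cost, idx in scored:
        load, p, members = heapq.heappop(heap)   # least-loaded process
        members.append(idx)
        heapq.heappush(heap, (load + cost, p, members))
    return sorted(heap, key=lambda item: item[1])
```

Splitting the same polygons into equal-count batches would ignore the fact that one large polygon can cost as much as many small ones, which is the imbalance the abstract attributes to conventional methods.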
A Massively Parallel Computational Method of Reading Index Files for SOAPsnv.

PubMed

Zhu, Xiaoqian; Peng, Shaoliang; Liu, Shaojie; Cui, Yingbo; Gu, Xiang; Gao, Ming; Fang, Lin; Fang, Xiaodong

2015-12-01

SOAPsnv is software used for identifying single nucleotide variations in cancer genes. However, its performance has yet to match the massive amount of data to be processed. Experiments reveal that the main performance bottleneck of SOAPsnv is the pileup algorithm. The original pileup algorithm's I/O process is time-consuming and reads input files inefficiently. Moreover, the scalability of the pileup algorithm is poor. Therefore, we designed a new algorithm, named BamPileup, aiming to improve the performance of sequential reads, and the new pileup algorithm implements a parallel read mode based on an index. Using this method, each thread can directly read the data starting from a specific position. The results of experiments on the Tianhe-2 supercomputer show that, when reading data in a multi-threaded parallel I/O way, the processing time of the algorithm is reduced to 3.9 s and the application program can achieve a speedup of up to 100×. Moreover, the scalability of the new algorithm is also satisfactory.
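The index-based read mode can be sketched generically: an index of byte offsets lets every thread seek straight to its own slice instead of scanning from the start of the file. The example assumes a plain line-oriented text file rather than the BAM format, and every function name is hypothetical; it is not the BamPileup code.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_demo_file(n_lines):
    """Create a temporary line-oriented file standing in for a large input."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
        for i in range(n_lines):
            fh.write(f"record-{i}\n")
        return fh.name

def build_index(path, n_chunks):
    """Byte offsets that split the file into chunks starting on record boundaries."""
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as fh:
        for i in range(1, n_chunks):
            fh.seek(i * size // n_chunks)
            fh.readline()                 # advance to the next record start
            pos = fh.tell()
            if offsets[-1] < pos < size:
                offsets.append(pos)
    offsets.append(size)
    return offsets

def count_records(path, start, end):
    """Worker: open the file independently and read only [start, end)."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return len(fh.read(end - start).splitlines())

def parallel_count(path, n_threads=4):
    offsets = build_index(path, n_threads)
    spans = zip(offsets[:-1], offsets[1:])
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(lambda se: count_records(path, *se), spans))
```

Each worker opens its own file handle, so no seek position is shared between threads.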
Optical ranging and communication method based on all-phase FFT

NASA Astrophysics Data System (ADS)

Li, Zening; Chen, Gang

2014-10-01

This paper describes an optical ranging and communication method based on the all-phase fast Fourier transform (FFT). This kind of system is mainly designed for vehicle safety applications. In particular, the phase shift of the reflected orthogonal frequency division multiplexing (OFDM) symbol is measured to determine the signal time of flight. The distance is then calculated from the time of flight. Several key factors affecting the phase measurement accuracy are studied. The all-phase FFT, which can reduce the effects of frequency offset, phase noise and inter-carrier interference (ICI), is applied to measure the OFDM symbol phase shift.
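The chain from phase shift to distance can be written out as a short sketch. It assumes a single subcarrier of known frequency, a round-trip path (so the one-way distance is half the travelled path), and a phase shift below one full cycle so that no ambiguity arises; the function name and the example values are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def distance_from_phase(phase_shift_rad, subcarrier_hz):
    """One-way distance from the phase shift of a reflected subcarrier."""
    time_of_flight = phase_shift_rad / (2.0 * math.pi * subcarrier_hz)
    return SPEED_OF_LIGHT * time_of_flight / 2.0  # halve the round trip

# Example: a quarter-cycle shift on a 10 MHz subcarrier is a 25 ns round trip.
d = distance_from_phase(math.pi / 2, 10e6)
```

Because distance scales linearly with the measured phase, any phase error maps directly into a range error, which is why the abstract focuses on phase measurement accuracy.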
Doppler lidar power, aperture diameter, and FFT size trade-off study

NASA Astrophysics Data System (ADS)

Chester, David B.; Budge, Scott E.

2017-05-01

In the design or selection of a Doppler lidar instrument for a spacecraft landing system, it is important to evaluate the balance between performance requirements and cost, weight, and power consumption. Leveraging the capability of LadarSIM, a trade-off study was performed to evaluate the interaction between the laser transmission power, aperture diameter, and FFT size in a Doppler lidar system. For this study the probabilities of detection and false alarm were calculated using LadarSIM to simulate FMCW lidar systems with varying power, aperture diameter, and FFT size. This paper reports the results of this trade-off study.
A simplified implementation of edge detection in MATLAB is faster and more sensitive than fast fourier transform for actin fiber alignment quantification.

PubMed

Kemeny, Steven Frank; Clyne, Alisa Morss

2011-04-01

Fiber alignment plays a critical role in the structure and function of cells and tissues. While fiber alignment quantification is important to experimental analysis and several different methods for quantifying fiber alignment exist, many studies focus on qualitative rather than quantitative analysis perhaps due to the complexity of current fiber alignment methods. Speed and sensitivity were compared in edge detection and fast Fourier transform (FFT) for measuring actin fiber alignment in cells exposed to shear stress. While edge detection using matrix multiplication was consistently more sensitive than FFT, image processing time was significantly longer. However, when MATLAB functions were used to implement edge detection, MATLAB's efficient element-by-element calculations and fast filtering techniques reduced computation cost 100 times compared to the matrix multiplication edge detection method. The new computation time was comparable to the FFT method, and MATLAB edge detection produced well-distributed fiber angle distributions that statistically distinguished aligned and unaligned fibers in half as many sample images. When the FFT sensitivity was improved by dividing images into smaller subsections, processing time grew larger than the time required for MATLAB edge detection. Implementation of edge detection in MATLAB is simpler, faster, and more sensitive than FFT for fiber alignment quantification.
Extraluminal venous interruption for free-floating thrombus in the deep veins of lower limbs.

PubMed

Casian, D; Gutsu, E; Culiuc, V

2010-01-01

The free-floating thrombus (FFT) represents a particular form of deep vein thrombosis with an extremely high potential for fatal pulmonary embolism. The purpose of the study was to evaluate the early results of an aggressive surgical approach to FFT. During the period 2005-2008, FFT was diagnosed in 13 patients. Demographic characteristics of patients: medium age--54.7 years, male--76.9%, significant comorbidity--5 (38.5%) cases. Localization of FFT: superficial femoral vein (SFV)--5 (38.5%), common femoral vein (CFV)--4 (30.7%), external iliac vein (EIV)--2 (15.4%), inferior cava vein (ICV)--2 (15.4%). Manifestations of previous pulmonary embolism were documented preoperatively in 3 (23.1%) cases. The following emergency surgical procedures were performed: ligation--3 (23.1%) or plication--2 (15.4%) of SFV; plication of CFV--5 (38.5%) patients, combined in 4 cases with partial thrombectomy (free-floating part of thrombus); plication of common iliac vein--1 (7.6%); plication of ICV--2 (15.4%) cases. Primary or recurrent cases of clinically significant pulmonary embolism were not detected in the postoperative period. The accumulated experience of surgical management of patients with FFT reveals the important role of deep vein ligation/plication in prevention of fatal pulmonary embolism.
A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images.

PubMed

Du, Xiaogang; Dang, Jianwu; Wang, Yangping; Wang, Song; Lei, Tao

2016-01-01

The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps including B-splines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU).
A scalable parallel algorithm for multiple objective linear programs

NASA Technical Reports Server (NTRS)

Wiecek, Malgorzata M.; Zhang, Hong

1994-01-01

This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating the efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives especially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.
Integrated Network Decompositions and Dynamic Programming for Graph Optimization (INDDGO)

DOE Office of Scientific and Technical Information (OSTI.GOV)

The INDDGO software package offers a set of tools for finding exact solutions to graph optimization problems via tree decompositions and dynamic programming algorithms. Currently the framework offers serial and parallel (distributed memory) algorithms for finding tree decompositions and solving the maximum weighted independent set problem. The parallel dynamic programming algorithm is implemented on top of the MADNESS task-based runtime.

Automated Handling of Garments for Pressing

DTIC Science & Technology

1991-09-30

Parallel Algorithms for 2D Kalman Filtering ... 47 D.J. Potter and M.P. Cline Hash Table and Sorted Array: A Case Study of... Kalman Filtering on the Connection Machine ... 55 M.A. Palis and D.K. Krecker Parallel Sorting of Large Arrays on the MasPar...ALGORITHMS FOR SEAM SENSING ... 24 6.1 KarelTW Algorithms ... 24 6.1.1 Image Filtering
Efficient Scalable Median Filtering Using Histogram-Based Operations.

PubMed

Green, Oded

2018-05-01

Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted values. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow. This makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is non-sorting-based. The new algorithm uses efficient histogram-based operations. These reduce the computational requirements of the new algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation has near perfect linear scaling with a speedup on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, comparison-based approaches are preferable as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.
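A non-sorting median filter can be sketched with a sliding histogram: as the window moves one pixel to the right, the histogram drops the column that leaves and adds the column that enters, and the median is read off the cumulative counts. This is the classic serial sliding-window idea for 8-bit images, offered only as an illustration of histogram-based operations; it is not the parallel algorithm of the paper, and the function names are hypothetical.

```python
import numpy as np

def median_from_histogram(hist, rank):
    """Smallest value whose cumulative count exceeds the 0-based rank."""
    total = 0
    for value, count in enumerate(hist):
        total += count
        if total > rank:
            return value
    raise ValueError("rank exceeds histogram population")

def histogram_median_filter(image, radius):
    """Median filter for a 2-D uint8 image using a sliding 256-bin histogram."""
    img = np.asarray(image, dtype=np.uint8)
    padded = np.pad(img, radius, mode="edge")   # replicate border pixels
    h, w = img.shape
    size = 2 * radius + 1
    rank = (size * size) // 2                   # middle of an odd-sized window
    out = np.empty_like(img)
    for y in range(h):
        rows = padded[y:y + size]
        hist = np.bincount(rows[:, :size].ravel(), minlength=256)
        out[y, 0] = median_from_histogram(hist, rank)
        for x in range(1, w):
            hist -= np.bincount(rows[:, x - 1], minlength=256)         # leaving column
            hist += np.bincount(rows[:, x + size - 1], minlength=256)  # entering column
            out[y, x] = median_from_histogram(hist, rank)
    return out
```

The per-pixel update touches 2 * size values instead of re-sorting size * size values, which is why the cost grows more slowly with window size than a sorting-based filter.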
A fast sorting algorithm for a hypersonic rarefied flow particle simulation on the connection machine

NASA Technical Reports Server (NTRS)

Dagum, Leonardo

1989-01-01

The data parallel implementation of a particle simulation for hypersonic rarefied flow described by Dagum associates a single parallel data element with each particle in the simulation. The simulated space is divided into discrete regions called cells containing a variable and constantly changing number of particles. The implementation requires a global sort of the parallel data elements so as to arrange them in an order that allows immediate access to the information associated with cells in the simulation. Described here is a very fast algorithm for performing the necessary ranking of the parallel data elements. The performance of the new algorithm is compared with that of the microcoded instruction for ranking on the Connection Machine.
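What the ranking step has to produce can be sketched in a few lines: a permutation that groups particles by cell, plus per-cell start offsets so that the particles of any cell form one contiguous slice. The sketch uses NumPy's stable sort and a prefix sum in place of the Connection Machine algorithm described in the report; the function name is hypothetical.

```python
import numpy as np

def rank_by_cell(cell_ids, n_cells):
    """Return (order, starts, counts) grouping particle indices by cell.

    order[starts[c] : starts[c] + counts[c]] lists the particles in cell c.
    """
    cell_ids = np.asarray(cell_ids)
    counts = np.bincount(cell_ids, minlength=n_cells)        # particles per cell
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))   # exclusive prefix sum
    order = np.argsort(cell_ids, kind="stable")              # keeps in-cell order
    return order, starts, counts

# Six particles spread over three cells.
order, starts, counts = rank_by_cell([2, 0, 1, 2, 0, 2], 3)
```

Because particles move between cells at every time step, this grouping has to be recomputed each step, which is why its speed matters for the simulation as a whole.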
Parallelization of sequential Gaussian, indicator and direct simulation algorithms

NASA Astrophysics Data System (ADS)

Nunes, Ruben; Almeida, José A.

2010-08-01

Improving the performance and robustness of algorithms on new high-performance parallel computing architectures is a key issue in efficiently performing 2D and 3D studies with large amount of data. In geostatistics, sequential simulation algorithms are good candidates for parallelization. When compared with other computational applications in geosciences (such as fluid flow simulators), sequential simulation software is not extremely computationally intensive, but parallelization can make it more efficient and creates alternatives for its integration in inverse modelling approaches. This paper describes the implementation and benchmarking of a parallel version of the three classic sequential simulation algorithms: direct sequential simulation (DSS), sequential indicator simulation (SIS) and sequential Gaussian simulation (SGS). For this purpose, the source used was GSLIB, but the entire code was extensively modified to take into account the parallelization approach and was also rewritten in the C programming language. The paper also explains in detail the parallelization strategy and the main modifications. Regarding the integration of secondary information, the DSS algorithm is able to perform simple kriging with local means, kriging with an external drift and collocated cokriging with both local and global correlations. SIS includes a local correction of probabilities. Finally, a brief comparison is presented of simulation results using one, two and four processors. All performance tests were carried out on 2D soil data samples. The source code is completely open source and easy to read. It should be noted that the code is only fully compatible with Microsoft Visual C and should be adapted for other systems/compilers.
Three-dimensional photoacoustic tomography based on graphics-processing-unit-accelerated finite element method.

PubMed

Peng, Kuan; He, Ling; Zhu, Ziqiang; Tang, Jingtian; Xiao, Jiaying

2013-12-01

Compared with commonly used analytical reconstruction methods, the frequency-domain finite element method (FEM) based approach has proven to be an accurate and flexible algorithm for photoacoustic tomography. However, the FEM-based algorithm is computationally demanding, especially for three-dimensional cases. To enhance the algorithm's efficiency, in this work a parallel computational strategy is implemented in the framework of the FEM-based reconstruction algorithm using a graphic-processing-unit parallel frame named the "compute unified device architecture." A series of simulation experiments is carried out to test the accuracy and accelerating effect of the improved method. The results obtained indicate that the parallel calculation does not change the accuracy of the reconstruction algorithm, while its computational cost is significantly reduced by a factor of 38.9 with a GTX 580 graphics card using the improved method.
Parallel Algorithms for Image Analysis.

DTIC Science & Technology

1982-06-01

Report documentation page (fragment). Title: Parallel Algorithms for Image Analysis. Type of report: Technical. Performing organization report number: TR-1180. Author: Azriel Rosenfeld. Contract or grant number: AFOSR-77-3271. Key words: image processing; image analysis; parallel processing; cellular computers.
Efficient parallel resolution of the simplified transport equations in mixed-dual formulation

NASA Astrophysics Data System (ADS)

Barrault, M.; Lathuilière, B.; Ramet, P.; Roman, J.

2011-03-01

A reactivity computation consists of computing the highest eigenvalue of a generalized eigenvalue problem, for which an inverse power algorithm is commonly used. Very fine models are difficult for our sequential solver, based on the simplified transport equations, to handle in terms of memory consumption and computational time. A first implementation of a Lagrangian-based domain decomposition method leads to poor parallel efficiency because of an increase in the number of power iterations [1]. In order to obtain a high parallel efficiency, we improve the parallelization scheme by changing the location of the loop over the subdomains in the overall algorithm and by benefiting from the characteristics of the Raviart-Thomas finite element. The new parallel algorithm still allows us to locally adapt the numerical scheme (mesh, finite element order). However, it can be significantly optimized for the matching grid case. The good behavior of the new parallelization scheme is demonstrated for the matching grid case on several hundreds of nodes for computations based on a pin-by-pin discretization.
Scalable Domain Decomposed Monte Carlo Particle Transport

NASA Astrophysics Data System (ADS)

O'Brien, Matthew Joseph

In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation. The main algorithms we consider are: • Domain decomposition of constructive solid geometry: enables extremely large calculations in which the background geometry is too large to fit in the memory of a single computational node. • Load Balancing: keeps the workload per processor as even as possible so the calculation runs efficiently. • Global Particle Find: if particles are on the wrong processor, globally resolve their locations to the correct processor based on particle coordinate and background domain. • Visualizing constructive solid geometry, sourcing particles, deciding that particle streaming communication is completed and spatial redecomposition. These algorithms are some of the most important parallel algorithms required for domain decomposed Monte Carlo particle transport. We demonstrate that our previous algorithms were not scalable, prove that our new algorithms are scalable, and run some of the algorithms up to 2 million MPI processes on the Sequoia supercomputer.
Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube

NASA Technical Reports Server (NTRS)

Joslin, Ronald D.; Zubair, Mohammad

1993-01-01

The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors, nearly ideal linear speedups are achieved with nonoptimized routines; slower than linear speedups are achieved with optimized (machine-dependent library) routines. This slower than linear speedup results because the Fast Fourier Transform (FFT) routine dominates the computational cost and because the routine exhibits less than ideal speedups. However, with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise, wall-normal, and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single processor time to complete a comparable simulation; however, it is estimated that a subgrid-scale model which reduces the required number of grid points and becomes a large-eddy simulation (PSLES) would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.
Field Programmable Gate Array Based Parallel Strapdown Algorithm Design for Strapdown Inertial Navigation Systems

PubMed Central

Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

2011-01-01

A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on a single-speed structure: all computations are executed at a single updating rate that is sufficiently high to accurately account for high-frequency angular rate and acceleration rectification effects. Unlike in existing algorithms, the updating rates of the coning and sculling compensations are unrelated to the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows the updating rate of the coning and sculling compensation to be increased while using more gyro incremental angle and accelerometer incremental velocity samples, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm, relative to the existing implementation on the DSP platform, to meet the real-time and high-precision requirements of the system in a highly dynamic environment. PMID:22164058
Scalable Domain Decomposed Monte Carlo Particle Transport

DOE Office of Scientific and Technical Information (OSTI.GOV)

O'Brien, Matthew Joseph

2013-12-05

In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation.
Multi-GPU parallel algorithm design and analysis for improved inversion of probability tomography with gravity gradiometry data

NASA Astrophysics Data System (ADS)

Hou, Zhenlong; Huang, Danian

2017-09-01

In this paper, we first study the inversion of probability tomography (IPT) with gravity gradiometry data. The spatial resolution of the results is improved by multi-tensor joint inversion, a depth weighting matrix and other methods. To address the problems posed by big data in exploration, we present a parallel algorithm and a performance analysis that combine Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) on the basis of Graphics Processing Unit (GPU) acceleration. In tests on the synthetic model and real data from Vinton Dome, we obtain improved results, which shows that the improved inversion algorithm is effective and feasible. The parallel algorithm we designed performs better than the other CUDA-based ones. The maximum speedup can exceed 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are applied to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is shown to be able to process larger-scale data, and the new analysis method is practical.
Algorithms and programming tools for image processing on the MPP, part 2

NASA Technical Reports Server (NTRS)

Reeves, Anthony P.

1986-01-01

A number of algorithms were developed for image warping and pyramid image filtering. Techniques were investigated for the parallel processing of a large number of independent irregular shaped regions on the MPP. In addition some utilities for dealing with very long vectors and for sorting were developed. Documentation pages for the algorithms which are available for distribution are given. The performance of the MPP for a number of basic data manipulations was determined. From these results it is possible to predict the efficiency of the MPP for a number of algorithms and applications. The Parallel Pascal development system, which is a portable programming environment for the MPP, was improved and better documentation including a tutorial was written. This environment allows programs for the MPP to be developed on any conventional computer system; it consists of a set of system programs and a library of general purpose Parallel Pascal functions. The algorithms were tested on the MPP and a presentation on the development system was made to the MPP users group. The UNIX version of the Parallel Pascal System was distributed to a number of new sites.
Research of the effectiveness of parallel multithreaded realizations of interpolation methods for scaling raster images

NASA Astrophysics Data System (ADS)

Vnukov, A. A.; Shershnev, M. B.

2018-01-01

The aim of this work is the software implementation of three image scaling algorithms using parallel computations, as well as the development of an application with a graphical user interface for the Windows operating system to demonstrate the operation of algorithms and to study the relationship between system performance, algorithm execution time and the degree of parallelization of computations. Three methods of interpolation were studied, formalized and adapted to scale images. The result of the work is a program for scaling images by different methods. Comparison of the quality of scaling by different methods is given.
On the impact of communication complexity on the design of parallel numerical algorithms

NASA Technical Reports Server (NTRS)

Gannon, D. B.; Van Rosendale, J.

1984-01-01

This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Eigensolution of finite element problems in a completely connected parallel architecture

NASA Technical Reports Server (NTRS)

Akl, Fred A.; Morel, Michael R.

1989-01-01

A parallel algorithm for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi)=(M)(phi)(omega), where (K) and (M) are of order N, and (omega) is of order q is presented. The parallel algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm has been successfully implemented on a tightly coupled multiple-instruction-multiple-data (MIMD) parallel processing computer, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macro-tasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18 and 3.61 are achieved on two, four, six and eight processors, respectively.
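The abstract above uses speed-up and efficiency as its measures but does not define them. As a rough illustration, assuming the common definition of efficiency as speed-up divided by the number of processors (an assumption, not a statement from the paper), the reported figures can be turned into efficiencies as follows. The names `reported_speedups` and `parallel_efficiency` are hypothetical.

```python
# Speed-ups quoted in the abstract for the 64-element rectangular plate,
# keyed by the number of processors.
reported_speedups = {2: 1.86, 4: 3.13, 6: 3.18, 8: 3.61}


def parallel_efficiency(speedup, processors):
    """Assumed definition: efficiency = speed-up / number of processors."""
    return speedup / processors


efficiencies = {
    p: parallel_efficiency(s, p) for p, s in reported_speedups.items()
}

for p, e in efficiencies.items():
    print(f"{p} processors: speed-up {reported_speedups[p]:.2f}, "
          f"efficiency {e:.2f}")
```

Under this definition the efficiency falls steadily as processors are added, which is consistent with the abstract's interest in how the number of domains affects performance.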
A review on quantum search algorithms

NASA Astrophysics Data System (ADS)

Giri, Pulak Ranjan; Korepin, Vladimir E.

2017-12-01

The use of superposition of states in quantum computation, known as quantum parallelism, has a significant advantage in terms of speed over classical computation. This is evident from the early quantum algorithms such as Deutsch's algorithm, the Deutsch-Jozsa algorithm and its variation, the Bernstein-Vazirani algorithm, Simon's algorithm, Shor's algorithms, etc. Quantum parallelism also significantly speeds up the database search algorithm, which is important in computer science because it comes as a subroutine in many important algorithms. Grover's quantum database search achieves the task of finding the target element in an unsorted database in a time quadratically faster than a classical computer. We review Grover's quantum search algorithms for single and multiple target elements in a database. The partial search algorithm of Grover and Radhakrishnan and its optimization by Korepin, called the GRK algorithm, are also discussed.
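The quadratic speed-up of Grover's search mentioned above can be seen in a small classical simulation of the amplitude vector. This is only a sketch for a single target element: it follows the textbook oracle-plus-diffusion iteration, not any specific construction from the review, and the function name `grover_search` and the problem size are hypothetical.

```python
import math


def grover_search(n_items, target):
    """Classically simulate Grover's iteration on a real amplitude vector."""
    # Start in the uniform superposition over all items.
    amps = [1.0 / math.sqrt(n_items)] * n_items
    # Roughly (pi/4) * sqrt(N) iterations maximise the target probability.
    iterations = int(math.floor(math.pi / 4 * math.sqrt(n_items)))
    for _ in range(iterations):
        # Oracle: flip the sign of the target amplitude.
        amps[target] = -amps[target]
        # Diffusion: reflect every amplitude about the mean amplitude.
        mean = sum(amps) / n_items
        amps = [2 * mean - a for a in amps]
    probabilities = [a * a for a in amps]
    return iterations, probabilities


iterations, probabilities = grover_search(n_items=64, target=5)
best = max(range(64), key=lambda i: probabilities[i])
print(f"{iterations} iterations, most likely item {best}, "
      f"probability {probabilities[best]:.4f}")
```

For 64 items the loop runs 6 times, compared with an average of about 32 probes for a classical linear scan, which is the square-root scaling the abstract refers to.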
Data parallel sorting for particle simulation

NASA Technical Reports Server (NTRS)

Dagum, Leonardo

1992-01-01

Sorting on a parallel architecture is a communications intensive event which can incur a high penalty in applications where it is required. In the case of particle simulation, only integer sorting is necessary, and sequential implementations easily attain the minimum performance bound of O(N) for N particles. Parallel implementations, however, have to cope with the parallel sorting problem which, in addition to incurring a heavy communications cost, can make the minimum performance bound difficult to attain. This paper demonstrates how the sorting problem in a particle simulation can be reduced to a merging problem, and describes an efficient data parallel algorithm to solve this merging problem in a particle simulation. The new algorithm is shown to be optimal under conditions usual for particle simulation, and its fieldwise implementation on the Connection Machine is analyzed in detail. The new algorithm is about four times faster than a fieldwise implementation of radix sort on the Connection Machine.
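The O(N) sequential bound for integer sorting mentioned above can be illustrated with a counting sort keyed on cell index. This is a sequential sketch only, not the merge-based data parallel algorithm of the paper; the function name `rank_particles_by_cell` and the sample data are hypothetical.

```python
def rank_particles_by_cell(cell_of_particle, n_cells):
    """Return each particle's rank in cell-sorted order, plus cell offsets."""
    # Count how many particles fall in each cell: one pass over N particles.
    counts = [0] * n_cells
    for c in cell_of_particle:
        counts[c] += 1
    # An exclusive prefix sum gives the first output slot of every cell.
    start = [0] * n_cells
    for c in range(1, n_cells):
        start[c] = start[c - 1] + counts[c - 1]
    # Assign ranks in input order, so particles keep their relative order
    # within a cell (the sort is stable).
    next_slot = list(start)
    ranks = []
    for c in cell_of_particle:
        ranks.append(next_slot[c])
        next_slot[c] += 1
    return ranks, start


cells = [2, 0, 1, 2, 0, 1, 1]
ranks, start = rank_particles_by_cell(cells, n_cells=3)

# Scatter the particles to their ranks to obtain the cell-sorted order.
sorted_cells = [None] * len(cells)
for particle, r in enumerate(ranks):
    sorted_cells[r] = cells[particle]
print(ranks, start, sorted_cells)
```

The `start` array is what gives immediate access to all particles of a given cell once they have been scattered to their ranks.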
Message-passing-interface-based parallel FDTD investigation on the EM scattering from a 1-D rough sea surface using uniaxial perfectly matched layer absorbing boundary.

PubMed

Li, J; Guo, L-X; Zeng, H; Han, X-B

2009-06-01

A message-passing-interface (MPI)-based parallel finite-difference time-domain (FDTD) algorithm for the electromagnetic scattering from a 1-D randomly rough sea surface is presented. The uniaxial perfectly matched layer (UPML) medium is adopted for truncation of FDTD lattices, in which the finite-difference equations can be used for the total computation domain by properly choosing the uniaxial parameters. This makes the parallel FDTD algorithm easier to implement. The parallel performance with different processors is illustrated for one sea surface realization, and the computation time of the parallel FDTD algorithm is dramatically reduced compared to a single-process implementation. Finally, some numerical results are shown, including the backscattering characteristics of sea surface for different polarization and the bistatic scattering from a sea surface with large incident angle and large wind speed.
A Parallel Numerical Algorithm To Solve Linear Systems Of Equations Emerging From 3D Radiative Transfer

NASA Astrophysics Data System (ADS)

Wichert, Viktoria; Arkenberg, Mario; Hauschildt, Peter H.

2016-10-01

Highly resolved state-of-the-art 3D atmosphere simulations will remain computationally extremely expensive for years to come. In addition to the need for more computing power, rethinking coding practices is necessary. We take a dual approach by introducing especially adapted, parallel numerical methods and correspondingly parallelizing critical code passages. In the following, we present our respective work on PHOENIX/3D. With new parallel numerical algorithms, there is a big opportunity for improvement when iteratively solving the system of equations emerging from the operator splitting of the radiative transfer equation J = ΛS. The narrow-banded approximate Λ-operator Λ* , which is used in PHOENIX/3D, occurs in each iteration step. By implementing a numerical algorithm which takes advantage of its characteristic traits, the parallel code's efficiency is further increased and a speed-up in computational time can be achieved.

Treatment Development and Feasibility Study of Family-Focused Treatment for Adolescents with Bipolar Disorder and Comorbid Substance Use Disorders

PubMed Central

Goldstein, Benjamin I.; Goldstein, Tina R.; Collinger, Katelyn A.; Axelson, David A.; Bukstein, Oscar G.; Birmaher, Boris; Miklowitz, David J.

2014-01-01

Background Comorbid substance use disorders (SUD) are associated with increased illness severity and functional impairment among adolescents with bipolar disorder (BD). Previous psychosocial treatment studies have excluded adolescents with both BD and SUD. Studies suggest that integrated interventions are optimal for adults with BD and SUD. Methods We modified family-focused treatment for adolescents with BD (FFT-A) in order to explicitly target comorbid SUD (FFT-SUD). Ten adolescents with BD who had both SUD and an exacerbation of manic, depressed, or mixed symptoms within the last 3 months were enrolled. FFT-SUD was offered as an adjunct to pharmacotherapy, with a target of 21 sessions over 12 months of treatment. The FFT-SUD manual was iteratively modified to integrate a concurrent focus on SUD. Results Six subjects completed a mid-treatment 6-month assessment (after a mean of 16 sessions was completed). Of the 10 subjects, 3 dropped out early ( after ≤ 1 session); in the case of each of these subjects, the participating parent had active SUD. No other subjects in the study had a parent with active SUD. Preliminary findings suggested significant reductions in manic symptoms and depressive symptoms and improved global functioning. Reduction in cannabis use was modest and did not reach significance. Limitations Limitations included a small sample, open treatment, concurrent medications, and no control group. Conclusions These preliminary findings suggest that FFT-SUD is a feasible intervention, particularly for youth without parental SUD. FFT-SUD may be effective in treating mood symptoms, particularly depression, despite modest reductions in substance use. Integrating motivation enhancing strategies may augment the effect of this intervention on substance use. Additional strategies, such as targeting parental substance use, may prevent early attrition. PMID:24847999
Detection of buried targets using a new enhanced very early time electromagnetic (VETEM) prototype system

USGS Publications Warehouse

Cui, T.J.; Chew, W.C.; Aydiner, A.A.; Wright, D.L.; Smith, D.V.

2001-01-01

In this paper, numerical simulations of a new enhanced very early time electromagnetic (VETEM) prototype system are presented, where a horizontal transmitting loop and two horizontal receiving loops are used to detect buried targets; the three loops share the same axis and the transmitter is located at the center of the receivers. In the new VETEM system, the difference of the signals from the two receivers is taken to eliminate strong direct signals from the transmitter and background clutter, and furthermore to obtain a better SNR for buried targets. Because strong coupling exists between the transmitter and receivers, accurate analysis of the three-loop antenna system is required, for which a loop-tree basis function method has been utilized to overcome the low-frequency breakdown problem. In the analysis of the scattering problem from buried targets, a conjugate gradient (CG) method with fast Fourier transform (FFT) is applied to solve the electric field integral equation. However, the convergence of such a CG-FFT algorithm is extremely slow at very low frequencies. In order to increase the convergence rate, a frequency-hopping approach has been used. Finally, the primary, coupling, reflected, and scattered magnetic fields are evaluated at the receiving loops to calculate the output electric current. Numerous simulation results are given to interpret the new VETEM system. Compared with other single-transmitter-receiver systems, the new VETEM has a better SNR and a better ability to reduce the clutter.
Performance analysis of a finite radon transform in OFDM system under different channel models

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dawood, Sameer A.; Anuar, M. S.; Fayadh, Rashid A.

In this paper, a class of discrete Radon transforms, namely the Finite Radon Transform (FRAT), is proposed as a modulation technique in the realization of Orthogonal Frequency Division Multiplexing (OFDM). The proposed FRAT operates as a data mapper in the OFDM transceiver instead of the conventional phase shift mapping and quadrature amplitude mapping that are usually used with the standard OFDM based on the Fast Fourier Transform (FFT), in a way that increases the orthogonality of the system. The Fourier domain approach was found here to be the more suitable way of obtaining the forward and inverse FRAT. This structure resulted in a more suitable realization of conventional FFT-OFDM. It was shown that this application increases the orthogonality significantly, both because the Inverse Fast Fourier Transform (IFFT) is used twice, namely in the data mapping and in the sub-carrier modulation, and because an efficient algorithm, called the optimal ordering method, is used to determine the FRAT coefficients. The proposed approach was tested and compared with conventional OFDM for an additive white Gaussian noise (AWGN) channel, a flat fading channel, and a multi-path frequency selective fading channel. The obtained results showed that the proposed system improved the bit error rate (BER) performance by reducing inter-symbol interference (ISI) and inter-carrier interference (ICI), compared with the conventional OFDM system.
Performance analysis of a finite radon transform in OFDM system under different channel models

NASA Astrophysics Data System (ADS)

Dawood, Sameer A.; Malek, F.; Anuar, M. S.; Fayadh, Rashid A.; Abdullah, Farrah Salwani

2015-05-01

In this paper, a class of discrete Radon transforms, namely the Finite Radon Transform (FRAT), is proposed as a modulation technique in the realization of Orthogonal Frequency Division Multiplexing (OFDM). The proposed FRAT operates as a data mapper in the OFDM transceiver instead of the conventional phase shift mapping and quadrature amplitude mapping that are usually used with the standard OFDM based on the Fast Fourier Transform (FFT), in a way that increases the orthogonality of the system. The Fourier domain approach was found here to be the more suitable way of obtaining the forward and inverse FRAT. This structure resulted in a more suitable realization of conventional FFT-OFDM. It was shown that this application increases the orthogonality significantly, both because the Inverse Fast Fourier Transform (IFFT) is used twice, namely in the data mapping and in the sub-carrier modulation, and because an efficient algorithm, called the optimal ordering method, is used to determine the FRAT coefficients. The proposed approach was tested and compared with conventional OFDM for an additive white Gaussian noise (AWGN) channel, a flat fading channel, and a multi-path frequency selective fading channel. The obtained results showed that the proposed system improved the bit error rate (BER) performance by reducing inter-symbol interference (ISI) and inter-carrier interference (ICI), compared with the conventional OFDM system.
Reducing acquisition times in multidimensional NMR with a time-optimized Fourier encoding algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Zhiyong; Department of Electronic Science, Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Xiamen University, Xiamen, Fujian 361005; Smith, Pieter E. S.

Speeding up the acquisition of multidimensional nuclear magnetic resonance (NMR) spectra is an important topic in contemporary NMR, with central roles in high-throughput investigations and analyses of marginally stable samples. A variety of fast NMR techniques have been developed, including methods based on non-uniform sampling and Hadamard encoding, that overcome the long sampling times inherent to schemes based on fast-Fourier-transform (FFT) methods. Here, we explore the potential of an alternative fast acquisition method that leverages a priori knowledge, to tailor polychromatic pulses and customized time delays for an efficient Fourier encoding of the indirect domain of an NMR experiment. By porting the encoding of the indirect-domain to the excitation process, this strategy avoids potential artifacts associated with non-uniform sampling schemes and uses a minimum number of scans equal to the number of resonances present in the indirect dimension. An added convenience is afforded by the fact that a usual 2D FFT can be used to process the generated data. Acquisitions of 2D heteronuclear correlation NMR spectra on quinine and on the anti-inflammatory drug isobutyl propionic phenolic acid illustrate the new method's performance. This method can be readily automated to deal with complex samples such as those occurring in metabolomics, in in-cell as well as in in vivo NMR applications, where speed and temporal stability are often primary concerns.
Partitioning and packing mathematical simulation models for calculation on parallel computers

NASA Technical Reports Server (NTRS)

Arpasi, D. J.; Milner, E. J.

1986-01-01

The development of multiprocessor simulations from a serial set of ordinary differential equations describing a physical system is described. Degrees of parallelism (i.e., coupling between the equations) and their impact on parallel processing are discussed. The problem of identifying computational parallelism within sets of closely coupled equations that require the exchange of current values of variables is described. A technique is presented for identifying this parallelism and for partitioning the equations for parallel solution on a multiprocessor. An algorithm which packs the equations into a minimum number of processors is also described. The results of the packing algorithm when applied to a turbojet engine model are presented in terms of processor utilization.
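The abstract above does not say how its packing algorithm works, so the following is only a stand-in: the well-known first-fit decreasing heuristic for bin packing, which ignores the coupling between equations and is not guaranteed to reach the true minimum number of processors. The equation names, costs and the per-processor `capacity` are all hypothetical.

```python
def first_fit_decreasing(costs, capacity):
    """Pack task costs into processors of fixed capacity (a heuristic)."""
    bins = []   # bins[i] is the list of tasks placed on processor i
    loads = []  # loads[i] is the total cost already on processor i
    # Place the most expensive tasks first.
    ordered = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    for name, cost in ordered:
        if cost > capacity:
            raise ValueError(f"task {name} does not fit on any processor")
        for i, load in enumerate(loads):
            # Use the first processor that still has room.
            if load + cost <= capacity:
                bins[i].append(name)
                loads[i] += cost
                break
        else:
            # No existing processor has room, so open a new one.
            bins.append([name])
            loads.append(cost)
    return bins, loads


# Hypothetical per-equation compute costs within one time step.
costs = {"eq1": 6, "eq2": 5, "eq3": 4, "eq4": 3, "eq5": 2}
capacity = 10
bins, loads = first_fit_decreasing(costs, capacity)

# Processor utilization, the metric the abstract reports results in.
utilization = [load / capacity for load in loads]
print(bins, utilization)
```

A packing that respects data dependencies between equations, as the paper's partitioning technique does, would need extra constraints on which equations may share a processor.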
Parallel integer sorting with medium and fine-scale parallelism

NASA Technical Reports Server (NTRS)

Dagum, Leonardo

1993-01-01

Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort is designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128 processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray YMP.
Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bylaska, Eric J.; Weare, Jonathan Q.; Weare, John H.

2013-08-21

Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f (e.g., the Verlet algorithm), is available to propagate the system from time $t_i$ (trajectory positions and velocities $x_i = (r_i, v_i)$) to time $t_{i+1}$ ($x_{i+1}$) by $x_{i+1} = f_i(x_i)$, the dynamics problem spanning an interval from $t_0$ to $t_M$ can be transformed into a root finding problem, $F(X) = [x_i - f(x_{i-1})]_{i=1,\dots,M} = 0$, for the trajectory variables. The root finding problem is solved using a variety of optimization techniques, including quasi-Newton and preconditioned quasi-Newton optimization schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed and the effectiveness of various approaches to solving the root finding problem are tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl+4H2O AIMD simulation at the MP2 level. The maximum speedup obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a distributed computing environment using very slow TCP/IP networks. Scripts written in Python that make calls to a precompiled quantum chemistry package (NWChem) are demonstrated to provide an actual speedup of 8.2 for a 2.5 ps AIMD simulation of HCl+4H2O at the MP2/6-31G* level. Implemented in this way these algorithms can be used for long time high-level AIMD simulations at a modest cost using machines connected by very slow networks such as WiFi, or in different time zones connected by the Internet. The algorithms can also be used with programs that are already parallel. By using these algorithms we are able to reduce the cost of a MP2/6-311++G(2d,2p) simulation that had reached its maximum possible speedup in the parallelization of the electronic structure calculation from 32 seconds per time step to 6.9 seconds per time step.
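The root finding formulation in the abstract above, in which the whole trajectory is the unknown and entry i of F(X) is x_i - f(x_{i-1}), can be sketched on a toy problem. The example below uses the simplest possible solver, a plain fixed-point sweep, rather than the quasi-Newton schemes of the paper, and a one-dimensional harmonic oscillator rather than MD or AIMD forces; the names `verlet_step`, `residual` and `solve_trajectory` are hypothetical.

```python
def verlet_step(state, dt=0.1):
    """One velocity-Verlet step for a unit harmonic oscillator (force = -r)."""
    r, v = state
    v_half = v + 0.5 * dt * (-r)
    r_new = r + dt * v_half
    v_new = v_half + 0.5 * dt * (-r_new)
    return (r_new, v_new)


def residual(traj, x0, step):
    """Largest component of F(X) = [x_i - f(x_{i-1})] for i = 1..M."""
    prev = [x0] + traj[:-1]
    return max(
        abs(a - b)
        for x, p in zip(traj, prev)
        for a, b in zip(x, step(p))
    )


def solve_trajectory(x0, n_steps, step, tol=1e-12):
    """Fixed-point sweeps; the entries of one sweep are mutually independent."""
    traj = [x0] * n_steps  # crude initial guess for x_1..x_M
    sweeps = 0
    while residual(traj, x0, step) > tol:
        prev = [x0] + traj[:-1]
        # Each step(prev[i]) could be evaluated on its own processor.
        traj = [step(p) for p in prev]
        sweeps += 1
    return traj, sweeps


x0 = (1.0, 0.0)
M = 8
parallel_traj, sweeps = solve_trajectory(x0, M, verlet_step)

# Reference: ordinary serial time stepping.
serial_traj, x = [], x0
for _ in range(M):
    x = verlet_step(x)
    serial_traj.append(x)
print(sweeps, parallel_traj[-1], serial_traj[-1])
```

This plain sweep needs as many sweeps as there are time steps, so it gives no speedup by itself; it only shows that every entry of F(X) can be evaluated independently. Any gain, such as the speedups quoted in the abstract, has to come from a solver that converges in far fewer iterations than M.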
Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations.

PubMed

Bylaska, Eric J; Weare, Jonathan Q; Weare, John H

2013-08-21

Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, $f$ (e.g., Verlet algorithm), is available to propagate the system from time $t_i$ (trajectory positions and velocities $x_i = (r_i, v_i)$) to time $t_{i+1}$ ($x_{i+1}$) by $x_{i+1} = f_i(x_i)$, the dynamics problem spanning an interval from $t_0 \dots t_M$ can be transformed into a root finding problem, $F(X) = [x_i - f(x_{i-1})]_{i=1,\dots,M} = 0$, for the trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of $F(X)$. The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl + 4H2O AIMD simulation at the MP2 level. The maximum speedup (serial execution time/parallel execution time) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations, the algorithms achieved speedups of up to 14.3.
The parallel in time algorithms can be implemented in a distributed computing environment using very slow transmission control protocol/Internet protocol networks. Scripts written in Python that make calls to a precompiled quantum chemistry package (NWChem) are demonstrated to provide an actual speedup of 8.2 for a 2.5 ps AIMD simulation of HCl + 4H2O at the MP2/6-31G* level. Implemented in this way, these algorithms can be used for long time high-level AIMD simulations at a modest cost using machines connected by very slow networks such as WiFi, or in different time zones connected by the Internet. The algorithms can also be used with programs that are already parallel. Using these algorithms, we are able to reduce the cost of an MP2/6-311++G(2d,2p) simulation that had reached its maximum possible speedup in the parallelization of the electronic structure calculation from 32 s/time step to 6.9 s/time step.
Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bylaska, Eric J., E-mail: Eric.Bylaska@pnnl.gov; Weare, Jonathan Q., E-mail: weare@uchicago.edu; Weare, John H., E-mail: jweare@ucsd.edu

2013-08-21

Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, $f$ (e.g., Verlet algorithm), is available to propagate the system from time $t_i$ (trajectory positions and velocities $x_i = (r_i, v_i)$) to time $t_{i+1}$ ($x_{i+1}$) by $x_{i+1} = f_i(x_i)$, the dynamics problem spanning an interval from $t_0 \dots t_M$ can be transformed into a root finding problem, $F(X) = [x_i - f(x_{i-1})]_{i=1,\dots,M} = 0$, for the trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of $F(X)$. The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl + 4H2O AIMD simulation at the MP2 level. The maximum speedup (serial execution time/parallel execution time) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0.
For the AIMD MP2 simulations, the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a distributed computing environment using very slow transmission control protocol/Internet protocol networks. Scripts written in Python that make calls to a precompiled quantum chemistry package (NWChem) are demonstrated to provide an actual speedup of 8.2 for a 2.5 ps AIMD simulation of HCl + 4H2O at the MP2/6-31G* level. Implemented in this way, these algorithms can be used for long time high-level AIMD simulations at a modest cost using machines connected by very slow networks such as WiFi, or in different time zones connected by the Internet. The algorithms can also be used with programs that are already parallel. Using these algorithms, we are able to reduce the cost of an MP2/6-311++G(2d,2p) simulation that had reached its maximum possible speedup in the parallelization of the electronic structure calculation from 32 s/time step to 6.9 s/time step.
Parallel Processing of Broad-Band PPM Signals

NASA Technical Reports Server (NTRS)

Gray, Andrew; Kang, Edward; Lay, Norman; Vilnrotter, Victor; Srinivasan, Meera; Lee, Clement

2010-01-01

A parallel-processing algorithm and a hardware architecture to implement the algorithm have been devised for timeslot synchronization in the reception of pulse-position-modulated (PPM) optical or radio signals. As in the cases of some prior algorithms and architectures for parallel, discrete-time, digital processing of signals other than PPM, an incoming broadband signal is divided into multiple parallel narrower-band signals by means of sub-sampling and filtering. The number of parallel streams is chosen so that the frequency content of the narrower-band signals is low enough to enable processing by relatively-low speed complementary metal oxide semiconductor (CMOS) electronic circuitry. The algorithm and architecture are intended to satisfy requirements for time-varying time-slot synchronization and post-detection filtering, with correction of timing errors independent of estimation of timing errors. They are also intended to afford flexibility for dynamic reconfiguration and upgrading. The architecture is implemented in a reconfigurable CMOS processor in the form of a field-programmable gate array. The algorithm and its hardware implementation incorporate three separate time-varying filter banks for three distinct functions: correction of sub-sample timing errors, post-detection filtering, and post-detection estimation of timing errors. The design of the filter bank for correction of timing errors, the method of estimating timing errors, and the design of a feedback-loop filter are governed by a host of parameters, the most critical one, with regard to processing very broadband signals with CMOS hardware, being the number of parallel streams (equivalently, the rate-reduction parameter).
Arrhythmia Evaluation in Wearable ECG Devices

PubMed Central

Sadrawi, Muammar; Lin, Chien-Hung; Hsieh, Yita; Kuo, Chia-Chun; Chien, Jen Chien; Haraikawa, Koichi; Abbod, Maysam F.; Shieh, Jiann-Shing

2017-01-01

This study evaluates four databases from PhysioNet: the American Heart Association database (AHADB), the Creighton University Ventricular Tachyarrhythmia database (CUDB), the MIT-BIH Arrhythmia database (MITDB), and the MIT-BIH Noise Stress Test database (NSTDB). ANSI/AAMI EC57:2012 is used to evaluate the algorithms for supraventricular ectopic beat (SVEB), ventricular ectopic beat (VEB), atrial fibrillation (AF), and ventricular fibrillation (VF) detection in terms of sensitivity, positive predictivity, and false positive rate. Sample entropy, the fast Fourier transform (FFT), and a multilayer perceptron neural network trained with the backpropagation algorithm are selected for the integrated detection algorithms. In this study, the SVEB results show some improvement over a previous study that also used ANSI/AAMI EC57. Furthermore, the gross VEB sensitivity and positive predictivity are greater than 80%, except for the positive predictivity on the NSTDB database. For the gross AF evaluation on the MITDB database, the results show very good classification, excluding the episode sensitivity. In addition, for the gross VF evaluation, the episode sensitivity and positive predictivity for the AHADB, MITDB, and CUDB are greater than 80%, except for the MITDB episode positive predictivity, which is 75%. The achieved results show that the proposed integrated SVEB, VEB, AF, and VF detection algorithm classifies accurately according to ANSI/AAMI EC57:2012. In conclusion, the proposed integrated detection algorithm achieves good accuracy in comparison with previous studies. More advanced algorithms and hardware devices should be investigated in the future for arrhythmia detection and evaluation. PMID:29068369
Acoustic simulation in architecture with parallel algorithm

NASA Astrophysics Data System (ADS)

Li, Xiaohong; Zhang, Xinrong; Li, Dan

2004-03-01

In view of the complexity of architectural environments and the need for real-time simulation of architectural acoustics, a parallel radiosity algorithm was developed. The distribution of sound energy in a scene is solved with this method. The impulse responses between sources and receivers in each frequency segment, which are calculated by multiple processes, are then combined into the whole frequency response. The numerical experiment shows that the parallel algorithm can improve the efficiency of acoustic simulation for complex scenes.
Parallel eigenanalysis of finite element models in a completely connected architecture

NASA Technical Reports Server (NTRS)

Akl, F. A.; Morel, M. R.

1989-01-01

A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi) = (M)(phi)(omega), where (K) and (M) are of order N, and (omega) is of order q. The concurrent solution of the eigenproblem is based on the multifrontal/modified subspace method and is achieved in a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm was successfully implemented on a tightly coupled multiple-instruction multiple-data parallel processing machine, the Cray X-MP. A finite element model is divided into m domains, each of which is assumed to process n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macrotasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effects of the number of domains, the number of degrees-of-freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm are investigated. A parallel finite element dynamic analysis program, p-feda, is documented and the performance of its subroutines in a parallel environment is analyzed.
Highly Parallel Alternating Directions Algorithm for Time Dependent Problems

NASA Astrophysics Data System (ADS)

Ganzha, M.; Georgiev, K.; Lirkov, I.; Margenov, S.; Paprzycki, M.

2011-11-01

In our work, we consider the time dependent Stokes equation on a finite time interval and on a uniform rectangular mesh, written in terms of velocity and pressure. For this problem, a parallel algorithm based on a novel direction splitting approach is developed. Here, the pressure equation is derived from a perturbed form of the continuity equation, in which the incompressibility constraint is penalized in a negative norm induced by the direction splitting. The scheme used in the algorithm is composed of two parts: (i) velocity prediction, and (ii) pressure correction. This is a Crank-Nicolson-type two-stage time integration scheme for two and three dimensional parabolic problems in which the second-order derivative, with respect to each space variable, is treated implicitly while the other variable is made explicit at each time sub-step. In order to achieve good parallel performance, the solution of the Poisson problem for the pressure correction is replaced by solving a sequence of one-dimensional second order elliptic boundary value problems in each spatial direction. The parallel code is implemented using the standard MPI functions and tested on two modern parallel computer systems. The numerical tests performed demonstrate a good level of parallel efficiency and scalability of the studied direction-splitting-based algorithm.
Application of hybrid clustering using parallel k-means algorithm and DIANA algorithm

NASA Astrophysics Data System (ADS)

Umam, Khoirul; Bustamam, Alhadi; Lestari, Dian

2017-03-01

DNA is one of the carriers of genetic information in living organisms. Encoding, sequencing, and clustering DNA sequences have become key, routine tasks in molecular biology, in particular in bioinformatics applications. There are two types of clustering: hierarchical clustering and partitioning clustering. In this paper, we combine the two types, namely K-Means (partitioning clustering) and DIANA (hierarchical clustering), into what we call hybrid clustering. Hybrid clustering using the Parallel K-Means algorithm and the DIANA algorithm is applied to cluster DNA sequences of Human Papillomavirus (HPV). The clustering process starts by collecting HPV DNA sequences from NCBI (National Center for Biotechnology Information) and then extracting characteristics from the DNA sequences. The extracted characteristics are stored in matrix form; the matrix is then normalized using Min-Max normalization, and genetic distances are calculated using the Euclidean distance. The hybrid clustering is then applied using implementations of the Parallel K-Means algorithm and the DIANA algorithm. The aim of using hybrid clustering is to obtain better clustering results. To validate the resulting clusters and to obtain the optimum number of clusters, we use the Davies-Bouldin Index (DBI). In this study, Parallel K-Means clustering groups the data into 5 clusters with a minimal DBI value of 0.8741, and hybrid clustering groups the data into 13 sub-clusters with minimal DBI values of 0.8216, 0.6845, 0.3331, 0.1994, and 0.3952. The DBI values of hybrid clustering are lower than the DBI value of the Parallel K-Means clustering performed alone at the first stage. This means that hybrid clustering clusters the HPV DNA sequences better than Parallel K-Means clustering alone.
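The preprocessing and validation steps named in this abstract (Min-Max normalization, Euclidean distance, Davies-Bouldin Index) can be sketched as follows. The feature matrix and cluster labels are made-up toy values, and the function names are hypothetical; the sketch uses the usual textbook definition of the index and does not reproduce the authors' K-Means or DIANA implementations.

```python
import math


def min_max_normalize(matrix):
    """Rescale every column of a feature matrix to the range [0, 1]."""
    cols = list(zip(*matrix))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in matrix
    ]


def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower values indicate tighter, better separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    cents = {lab: [sum(c) / len(pts) for c in zip(*pts)] for lab, pts in clusters.items()}
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {
        lab: sum(euclidean(p, cents[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    labs = list(clusters)
    total = 0.0
    for i in labs:
        total += max(
            (scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in labs
            if j != i
        )
    return total / len(labs)


# Toy feature matrix (assumed values, one row per sequence).
features = [[1, 10], [2, 10], [9, 30], [10, 30]]
normalized = min_max_normalize(features)
dbi_good = davies_bouldin(normalized, [0, 0, 1, 1])  # compact, well separated
dbi_bad = davies_bouldin(normalized, [0, 1, 0, 1])   # clusters that overlap
```

Comparing the two labelings shows how the index is read: the grouping with the lower value is the preferred one, which is the criterion the abstract uses to compare hybrid clustering against Parallel K-Means alone.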
FPGA implementation of sparse matrix algorithm for information retrieval

NASA Astrophysics Data System (ADS)

Bojanic, Slobodan; Jevtic, Ruzica; Nieto-Taladriz, Octavio

2005-06-01

Text information retrieval requires a tremendous amount of processing time because of the size of the data and the complexity of information retrieval algorithms. In this paper, a solution to this problem is proposed via hardware-supported information retrieval algorithms. Reconfigurable computing can accommodate frequent hardware modifications through its tailorable hardware and exploits parallelism for a given application through reconfigurable and flexible hardware units. The degree of parallelism can be tuned to the data. In this work we implemented a standard BLAS (basic linear algebra subprogram) sparse matrix algorithm named Compressed Sparse Row (CSR), which has been shown to be more efficient in terms of storage space requirements and query-processing time than other sparse matrix algorithms for information retrieval applications. Although the inverted index algorithm has been treated as the de facto standard for information retrieval for years, an alternative approach that stores the index of a text collection in a sparse matrix structure is gaining attention. This approach performs query processing using sparse matrix-vector multiplication and, owing to parallelization, achieves a substantial efficiency gain over the sequential inverted index. Parallel implementations of the information retrieval kernel are presented in this work, targeting a Virtex II Field Programmable Gate Array (FPGA) board from Xilinx. A recent development in scientific applications is the use of FPGAs to achieve high performance results. Computational results are compared to implementations on other platforms. The design achieves a high level of parallelism for the overall function while retaining highly optimised hardware within the processing unit.
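To make the CSR-based query processing concrete, here is a minimal software sketch of sparse matrix-vector multiplication over a document-term matrix. It is a plain serial version for illustration, not the FPGA design from the paper; the tiny matrix, the query vector, and the function names are assumptions.

```python
def to_csr(dense):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # marks where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x, touching only the stored nonzeros of A."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)  # rows are independent, so they can be computed in parallel
    return y


# Rows are documents, columns are terms, entries are term counts (toy data).
doc_term = [
    [2, 0, 1, 0],
    [0, 0, 0, 3],
    [0, 1, 0, 1],
]
query = [1, 0, 0, 1]  # the query contains terms 0 and 3
values, col_idx, row_ptr = to_csr(doc_term)
scores = csr_matvec(values, col_idx, row_ptr, query)
```

Each row's dot product is independent of the others, which is the property a parallel implementation can exploit by assigning rows to separate processing units.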
Implementation of a fully-balanced periodic tridiagonal solver on a parallel distributed memory architecture

NASA Technical Reports Server (NTRS)

Eidson, T. M.; Erlebacher, G.

1994-01-01

While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem - a periodic tridiagonal solver - are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the various strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamic simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand-sides (RHS) of the system of equations. Each RHS solution is independent and thus can be computed in parallel. Thus a Gaussian elimination type algorithm can be used in a parallel computation and the more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm allow the data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm will be shown to directly transform a sequential Gaussian elimination type algorithm into the parallel chained, load-balanced algorithm.
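The sketch below shows a serial Gaussian elimination (Thomas) solve for a periodic tridiagonal system, applied to several independent right-hand sides. The periodic corner entries are handled with a Sherman-Morrison correction, a standard textbook technique chosen here as an assumption; the abstract does not say how the authors treat periodicity. The coefficients and function names are illustrative only.

```python
def thomas(a, b, c, d):
    """Solve a non-periodic tridiagonal system; a[0] and c[-1] are ignored."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):  # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_periodic(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] with periodic indices (n >= 3)."""
    n = len(d)
    gamma = -b[0]
    bb = list(b)
    bb[0] = b[0] - gamma                      # corners are removed as a rank-one update
    bb[-1] = b[-1] - c[-1] * a[0] / gamma
    u = [0.0] * n
    u[0], u[-1] = gamma, c[-1]
    y = thomas(a, bb, c, d)
    q = thomas(a, bb, c, u)
    vy = y[0] + a[0] * y[-1] / gamma
    vq = q[0] + a[0] * q[-1] / gamma
    factor = vy / (1.0 + vq)
    return [yi - factor * qi for yi, qi in zip(y, q)]


def periodic_matvec(a, b, c, x):
    """Multiply the periodic tridiagonal matrix by x (used to check a solution)."""
    n = len(x)
    return [a[i] * x[i - 1] + b[i] * x[i] + c[i] * x[(i + 1) % n] for i in range(n)]


n = 6
a, b, c = [1.0] * n, [4.0] * n, [1.0] * n
rhs_list = [[float(i + 1) for i in range(n)], [1.0, 0.0, 0.0, 0.0, 0.0, -1.0]]
# Each right-hand side is solved independently, so this loop is the part
# that could be split across processors.
solutions = [solve_periodic(a, b, c, d) for d in rhs_list]
```

Splitting the loop over right-hand sides among processors corresponds loosely to the transpose strategy in the abstract. The auxiliary solve for q does not depend on the right-hand side, so a tuned version would compute it once and reuse it.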
Parallel fuzzy connected image segmentation on GPU

PubMed Central

Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.

2011-01-01

Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA’s Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. Methods: In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on the GPU. A dramatic improvement in speed for both tasks is achieved as a result. Results: Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves speed-up factors of 24.4x, 18.1x, and 10.3x, respectively, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, respectively, for the three data sets. Conclusions: The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on NVIDIA GPUs, which are far more cost- and speed-effective than both clusters of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set. PMID:21859037
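The two computational tasks named in the Methods (fuzzy affinity and fuzzy connectedness) can be illustrated with a small serial sketch. It treats the connectedness of a pixel to a seed as the best path strength, where a path is only as strong as its weakest affinity, and computes it with a Dijkstra-like priority queue. This is a CPU illustration, not the CUDA kernels of the paper; the Gaussian affinity, the sigma value, and the function names are assumptions.

```python
import heapq
import math


def affinity(p, q, sigma=10.0):
    """Hypothetical intensity-based affinity in (0, 1] between neighbouring pixels."""
    return math.exp(-((p - q) ** 2) / (2.0 * sigma**2))


def fuzzy_connectedness(image, seed):
    """Connectedness of every pixel to the seed: max over paths of the min affinity."""
    rows, cols = len(image), len(image[0])
    conn = [[0.0] * cols for _ in range(rows)]
    conn[seed[0]][seed[1]] = 1.0
    heap = [(-1.0, seed[0], seed[1])]  # max-heap via negated strengths
    while heap:
        neg, r, c = heapq.heappop(heap)
        strength = -neg
        if strength < conn[r][c]:
            continue  # stale entry, a stronger path was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # A path is only as strong as its weakest link.
                cand = min(strength, affinity(image[r][c], image[nr][nc]))
                if cand > conn[nr][nc]:
                    conn[nr][nc] = cand
                    heapq.heappush(heap, (-cand, nr, nc))
    return conn


# Toy image: a bright 2x2 object in the top-left corner on a dark background.
image = [
    [100, 100, 10],
    [100, 100, 10],
    [10, 10, 10],
]
conn = fuzzy_connectedness(image, seed=(0, 0))
mask = [[v >= 0.5 for v in row] for row in conn]  # threshold to get the object
```

Thresholding the connectedness map yields the segmented object. The affinity evaluations are independent per pixel pair, which suggests why that task maps naturally onto GPU threads.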
Parallel fuzzy connected image segmentation on GPU.

PubMed

Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W

2011-07-01

Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on the GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves speed-up factors of 24.4x, 18.1x, and 10.3x, respectively, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, respectively, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on NVIDIA GPUs, which are far more cost- and speed-effective than both clusters of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.
