The parallel algorithm for the 2D discrete wavelet transform
NASA Astrophysics Data System (ADS)
Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel
2018-04-01
The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, considering a parallel processing using multi-core processors, this scheme is inappropriate due to a large number of steps. On such architectures, the number of steps corresponds to the number of points that represent the exchange of data. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluating on multi-core CPUs, we consistently overcome the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
SKIRT: Hybrid parallelization of radiative transfer simulations
NASA Astrophysics Data System (ADS)
Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.
2017-07-01
We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
Proposed scheme for parallel 10Gb/s VSR system and its verilog HDL realization
NASA Astrophysics Data System (ADS)
Zhou, Yi; Chen, Hongda; Zuo, Chao; Jia, Jiuchun; Shen, Rongxuan; Chen, Xiongbin
2005-02-01
This paper proposes a novel and innovative scheme for 10Gb/s parallel Very Short Reach (VSR) optical communication system. The optimized scheme properly manages the SDH/SONET redundant bytes and adjusts the position of error detecting bytes and error correction bytes. Compared with the OIF-VSR4-01.0 proposal, the scheme has a coding process module. The SDH/SONET frames in transmission direction are disposed as follows: (1) The Framer-Serdes Interface (FSI) gets 16×622.08Mb/s STM-64 frame. (2) The STM-64 frame is byte-wise stripped across 12 channels, all channels are data channels. During this process, the parity bytes and CRC bytes are generated in the similar way as OIF-VSR4-01.0 and stored in the code process module. (3) The code process module will regularly convey the additional parity bytes and CRC bytes to all 12 data channels. (4) After the 8B/10B coding, the 12 channels is transmitted to the parallel VCSEL array. The receive process approximately in reverse order of transmission process. By applying this scheme to 10Gb/s VSR system, the frame size in VSR system is reduced from 15552×12 bytes to 14040×12 bytes, the system redundancy is reduced obviously.
Parallel discrete-event simulation schemes with heterogeneous processing elements.
Kim, Yup; Kwon, Ikhyun; Chae, Huiseung; Yook, Soon-Hyung
2014-07-01
To understand the effects of nonidentical processing elements (PEs) on parallel discrete-event simulation (PDES) schemes, two stochastic growth models, the restricted solid-on-solid (RSOS) model and the Family model, are investigated by simulations. The RSOS model is the model for the PDES scheme governed by the Kardar-Parisi-Zhang equation (KPZ scheme). The Family model is the model for the scheme governed by the Edwards-Wilkinson equation (EW scheme). Two kinds of distributions for nonidentical PEs are considered. In the first kind computing capacities of PEs are not much different, whereas in the second kind the capacities are extremely widespread. The KPZ scheme on the complex networks shows the synchronizability and scalability regardless of the kinds of PEs. The EW scheme never shows the synchronizability for the random configuration of PEs of the first kind. However, by regularizing the arrangement of PEs of the first kind, the EW scheme is made to show the synchronizability. In contrast, EW scheme never shows the synchronizability for any configuration of PEs of the second kind.
Improved CDMA Performance Using Parallel Interference Cancellation
NASA Technical Reports Server (NTRS)
Simon, Marvin; Divsalar, Dariush
1995-01-01
This report considers a general parallel interference cancellation scheme that significantly reduces the degradation effect of user interference but with a lesser implementation complexity than the maximum-likelihood technique. The scheme operates on the fact that parallel processing simultaneously removes from each user the interference produced by the remaining users accessing the channel in an amount proportional to their reliability. The parallel processing can be done in multiple stages. The proposed scheme uses tentative decision devices with different optimum thresholds at the multiple stages to produce the most reliably received data for generation and cancellation of user interference. The 1-stage interference cancellation is analyzed for three types of tentative decision devices, namely, hard, null zone, and soft decision, and two types of user power distribution, namely, equal and unequal powers. Simulation results are given for a multitude of different situations, in particular, those cases for which the analysis is too complex.
A parallel time integrator for noisy nonlinear oscillatory systems
NASA Astrophysics Data System (ADS)
Subber, Waad; Sarkar, Abhijit
2018-06-01
In this paper, we adapt a parallel time integration scheme to track the trajectories of noisy non-linear dynamical systems. Specifically, we formulate a parallel algorithm to generate the sample path of nonlinear oscillator defined by stochastic differential equations (SDEs) using the so-called parareal method for ordinary differential equations (ODEs). The presence of Wiener process in SDEs causes difficulties in the direct application of any numerical integration techniques of ODEs including the parareal algorithm. The parallel implementation of the algorithm involves two SDEs solvers, namely a fine-level scheme to integrate the system in parallel and a coarse-level scheme to generate and correct the required initial conditions to start the fine-level integrators. For the numerical illustration, a randomly excited Duffing oscillator is investigated in order to study the performance of the stochastic parallel algorithm with respect to a range of system parameters. The distributed implementation of the algorithm exploits Massage Passing Interface (MPI).
NASA Astrophysics Data System (ADS)
Kim, Jae Wook
2013-05-01
This paper proposes a novel systematic approach for the parallelization of pentadiagonal compact finite-difference schemes and filters based on domain decomposition. The proposed approach allows a pentadiagonal banded matrix system to be split into quasi-disjoint subsystems by using a linear-algebraic transformation technique. As a result the inversion of pentadiagonal matrices can be implemented within each subdomain in an independent manner subject to a conventional halo-exchange process. The proposed matrix transformation leads to new subdomain boundary (SB) compact schemes and filters that require three halo terms to exchange with neighboring subdomains. The internode communication overhead in the present approach is equivalent to that of standard explicit schemes and filters based on seven-point discretization stencils. The new SB compact schemes and filters demand additional arithmetic operations compared to the original serial ones. However, it is shown that the additional cost becomes sufficiently low by choosing optimal sizes of their discretization stencils. Compared to earlier published results, the proposed SB compact schemes and filters successfully reduce parallelization artifacts arising from subdomain boundaries to a level sufficiently negligible for sophisticated aeroacoustic simulations without degrading parallel efficiency. The overall performance and parallel efficiency of the proposed approach are demonstrated by stringent benchmark tests.
Hielscher, Andreas H; Bartel, Sebastian
2004-02-01
Optical tomography (OT) is a fast developing novel imaging modality that uses near-infrared (NIR) light to obtain cross-sectional views of optical properties inside the human body. A major challenge remains the time-consuming, computational-intensive image reconstruction problem that converts NIR transmission measurements into cross-sectional images. To increase the speed of iterative image reconstruction schemes that are commonly applied for OT, we have developed and implemented several parallel algorithms on a cluster of workstations. Static process distribution as well as dynamic load balancing schemes suitable for heterogeneous clusters and varying machine performances are introduced and tested. The resulting algorithms are shown to accelerate the reconstruction process to various degrees, substantially reducing the computation times for clinically relevant problems.
Multiple grid problems on concurrent-processing computers
NASA Technical Reports Server (NTRS)
Eberhardt, D. S.; Baganoff, D.
1986-01-01
Three computer codes were studied which make use of concurrent processing computer architectures in computational fluid dynamics (CFD). The three parallel codes were tested on a two processor multiple-instruction/multiple-data (MIMD) facility at NASA Ames Research Center, and are suggested for efficient parallel computations. The first code is a well-known program which makes use of the Beam and Warming, implicit, approximate factored algorithm. This study demonstrates the parallelism found in a well-known scheme and it achieved speedups exceeding 1.9 on the two processor MIMD test facility. The second code studied made use of an embedded grid scheme which is used to solve problems having complex geometries. The particular application for this study considered an airfoil/flap geometry in an incompressible flow. The scheme eliminates some of the inherent difficulties found in adapting approximate factorization techniques onto MIMD machines and allows the use of chaotic relaxation and asynchronous iteration techniques. The third code studied is an application of overset grids to a supersonic blunt body problem. The code addresses the difficulties encountered when using embedded grids on a compressible, and therefore nonlinear, problem. The complex numerical boundary system associated with overset grids is discussed and several boundary schemes are suggested. A boundary scheme based on the method of characteristics achieved the best results.
A transient FETI methodology for large-scale parallel implicit computations in structural mechanics
NASA Technical Reports Server (NTRS)
Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier
1992-01-01
Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.
Hamming and Accumulator Codes Concatenated with MPSK or QAM
NASA Technical Reports Server (NTRS)
Divsalar, Dariush; Dolinar, Samuel
2009-01-01
In a proposed coding-and-modulation scheme, a high-rate binary data stream would be processed as follows: 1. The input bit stream would be demultiplexed into multiple bit streams. 2. The multiple bit streams would be processed simultaneously into a high-rate outer Hamming code that would comprise multiple short constituent Hamming codes a distinct constituent Hamming code for each stream. 3. The streams would be interleaved. The interleaver would have a block structure that would facilitate parallelization for high-speed decoding. 4. The interleaved streams would be further processed simultaneously into an inner two-state, rate-1 accumulator code that would comprise multiple constituent accumulator codes - a distinct accumulator code for each stream. 5. The resulting bit streams would be mapped into symbols to be transmitted by use of a higher-order modulation - for example, M-ary phase-shift keying (MPSK) or quadrature amplitude modulation (QAM). The novelty of the scheme lies in the concatenation of the multiple-constituent Hamming and accumulator codes and the corresponding parallel architectures of the encoder and decoder circuitry (see figure) needed to process the multiple bit streams simultaneously. As in the cases of other parallel-processing schemes, one advantage of this scheme is that the overall data rate could be much greater than the data rate of each encoder and decoder stream and, hence, the encoder and decoder could handle data at an overall rate beyond the capability of the individual encoder and decoder circuits.
Fast, Massively Parallel Data Processors
NASA Technical Reports Server (NTRS)
Heaton, Robert A.; Blevins, Donald W.; Davis, ED
1994-01-01
Proposed fast, massively parallel data processor contains 8x16 array of processing elements with efficient interconnection scheme and options for flexible local control. Processing elements communicate with each other on "X" interconnection grid with external memory via high-capacity input/output bus. This approach to conditional operation nearly doubles speed of various arithmetic operations.
Two schemes for rapid generation of digital video holograms using PC cluster
NASA Astrophysics Data System (ADS)
Park, Hanhoon; Song, Joongseok; Kim, Changseob; Park, Jong-Il
2017-12-01
Computer-generated holography (CGH), which is a process of generating digital holograms, is computationally expensive. Recently, several methods/systems of parallelizing the process using graphic processing units (GPUs) have been proposed. Indeed, use of multiple GPUs or a personal computer (PC) cluster (each PC with GPUs) enabled great improvements in the process speed. However, extant literature has less often explored systems involving rapid generation of multiple digital holograms and specialized systems for rapid generation of a digital video hologram. This study proposes a system that uses a PC cluster and is able to more efficiently generate a video hologram. The proposed system is designed to simultaneously generate multiple frames and accelerate the generation by parallelizing the CGH computations across a number of frames, as opposed to separately generating each individual frame while parallelizing the CGH computations within each frame. The proposed system also enables the subprocesses for generating each frame to execute in parallel through multithreading. With these two schemes, the proposed system significantly reduced the data communication time for generating a digital hologram when compared with that of the state-of-the-art system.
Studies in optical parallel processing. [All optical and electro-optic approaches
NASA Technical Reports Server (NTRS)
Lee, S. H.
1978-01-01
Threshold and A/D devices for converting a gray scale image into a binary one were investigated for all-optical and opto-electronic approaches to parallel processing. Integrated optical logic circuits (IOC) and optical parallel logic devices (OPA) were studied as an approach to processing optical binary signals. In the IOC logic scheme, a single row of an optical image is coupled into the IOC substrate at a time through an array of optical fibers. Parallel processing is carried out out, on each image element of these rows, in the IOC substrate and the resulting output exits via a second array of optical fibers. The OPAL system for parallel processing which uses a Fabry-Perot interferometer for image thresholding and analog-to-digital conversion, achieves a higher degree of parallel processing than is possible with IOC.
[CMACPAR an modified parallel neuro-controller for control processes].
Ramos, E; Surós, R
1999-01-01
CMACPAR is a Parallel Neurocontroller oriented to real time systems as for example Control Processes. Its characteristics are mainly a fast learning algorithm, a reduced number of calculations, great generalization capacity, local learning and intrinsic parallelism. This type of neurocontroller is used in real time applications required by refineries, hydroelectric centers, factories, etc. In this work we present the analysis and the parallel implementation of a modified scheme of the Cerebellar Model CMAC for the n-dimensional space projection using a mean granularity parallel neurocontroller. The proposed memory management allows for a significant memory reduction in training time and required memory size.
Parallel algorithm for computation of second-order sequential best rotations
NASA Astrophysics Data System (ADS)
Redif, Soydan; Kasap, Server
2013-12-01
Algorithms for computing an approximate polynomial matrix eigenvalue decomposition of para-Hermitian systems have emerged as a powerful, generic signal processing tool. A technique that has shown much success in this regard is the sequential best rotation (SBR2) algorithm. Proposed is a scheme for parallelising SBR2 with a view to exploiting the modern architectural features and inherent parallelism of field-programmable gate array (FPGA) technology. Experiments show that the proposed scheme can achieve low execution times while requiring minimal FPGA resources.
Partitioning in parallel processing of production systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oflazer, K.
1987-01-01
This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpretermore » with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.« less
A novel parallel architecture for local histogram equalization
NASA Astrophysics Data System (ADS)
Ohannessian, Mesrob I.; Choueiter, Ghinwa F.; Diab, Hassan
2005-07-01
Local histogram equalization is an image enhancement algorithm that has found wide application in the pre-processing stage of areas such as computer vision, pattern recognition and medical imaging. The computationally intensive nature of the procedure, however, is a main limitation when real time interactive applications are in question. This work explores the possibility of performing parallel local histogram equalization, using an array of special purpose elementary processors, through an HDL implementation that targets FPGA or ASIC platforms. A novel parallelization scheme is presented and the corresponding architecture is derived. The algorithm is reduced to pixel-level operations. Processing elements are assigned image blocks, to maintain a reasonable performance-cost ratio. To further simplify both processor and memory organizations, a bit-serial access scheme is used. A brief performance assessment is provided to illustrate and quantify the merit of the approach.
Recent Progress on the Parallel Implementation of Moving-Body Overset Grid Schemes
NASA Technical Reports Server (NTRS)
Wissink, Andrew; Allen, Edwin (Technical Monitor)
1998-01-01
Viscous calculations about geometrically complex bodies in which there is relative motion between component parts is one of the most computationally demanding problems facing CFD researchers today. This presentation documents results from the first two years of a CHSSI-funded effort within the U.S. Army AFDD to develop scalable dynamic overset grid methods for unsteady viscous calculations with moving-body problems. The first pan of the presentation will focus on results from OVERFLOW-D1, a parallelized moving-body overset grid scheme that employs traditional Chimera methodology. The two processes that dominate the cost of such problems are the flow solution on each component and the intergrid connectivity solution. Parallel implementations of the OVERFLOW flow solver and DCF3D connectivity software are coupled with a proposed two-part static-dynamic load balancing scheme and tested on the IBM SP and Cray T3E multi-processors. The second part of the presentation will cover some recent results from OVERFLOW-D2, a new flow solver that employs Cartesian grids with various levels of refinement, facilitating solution adaption. A study of the parallel performance of the scheme on large distributed- memory multiprocessor computer architectures will be reported.
NASA Astrophysics Data System (ADS)
Kumari, Komal; Donzis, Diego
2017-11-01
Highly resolved computational simulations on massively parallel machines are critical in understanding the physics of a vast number of complex phenomena in nature governed by partial differential equations. Simulations at extreme levels of parallelism present many challenges with communication between processing elements (PEs) being a major bottleneck. In order to fully exploit the computational power of exascale machines one needs to devise numerical schemes that relax global synchronizations across PEs. This asynchronous computations, however, have a degrading effect on the accuracy of standard numerical schemes.We have developed asynchrony-tolerant (AT) schemes that maintain order of accuracy despite relaxed communications. We show, analytically and numerically, that these schemes retain their numerical properties with multi-step higher order temporal Runge-Kutta schemes. We also show that for a range of optimized parameters,the computation time and error for AT schemes is less than their synchronous counterpart. Stability of the AT schemes which depends upon history and random nature of delays, are also discussed. Support from NSF is gratefully acknowledged.
Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arampatzis, Giorgos, E-mail: garab@math.uoc.gr; Katsoulakis, Markos A., E-mail: markos@math.umass.edu; Plechac, Petr, E-mail: plechac@math.udel.edu
2012-10-01
We present a mathematical framework for constructing and analyzing parallel algorithms for lattice kinetic Monte Carlo (KMC) simulations. The resulting algorithms have the capacity to simulate a wide range of spatio-temporal scales in spatially distributed, non-equilibrium physiochemical processes with complex chemistry and transport micro-mechanisms. Rather than focusing on constructing exactly the stochastic trajectories, our approach relies on approximating the evolution of observables, such as density, coverage, correlations and so on. More specifically, we develop a spatial domain decomposition of the Markov operator (generator) that describes the evolution of all observables according to the kinetic Monte Carlo algorithm. This domain decompositionmore » corresponds to a decomposition of the Markov generator into a hierarchy of operators and can be tailored to specific hierarchical parallel architectures such as multi-core processors or clusters of Graphical Processing Units (GPUs). Based on this operator decomposition, we formulate parallel Fractional step kinetic Monte Carlo algorithms by employing the Trotter Theorem and its randomized variants; these schemes, (a) are partially asynchronous on each fractional step time-window, and (b) are characterized by their communication schedule between processors. The proposed mathematical framework allows us to rigorously justify the numerical and statistical consistency of the proposed algorithms, showing the convergence of our approximating schemes to the original serial KMC. The approach also provides a systematic evaluation of different processor communicating schedules. We carry out a detailed benchmarking of the parallel KMC schemes using available exact solutions, for example, in Ising-type systems and we demonstrate the capabilities of the method to simulate complex spatially distributed reactions at very large scales on GPUs. Finally, we discuss work load balancing between processors and propose a re-balancing scheme based on probabilistic mass transport methods.« less
A parallel graded-mesh FDTD algorithm for human-antenna interaction problems.
Catarinucci, Luca; Tarricone, Luciano
2009-01-01
The finite difference time domain method (FDTD) is frequently used for the numerical solution of a wide variety of electromagnetic (EM) problems and, among them, those concerning human exposure to EM fields. In many practical cases related to the assessment of occupational EM exposure, large simulation domains are modeled and high space resolution adopted, so that strong memory and central processing unit power requirements have to be satisfied. To better afford the computational effort, the use of parallel computing is a winning approach; alternatively, subgridding techniques are often implemented. However, the simultaneous use of subgridding schemes and parallel algorithms is very new. In this paper, an easy-to-implement and highly-efficient parallel graded-mesh (GM) FDTD scheme is proposed and applied to human-antenna interaction problems, demonstrating its appropriateness in dealing with complex occupational tasks and showing its capability to guarantee the advantages of a traditional subgridding technique without affecting the parallel FDTD performance.
A parallel finite-difference method for computational aerodynamics
NASA Technical Reports Server (NTRS)
Swisshelm, Julie M.
1989-01-01
A finite-difference scheme for solving complex three-dimensional aerodynamic flow on parallel-processing supercomputers is presented. The method consists of a basic flow solver with multigrid convergence acceleration, embedded grid refinements, and a zonal equation scheme. Multitasking and vectorization have been incorporated into the algorithm. Results obtained include multiprocessed flow simulations from the Cray X-MP and Cray-2. Speedups as high as 3.3 for the two-dimensional case and 3.5 for segments of the three-dimensional case have been achieved on the Cray-2. The entire solver attained a factor of 2.7 improvement over its unitasked version on the Cray-2. The performance of the parallel algorithm on each machine is analyzed.
NETRA: A parallel architecture for integrated vision systems. 1: Architecture and organization
NASA Technical Reports Server (NTRS)
Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is considered to be a system that uses vision algorithms from all levels of processing for a high level application (such as object recognition). A model of computation is presented for parallel processing for an IVS. Using the model, desired features and capabilities of a parallel architecture suitable for IVSs are derived. Then a multiprocessor architecture (called NETRA) is presented. This architecture is highly flexible without the use of complex interconnection schemes. The topology of NETRA is recursively defined and hence is easily scalable from small to large systems. Homogeneity of NETRA permits fault tolerance and graceful degradation under faults. It is a recursively defined tree-type hierarchical architecture where each of the leaf nodes consists of a cluster of processors connected with a programmable crossbar with selective broadcast capability to provide for desired flexibility. A qualitative evaluation of NETRA is presented. Then general schemes are described to map parallel algorithms onto NETRA. Algorithms are classified according to their communication requirements for parallel processing. An extensive analysis of inter-cluster communication strategies in NETRA is presented, and parameters affecting performance of parallel algorithms when mapped on NETRA are discussed. Finally, a methodology to evaluate performance of algorithms on NETRA is described.
Development of iterative techniques for the solution of unsteady compressible viscous flows
NASA Technical Reports Server (NTRS)
Hixon, Duane; Sankar, L. N.
1993-01-01
During the past two decades, there has been significant progress in the field of numerical simulation of unsteady compressible viscous flows. At present, a variety of solution techniques exist such as the transonic small disturbance analyses (TSD), transonic full potential equation-based methods, unsteady Euler solvers, and unsteady Navier-Stokes solvers. These advances have been made possible by developments in three areas: (1) improved numerical algorithms; (2) automation of body-fitted grid generation schemes; and (3) advanced computer architectures with vector processing and massively parallel processing features. In this work, the GMRES scheme has been considered as a candidate for acceleration of a Newton iteration time marching scheme for unsteady 2-D and 3-D compressible viscous flow calculation; from preliminary calculations, this will provide up to a 65 percent reduction in the computer time requirements over the existing class of explicit and implicit time marching schemes. The proposed method has ben tested on structured grids, but is flexible enough for extension to unstructured grids. The described scheme has been tested only on the current generation of vector processor architecture of the Cray Y/MP class, but should be suitable for adaptation to massively parallel machines.
Parallelization of implicit finite difference schemes in computational fluid dynamics
NASA Technical Reports Server (NTRS)
Decker, Naomi H.; Naik, Vijay K.; Nicoules, Michel
1990-01-01
Implicit finite difference schemes are often the preferred numerical schemes in computational fluid dynamics, requiring less stringent stability bounds than the explicit schemes. Each iteration in an implicit scheme involves global data dependencies in the form of second and higher order recurrences. Efficient parallel implementations of such iterative methods are considerably more difficult and non-intuitive. The parallelization of the implicit schemes that are used for solving the Euler and the thin layer Navier-Stokes equations and that require inversions of large linear systems in the form of block tri-diagonal and/or block penta-diagonal matrices is discussed. Three-dimensional cases are emphasized and schemes that minimize the total execution time are presented. Partitioning and scheduling schemes for alleviating the effects of the global data dependencies are described. An analysis of the communication and the computation aspects of these methods is presented. The effect of the boundary conditions on the parallel schemes is also discussed.
Sequential Feedback Scheme Outperforms the Parallel Scheme for Hamiltonian Parameter Estimation.
Yuan, Haidong
2016-10-14
Measurement and estimation of parameters are essential for science and engineering, where the main quest is to find the highest achievable precision with the given resources and design schemes to attain it. Two schemes, the sequential feedback scheme and the parallel scheme, are usually studied in the quantum parameter estimation. While the sequential feedback scheme represents the most general scheme, it remains unknown whether it can outperform the parallel scheme for any quantum estimation tasks. In this Letter, we show that the sequential feedback scheme has a threefold improvement over the parallel scheme for Hamiltonian parameter estimations on two-dimensional systems, and an order of O(d+1) improvement for Hamiltonian parameter estimation on d-dimensional systems. We also show that, contrary to the conventional belief, it is possible to simultaneously achieve the highest precision for estimating all three components of a magnetic field, which sets a benchmark on the local precision limit for the estimation of a magnetic field.
Adaptive implicit-explicit and parallel element-by-element iteration schemes
NASA Technical Reports Server (NTRS)
Tezduyar, T. E.; Liou, J.; Nguyen, T.; Poole, S.
1989-01-01
Adaptive implicit-explicit (AIE) and grouped element-by-element (GEBE) iteration schemes are presented for the finite element solution of large-scale problems in computational mechanics and physics. The AIE approach is based on the dynamic arrangement of the elements into differently treated groups. The GEBE procedure, which is a way of rewriting the EBE formulation to make its parallel processing potential and implementation more clear, is based on the static arrangement of the elements into groups with no inter-element coupling within each group. Various numerical tests performed demonstrate the savings in the CPU time and memory.
NASA Astrophysics Data System (ADS)
Shinya, A.; Ishihara, T.; Inoue, K.; Nozaki, K.; Kita, S.; Notomi, M.
2018-02-01
We propose an optical parallel adder based on a binary decision diagram that can calculate simply by propagating light through electrically controlled optical pass gates. The CARRY and CARRY operations are multiplexed in one circuit by a wavelength division multiplexing scheme to reduce the number of optical elements, and only a single gate constitutes the critical path for one digit calculation. The processing time reaches picoseconds per digit when we use a 100-μm-long optical path gates, which is ten times faster than a CMOS circuit.
High order parallel numerical schemes for solving incompressible flows
NASA Technical Reports Server (NTRS)
Lin, Avi; Milner, Edward J.; Liou, May-Fun; Belch, Richard A.
1992-01-01
The use of parallel computers for numerically solving flow fields has gained much importance in recent years. This paper introduces a new high order numerical scheme for computational fluid dynamics (CFD) specifically designed for parallel computational environments. A distributed MIMD system gives the flexibility of treating different elements of the governing equations with totally different numerical schemes in different regions of the flow field. The parallel decomposition of the governing operator to be solved is the primary parallel split. The primary parallel split was studied using a hypercube like architecture having clusters of shared memory processors at each node. The approach is demonstrated using examples of simple steady state incompressible flows. Future studies should investigate the secondary split because, depending on the numerical scheme that each of the processors applies and the nature of the flow in the specific subdomain, it may be possible for a processor to seek better, or higher order, schemes for its particular subcase.
Control of parallel manipulators using force feedback
NASA Technical Reports Server (NTRS)
Nanua, Prabjot
1994-01-01
Two control schemes are compared for parallel robotic mechanisms actuated by hydraulic cylinders. One scheme, the 'rate based scheme', uses the position and rate information only for feedback. The second scheme, the 'force based scheme' feeds back the force information also. The force control scheme is shown to improve the response over the rate control one. It is a simple constant gain control scheme better suited to parallel mechanisms. The force control scheme can be easily modified for the dynamic forces on the end effector. This paper presents the results of a computer simulation of both the rate and force control schemes. The gains in the force based scheme can be individually adjusted in all three directions, whereas the adjustment in just one direction of the rate based scheme directly affects the other two directions.
Zhou, Xian; Chen, Xue
2011-05-09
The digital coherent receivers combine coherent detection with digital signal processing (DSP) to compensate for transmission impairments, and therefore are a promising candidate for future high-speed optical transmission system. However, the maximum symbol rate supported by such real-time receivers is limited by the processing rate of hardware. In order to cope with this difficulty, the parallel processing algorithms is imperative. In this paper, we propose a novel parallel digital timing recovery loop (PDTRL) based on our previous work. Furthermore, for increasing the dynamic dispersion tolerance range of receivers, we embed a parallel adaptive equalizer in the PDTRL. This parallel joint scheme (PJS) can be used to complete synchronization, equalization and polarization de-multiplexing simultaneously. Finally, we demonstrate that PDTRL and PJS allow the hardware to process 112 Gbit/s POLMUX-DQPSK signal at the hundreds MHz range. © 2011 Optical Society of America
Parallelization of a hydrological model using the message passing interface
Wu, Yiping; Li, Tiejian; Sun, Liqun; Chen, Ji
2013-01-01
With the increasing knowledge about the natural processes, hydrological models such as the Soil and Water Assessment Tool (SWAT) are becoming larger and more complex with increasing computation time. Additionally, other procedures such as model calibration, which may require thousands of model iterations, can increase running time and thus further reduce rapid modeling and analysis. Using the widely-applied SWAT as an example, this study demonstrates how to parallelize a serial hydrological model in a Windows® environment using a parallel programing technology—Message Passing Interface (MPI). With a case study, we derived the optimal values for the two parameters (the number of processes and the corresponding percentage of work to be distributed to the master process) of the parallel SWAT (P-SWAT) on an ordinary personal computer and a work station. Our study indicates that model execution time can be reduced by 42%–70% (or a speedup of 1.74–3.36) using multiple processes (two to five) with a proper task-distribution scheme (between the master and slave processes). Although the computation time cost becomes lower with an increasing number of processes (from two to five), this enhancement becomes less due to the accompanied increase in demand for message passing procedures between the master and all slave processes. Our case study demonstrates that the P-SWAT with a five-process run may reach the maximum speedup, and the performance can be quite stable (fairly independent of a project size). Overall, the P-SWAT can help reduce the computation time substantially for an individual model run, manual and automatic calibration procedures, and optimization of best management practices. In particular, the parallelization method we used and the scheme for deriving the optimal parameters in this study can be valuable and easily applied to other hydrological or environmental models.
NASA Astrophysics Data System (ADS)
Samaké, Abdoulaye; Rampal, Pierre; Bouillon, Sylvain; Ólason, Einar
2017-12-01
We present a parallel implementation framework for a new dynamic/thermodynamic sea-ice model, called neXtSIM, based on the Elasto-Brittle rheology and using an adaptive mesh. The spatial discretisation of the model is done using the finite-element method. The temporal discretisation is semi-implicit and the advection is achieved using either a pure Lagrangian scheme or an Arbitrary Lagrangian Eulerian scheme (ALE). The parallel implementation presented here focuses on the distributed-memory approach using the message-passing library MPI. The efficiency and the scalability of the parallel algorithms are illustrated by the numerical experiments performed using up to 500 processor cores of a cluster computing system. The performance obtained by the proposed parallel implementation of the neXtSIM code is shown being sufficient to perform simulations for state-of-the-art sea ice forecasting and geophysical process studies over geographical domain of several millions squared kilometers like the Arctic region.
Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito
2016-11-15
A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements from the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been performed: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
Multilevel decomposition of complete vehicle configuration in a parallel computing environment
NASA Technical Reports Server (NTRS)
Bhatt, Vinay; Ragsdell, K. M.
1989-01-01
This research summarizes various approaches to multilevel decomposition to solve large structural problems. A linear decomposition scheme based on the Sobieski algorithm is selected as a vehicle for automated synthesis of a complete vehicle configuration in a parallel processing environment. The research is in a developmental state. Preliminary numerical results are presented for several example problems.
Efficient parallel implicit methods for rotary-wing aerodynamics calculations
NASA Astrophysics Data System (ADS)
Wissink, Andrew M.
Euler/Navier-Stokes Computational Fluid Dynamics (CFD) methods are commonly used for prediction of the aerodynamics and aeroacoustics of modern rotary-wing aircraft. However, their widespread application to large complex problems is limited lack of adequate computing power. Parallel processing offers the potential for dramatic increases in computing power, but most conventional implicit solution methods are inefficient in parallel and new techniques must be adopted to realize its potential. This work proposes alternative implicit schemes for Euler/Navier-Stokes rotary-wing calculations which are robust and efficient in parallel. The first part of this work proposes an efficient parallelizable modification of the Lower Upper-Symmetric Gauss Seidel (LU-SGS) implicit operator used in the well-known Transonic Unsteady Rotor Navier Stokes (TURNS) code. The new hybrid LU-SGS scheme couples a point-relaxation approach of the Data Parallel-Lower Upper Relaxation (DP-LUR) algorithm for inter-processor communication with the Symmetric Gauss Seidel algorithm of LU-SGS for on-processor computations. With the modified operator, TURNS is implemented in parallel using Message Passing Interface (MPI) for communication. Numerical performance and parallel efficiency are evaluated on the IBM SP2 and Thinking Machines CM-5 multi-processors for a variety of steady-state and unsteady test cases. The hybrid LU-SGS scheme maintains the numerical performance of the original LU-SGS algorithm in all cases and shows a good degree of parallel efficiency. It experiences a higher degree of robustness than DP-LUR for third-order upwind solutions. The second part of this work examines use of Krylov subspace iterative solvers for the nonlinear CFD solutions. The hybrid LU-SGS scheme is used as a parallelizable preconditioner. Two iterative methods are tested, Generalized Minimum Residual (GMRES) and Orthogonal s-Step Generalized Conjugate Residual (OSGCR). The Newton method demonstrates good parallel performance on the IBM SP2, with OS-GCR giving slightly better performance than GMRES on large numbers of processors. For steady and quasi-steady calculations, the convergence rate is accelerated but the overall solution time remains about the same as the standard hybrid LU-SGS scheme. For unsteady calculations, however, the Newton method maintains a higher degree of time-accuracy which allows tbe use of larger timesteps and results in CPU savings of 20-35%.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zheng, Xiang; Yang, Chao; State Key Laboratory of Computer Science, Chinese Academy of Sciences, Beijing 100190
2015-03-15
We present a numerical algorithm for simulating the spinodal decomposition described by the three dimensional Cahn–Hilliard–Cook (CHC) equation, which is a fourth-order stochastic partial differential equation with a noise term. The equation is discretized in space and time based on a fully implicit, cell-centered finite difference scheme, with an adaptive time-stepping strategy designed to accelerate the progress to equilibrium. At each time step, a parallel Newton–Krylov–Schwarz algorithm is used to solve the nonlinear system. We discuss various numerical and computational challenges associated with the method. The numerical scheme is validated by a comparison with an explicit scheme of high accuracymore » (and unreasonably high cost). We present steady state solutions of the CHC equation in two and three dimensions. The effect of the thermal fluctuation on the spinodal decomposition process is studied. We show that the existence of the thermal fluctuation accelerates the spinodal decomposition process and that the final steady morphology is sensitive to the stochastic noise. We also show the evolution of the energies and statistical moments. In terms of the parallel performance, it is found that the implicit domain decomposition approach scales well on supercomputers with a large number of processors.« less
Update schemes of multi-velocity floor field cellular automaton for pedestrian dynamics
NASA Astrophysics Data System (ADS)
Luo, Lin; Fu, Zhijian; Cheng, Han; Yang, Lizhong
2018-02-01
Modeling pedestrian movement is an interesting problem both in statistical physics and in computational physics. Update schemes of cellular automaton (CA) models for pedestrian dynamics govern the schedule of pedestrian movement. Usually, different update schemes make the models behave in different ways, which should be carefully recalibrated. Thus, in this paper, we investigated the influence of four different update schemes, namely parallel/synchronous scheme, random scheme, order-sequential scheme and shuffled scheme, on pedestrian dynamics. The multi-velocity floor field cellular automaton (FFCA) considering the changes of pedestrians' moving properties along walking paths and heterogeneity of pedestrians' walking abilities was used. As for parallel scheme only, the collisions detection and resolution should be considered, resulting in a great difference from any other update schemes. For pedestrian evacuation, the evacuation time is enlarged, and the difference in pedestrians' walking abilities is better reflected, under parallel scheme. In face of a bottleneck, for example a exit, using a parallel scheme leads to a longer congestion period and a more dispersive density distribution. The exit flow and the space-time distribution of density and velocity have significant discrepancies under four different update schemes when we simulate pedestrian flow with high desired velocity. Update schemes may have no influence on pedestrians in simulation to create tendency to follow others, but sequential and shuffled update scheme may enhance the effect of pedestrians' familiarity with environments.
A 64Cycles/MB, Luma-Chroma Parallelized H.264/AVC Deblocking Filter for 4K × 2K Applications
NASA Astrophysics Data System (ADS)
Shen, Weiwei; Fan, Yibo; Zeng, Xiaoyang
In this paper, a high-throughput debloking filter is presented for H.264/AVC standard, catering video applications with 4K × 2K (4096 × 2304) ultra-definition resolution. In order to strengthen the parallelism without simply increasing the area, we propose a luma-chroma parallel method. Meanwhile, this work reduces the number of processing cycles, the amount of external memory traffic and the working frequency, by using triple four-stage pipeline filters and a luma-chroma interlaced sequence. Furthermore, it eliminates most unnecessary off-chip memory bandwidth with a highly reusable memory scheme, and adopts a “slide window” buffer scheme. As a result, our design can support 4K × 2K at 30fps applications at the working frequency of only 70.8MHz.
Distributed Computing for Signal Processing: Modeling of Asynchronous Parallel Computation.
1986-03-01
the proposed approaches 16, 16, 40 . 451. The conclusion most often reached is that the best scheme to use in a particular design depends highly upon...76. 40 . Siegel, H. J., McMillen. R. J., and Mueller. P. T.. Jr. A survey of interconnection methods for reconligurable parallel processing systems...addressing meehaanm distributed in the network area rimonication% tit reach gigabit./second speeds je g.. PoCoS83 .’ i.V--i the lirO! lk i nitronment is
NASA Technical Reports Server (NTRS)
Krasteva, Denitza T.
1998-01-01
Multidisciplinary design optimization (MDO) for large-scale engineering problems poses many challenges (e.g., the design of an efficient concurrent paradigm for global optimization based on disciplinary analyses, expensive computations over vast data sets, etc.) This work focuses on the application of distributed schemes for massively parallel architectures to MDO problems, as a tool for reducing computation time and solving larger problems. The specific problem considered here is configuration optimization of a high speed civil transport (HSCT), and the efficient parallelization of the embedded paradigm for reasonable design space identification. Two distributed dynamic load balancing techniques (random polling and global round robin with message combining) and two necessary termination detection schemes (global task count and token passing) were implemented and evaluated in terms of effectiveness and scalability to large problem sizes and a thousand processors. The effect of certain parameters on execution time was also inspected. Empirical results demonstrated stable performance and effectiveness for all schemes, and the parametric study showed that the selected algorithmic parameters have a negligible effect on performance.
NASA Astrophysics Data System (ADS)
Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A.; Oliveira, Micael J. T.; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G.; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A. L.
2012-06-01
Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.
Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A L
2012-06-13
Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.
Implementation of DFT application on ternary optical computer
NASA Astrophysics Data System (ADS)
Junjie, Peng; Youyi, Fu; Xiaofeng, Zhang; Shuai, Kong; Xinyu, Wei
2018-03-01
As its characteristics of huge number of data bits and low energy consumption, optical computing may be used in the applications such as DFT etc. which needs a lot of computation and can be implemented in parallel. According to this, DFT implementation methods in full parallel as well as in partial parallel are presented. Based on resources ternary optical computer (TOC), extensive experiments were carried out. Experimental results show that the proposed schemes are correct and feasible. They provide a foundation for further exploration of the applications on TOC that needs a large amount calculation and can be processed in parallel.
NASA Astrophysics Data System (ADS)
Bao, J.; Liu, D.; Lin, Z.
2017-10-01
A conservative scheme of drift kinetic electrons for gyrokinetic simulations of kinetic-magnetohydrodynamic processes in toroidal plasmas has been formulated and verified. Both vector potential and electron perturbed distribution function are decomposed into adiabatic part with analytic solution and non-adiabatic part solved numerically. The adiabatic parallel electric field is solved directly from the electron adiabatic response, resulting in a high degree of accuracy. The consistency between electrostatic potential and parallel vector potential is enforced by using the electron continuity equation. Since particles are only used to calculate the non-adiabatic response, which is used to calculate the non-adiabatic vector potential through Ohm's law, the conservative scheme minimizes the electron particle noise and mitigates the cancellation problem. Linear dispersion relations of the kinetic Alfvén wave and the collisionless tearing mode in cylindrical geometry have been verified in gyrokinetic toroidal code simulations, which show that the perpendicular grid size can be larger than the electron collisionless skin depth when the mode wavelength is longer than the electron skin depth.
A robot arm simulation with a shared memory multiprocessor machine
NASA Technical Reports Server (NTRS)
Kim, Sung-Soo; Chuang, Li-Ping
1989-01-01
A parallel processing scheme for a single chain robot arm is presented for high speed computation on a shared memory multiprocessor. A recursive formulation that is derived from a virtual work form of the d'Alembert equations of motion is utilized for robot arm dynamics. A joint drive system that consists of a motor rotor and gears is included in the arm dynamics model, in order to take into account gyroscopic effects due to the spinning of the rotor. The fine grain parallelism of mechanical and control subsystem models is exploited, based on independent computation associated with bodies, joint drive systems, and controllers. Efficiency and effectiveness of the parallel scheme are demonstrated through simulations of a telerobotic manipulator arm. Two different mechanical subsystem models, i.e., with and without gyroscopic effects, are compared, to show the trade-off between efficiency and accuracy.
Fully parallel write/read in resistive synaptic array for accelerating on-chip learning
NASA Astrophysics Data System (ADS)
Gao, Ligang; Wang, I.-Ting; Chen, Pai-Yu; Vrudhula, Sarma; Seo, Jae-sun; Cao, Yu; Hou, Tuo-Hung; Yu, Shimeng
2015-11-01
A neuro-inspired computing paradigm beyond the von Neumann architecture is emerging and it generally takes advantage of massive parallelism and is aimed at complex tasks that involve intelligence and learning. The cross-point array architecture with synaptic devices has been proposed for on-chip implementation of the weighted sum and weight update in the learning algorithms. In this work, forming-free, silicon-process-compatible Ta/TaO x /TiO2/Ti synaptic devices are fabricated, in which >200 levels of conductance states could be continuously tuned by identical programming pulses. In order to demonstrate the advantages of parallelism of the cross-point array architecture, a novel fully parallel write scheme is designed and experimentally demonstrated in a small-scale crossbar array to accelerate the weight update in the training process, at a speed that is independent of the array size. Compared to the conventional row-by-row write scheme, it achieves >30× speed-up and >30× improvement in energy efficiency as projected in a large-scale array. If realistic synaptic device characteristics such as device variations are taken into an array-level simulation, the proposed array architecture is able to achieve ∼95% recognition accuracy of MNIST handwritten digits, which is close to the accuracy achieved by software using the ideal sparse coding algorithm.
NASA Astrophysics Data System (ADS)
Li, Gaohua; Fu, Xiang; Wang, Fuxin
2017-10-01
The low-dissipation high-order accurate hybrid up-winding/central scheme based on fifth-order weighted essentially non-oscillatory (WENO) and sixth-order central schemes, along with the Spalart-Allmaras (SA)-based delayed detached eddy simulation (DDES) turbulence model, and the flow feature-based adaptive mesh refinement (AMR), are implemented into a dual-mesh overset grid infrastructure with parallel computing capabilities, for the purpose of simulating vortex-dominated unsteady detached wake flows with high spatial resolutions. The overset grid assembly (OGA) process based on collection detection theory and implicit hole-cutting algorithm achieves an automatic coupling for the near-body and off-body solvers, and the error-and-try method is used for obtaining a globally balanced load distribution among the composed multiple codes. The results of flows over high Reynolds cylinder and two-bladed helicopter rotor show that the combination of high-order hybrid scheme, advanced turbulence model, and overset adaptive mesh refinement can effectively enhance the spatial resolution for the simulation of turbulent wake eddies.
Molecular Symmetry in Ab Initio Calculations
NASA Astrophysics Data System (ADS)
Madhavan, P. V.; Written, J. L.
1987-05-01
A scheme is presented for the construction of the Fock matrix in LCAO-SCF calculations and for the transformation of basis integrals to LCAO-MO integrals that can utilize several symmetry unique lists of integrals corresponding to different symmetry groups. The algorithm is fully compatible with vector processing machines and is especially suited for parallel processing machines.
NASA Technical Reports Server (NTRS)
Zhang, Jun; Ge, Lixin; Kouatchou, Jules
2000-01-01
A new fourth order compact difference scheme for the three dimensional convection diffusion equation with variable coefficients is presented. The novelty of this new difference scheme is that it Only requires 15 grid points and that it can be decoupled with two colors. The entire computational grid can be updated in two parallel subsweeps with the Gauss-Seidel type iterative method. This is compared with the known 19 point fourth order compact differenCe scheme which requires four colors to decouple the computational grid. Numerical results, with multigrid methods implemented on a shared memory parallel computer, are presented to compare the 15 point and the 19 point fourth order compact schemes.
Xia, Yidong; Lou, Jialin; Luo, Hong; ...
2015-02-09
Here, an OpenACC directive-based graphics processing unit (GPU) parallel scheme is presented for solving the compressible Navier–Stokes equations on 3D hybrid unstructured grids with a third-order reconstructed discontinuous Galerkin method. The developed scheme requires the minimum code intrusion and algorithm alteration for upgrading a legacy solver with the GPU computing capability at very little extra effort in programming, which leads to a unified and portable code development strategy. A face coloring algorithm is adopted to eliminate the memory contention because of the threading of internal and boundary face integrals. A number of flow problems are presented to verify the implementationmore » of the developed scheme. Timing measurements were obtained by running the resulting GPU code on one Nvidia Tesla K20c GPU card (Nvidia Corporation, Santa Clara, CA, USA) and compared with those obtained by running the equivalent Message Passing Interface (MPI) parallel CPU code on a compute node (consisting of two AMD Opteron 6128 eight-core CPUs (Advanced Micro Devices, Inc., Sunnyvale, CA, USA)). Speedup factors of up to 24× and 1.6× for the GPU code were achieved with respect to one and 16 CPU cores, respectively. The numerical results indicate that this OpenACC-based parallel scheme is an effective and extensible approach to port unstructured high-order CFD solvers to GPU computing.« less
Application of Intel Many Integrated Core (MIC) accelerators to the Pleim-Xiu land surface scheme
NASA Astrophysics Data System (ADS)
Huang, Melin; Huang, Bormin; Huang, Allen H.
2015-10-01
The land-surface model (LSM) is one physics process in the weather research and forecast (WRF) model. The LSM includes atmospheric information from the surface layer scheme, radiative forcing from the radiation scheme, and precipitation forcing from the microphysics and convective schemes, together with internal information on the land's state variables and land-surface properties. The LSM is to provide heat and moisture fluxes over land points and sea-ice points. The Pleim-Xiu (PX) scheme is one LSM. The PX LSM features three pathways for moisture fluxes: evapotranspiration, soil evaporation, and evaporation from wet canopies. To accelerate the computation process of this scheme, we employ Intel Xeon Phi Many Integrated Core (MIC) Architecture as it is a multiprocessor computer structure with merits of efficient parallelization and vectorization essentials. Our results show that the MIC-based optimization of this scheme running on Xeon Phi coprocessor 7120P improves the performance by 2.3x and 11.7x as compared to the original code respectively running on one CPU socket (eight cores) and on one CPU core with Intel Xeon E5-2670.
NASA Astrophysics Data System (ADS)
Jha, Pradeep Kumar
Capturing the effects of detailed-chemistry on turbulent combustion processes is a central challenge faced by the numerical combustion community. However, the inherent complexity and non-linear nature of both turbulence and chemistry require that combustion models rely heavily on engineering approximations to remain computationally tractable. This thesis proposes a computationally efficient algorithm for modelling detailed-chemistry effects in turbulent diffusion flames and numerically predicting the associated flame properties. The cornerstone of this combustion modelling tool is the use of parallel Adaptive Mesh Refinement (AMR) scheme with the recently proposed Flame Prolongation of Intrinsic low-dimensional manifold (FPI) tabulated-chemistry approach for modelling complex chemistry. The effect of turbulence on the mean chemistry is incorporated using a Presumed Conditional Moment (PCM) approach based on a beta-probability density function (PDF). The two-equation k-w turbulence model is used for modelling the effects of the unresolved turbulence on the mean flow field. The finite-rate of methane-air combustion is represented here by using the GRI-Mech 3.0 scheme. This detailed mechanism is used to build the FPI tables. A state of the art numerical scheme based on a parallel block-based solution-adaptive algorithm has been developed to solve the Favre-averaged Navier-Stokes (FANS) and other governing partial-differential equations using a second-order accurate, fully-coupled finite-volume formulation on body-fitted, multi-block, quadrilateral/hexahedral mesh for two-dimensional and three-dimensional flow geometries, respectively. A standard fourth-order Runge-Kutta time-marching scheme is used for time-accurate temporal discretizations. Numerical predictions of three different diffusion flames configurations are considered in the present work: a laminar counter-flow flame; a laminar co-flow diffusion flame; and a Sydney bluff-body turbulent reacting flow. Comparisons are made between the predicted results of the present FPI scheme and Steady Laminar Flamelet Model (SLFM) approach for diffusion flames. The effects of grid resolution on the predicted overall flame solutions are also assessed. Other non-reacting flows have also been considered to further validate other aspects of the numerical scheme. The present schemes predict results which are in good agreement with published experimental results and reduces the computational cost involved in modelling turbulent diffusion flames significantly, both in terms of storage and processing time.
Graph Partitioning for Parallel Applications in Heterogeneous Grid Environments
NASA Technical Reports Server (NTRS)
Bisws, Rupak; Kumar, Shailendra; Das, Sajal K.; Biegel, Bryan (Technical Monitor)
2002-01-01
The problem of partitioning irregular graphs and meshes for parallel computations on homogeneous systems has been extensively studied. However, these partitioning schemes fail when the target system architecture exhibits heterogeneity in resource characteristics. With the emergence of technologies such as the Grid, it is imperative to study the partitioning problem taking into consideration the differing capabilities of such distributed heterogeneous systems. In our model, the heterogeneous system consists of processors with varying processing power and an underlying non-uniform communication network. We present in this paper a novel multilevel partitioning scheme for irregular graphs and meshes, that takes into account issues pertinent to Grid computing environments. Our partitioning algorithm, called MiniMax, generates and maps partitions onto a heterogeneous system with the objective of minimizing the maximum execution time of the parallel distributed application. For experimental performance study, we have considered both a realistic mesh problem from NASA as well as synthetic workloads. Simulation results demonstrate that MiniMax generates high quality partitions for various classes of applications targeted for parallel execution in a distributed heterogeneous environment.
Implicit schemes and parallel computing in unstructured grid CFD
NASA Technical Reports Server (NTRS)
Venkatakrishnam, V.
1995-01-01
The development of implicit schemes for obtaining steady state solutions to the Euler and Navier-Stokes equations on unstructured grids is outlined. Applications are presented that compare the convergence characteristics of various implicit methods. Next, the development of explicit and implicit schemes to compute unsteady flows on unstructured grids is discussed. Next, the issues involved in parallelizing finite volume schemes on unstructured meshes in an MIMD (multiple instruction/multiple data stream) fashion are outlined. Techniques for partitioning unstructured grids among processors and for extracting parallelism in explicit and implicit solvers are discussed. Finally, some dynamic load balancing ideas, which are useful in adaptive transient computations, are presented.
NASA Astrophysics Data System (ADS)
Maiti, Anup Kumar; Nath Roy, Jitendra; Mukhopadhyay, Sourangshu
2007-08-01
In the field of optical computing and parallel information processing, several number systems have been used for different arithmetic and algebraic operations. Therefore an efficient conversion scheme from one number system to another is very important. Modified trinary number (MTN) has already taken a significant role towards carry and borrow free arithmetic operations. In this communication, we propose a tree-net architecture based all optical conversion scheme from binary number to its MTN form. Optical switch using nonlinear material (NLM) plays an important role.
Graphics processing unit based computation for NDE applications
NASA Astrophysics Data System (ADS)
Nahas, C. A.; Rajagopal, Prabhu; Balasubramaniam, Krishnan; Krishnamurthy, C. V.
2012-05-01
Advances in parallel processing in recent years are helping to improve the cost of numerical simulation. Breakthroughs in Graphical Processing Unit (GPU) based computation now offer the prospect of further drastic improvements. The introduction of 'compute unified device architecture' (CUDA) by NVIDIA (the global technology company based in Santa Clara, California, USA) has made programming GPUs for general purpose computing accessible to the average programmer. Here we use CUDA to develop parallel finite difference schemes as applicable to two problems of interest to NDE community, namely heat diffusion and elastic wave propagation. The implementations are for two-dimensions. Performance improvement of the GPU implementation against serial CPU implementation is then discussed.
Massive parallel 3D PIC simulation of negative ion extraction
NASA Astrophysics Data System (ADS)
Revel, Adrien; Mochalskyy, Serhiy; Montellano, Ivar Mauricio; Wünderlich, Dirk; Fantz, Ursel; Minea, Tiberiu
2017-09-01
The 3D PIC-MCC code ONIX is dedicated to modeling Negative hydrogen/deuterium Ion (NI) extraction and co-extraction of electrons from radio-frequency driven, low pressure plasma sources. It provides valuable insight on the complex phenomena involved in the extraction process. In previous calculations, a mesh size larger than the Debye length was used, implying numerical electron heating. Important steps have been achieved in terms of computation performance and parallelization efficiency allowing successful massive parallel calculations (4096 cores), imperative to resolve the Debye length. In addition, the numerical algorithms have been improved in terms of grid treatment, i.e., the electric field near the complex geometry boundaries (plasma grid) is calculated more accurately. The revised model preserves the full 3D treatment, but can take advantage of a highly refined mesh. ONIX was used to investigate the role of the mesh size, the re-injection scheme for lost particles (extracted or wall absorbed), and the electron thermalization process on the calculated extracted current and plasma characteristics. It is demonstrated that all numerical schemes give the same NI current distribution for extracted ions. Concerning the electrons, the pair-injection technique is found well-adapted to simulate the sheath in front of the plasma grid.
Distributed computing for membrane-based modeling of action potential propagation.
Porras, D; Rogers, J M; Smith, W M; Pollard, A E
2000-08-01
Action potential propagation simulations with physiologic membrane currents and macroscopic tissue dimensions are computationally expensive. We, therefore, analyzed distributed computing schemes to reduce execution time in workstation clusters by parallelizing solutions with message passing. Four schemes were considered in two-dimensional monodomain simulations with the Beeler-Reuter membrane equations. Parallel speedups measured with each scheme were compared to theoretical speedups, recognizing the relationship between speedup and code portions that executed serially. A data decomposition scheme based on total ionic current provided the best performance. Analysis of communication latencies in that scheme led to a load-balancing algorithm in which measured speedups at 89 +/- 2% and 75 +/- 8% of theoretical speedups were achieved in homogeneous and heterogeneous clusters of workstations. Speedups in this scheme with the Luo-Rudy dynamic membrane equations exceeded 3.0 with eight distributed workstations. Cluster speedups were comparable to those measured during parallel execution on a shared memory machine.
Decentralized Adaptive Control For Robots
NASA Technical Reports Server (NTRS)
Seraji, Homayoun
1989-01-01
Precise knowledge of dynamics not required. Proposed scheme for control of multijointed robotic manipulator calls for independent control subsystem for each joint, consisting of proportional/integral/derivative feedback controller and position/velocity/acceleration feedforward controller, both with adjustable gains. Independent joint controller compensates for unpredictable effects, gravitation, and dynamic coupling between motions of joints, while forcing joints to track reference trajectories. Scheme amenable to parallel processing in distributed computing system wherein each joint controlled by relatively simple algorithm on dedicated microprocessor.
Variance reduction for Fokker–Planck based particle Monte Carlo schemes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gorji, M. Hossein, E-mail: gorjih@ifd.mavt.ethz.ch; Andric, Nemanja; Jenny, Patrick
Recently, Fokker–Planck based particle Monte Carlo schemes have been proposed and evaluated for simulations of rarefied gas flows [1–3]. In this paper, the variance reduction for particle Monte Carlo simulations based on the Fokker–Planck model is considered. First, deviational based schemes were derived and reviewed, and it is shown that these deviational methods are not appropriate for practical Fokker–Planck based rarefied gas flow simulations. This is due to the fact that the deviational schemes considered in this study lead either to instabilities in the case of two-weight methods or to large statistical errors if the direct sampling method is applied.more » Motivated by this conclusion, we developed a novel scheme based on correlated stochastic processes. The main idea here is to synthesize an additional stochastic process with a known solution, which is simultaneously solved together with the main one. By correlating the two processes, the statistical errors can dramatically be reduced; especially for low Mach numbers. To assess the methods, homogeneous relaxation, planar Couette and lid-driven cavity flows were considered. For these test cases, it could be demonstrated that variance reduction based on parallel processes is very robust and effective.« less
NASA Astrophysics Data System (ADS)
Sun, Degui; Wang, Na-Xin; He, Li-Ming; Weng, Zhao-Heng; Wang, Daheng; Chen, Ray T.
1996-06-01
A space-position-logic-encoding scheme is proposed and demonstrated. This encoding scheme not only makes the best use of the convenience of binary logic operation, but is also suitable for the trinary property of modified signed- digit (MSD) numbers. Based on the space-position-logic-encoding scheme, a fully parallel modified signed-digit adder and subtractor is built using optoelectronic switch technologies in conjunction with fiber-multistage 3D optoelectronic interconnects. Thus an effective combination of a parallel algorithm and a parallel architecture is implemented. In addition, the performance of the optoelectronic switches used in this system is experimentally studied and verified. Both the 3-bit experimental model and the experimental results of a parallel addition and a parallel subtraction are provided and discussed. Finally, the speed ratio between the MSD adder and binary adders is discussed and the advantage of the MSD in operating speed is demonstrated.
NASA Technical Reports Server (NTRS)
Mulac, Richard A.; Celestina, Mark L.; Adamczyk, John J.; Misegades, Kent P.; Dawson, Jef M.
1987-01-01
A procedure is outlined which utilizes parallel processing to solve the inviscid form of the average-passage equation system for multistage turbomachinery along with a description of its implementation in a FORTRAN computer code, MSTAGE. A scheme to reduce the central memory requirements of the program is also detailed. Both the multitasking and I/O routines referred to are specific to the Cray X-MP line of computers and its associated SSD (Solid-State Disk). Results are presented for a simulation of a two-stage rocket engine fuel pump turbine.
Parallel optoelectronic trinary signed-digit division
NASA Astrophysics Data System (ADS)
Alam, Mohammad S.
1999-03-01
The trinary signed-digit (TSD) number system has been found to be very useful for parallel addition and subtraction of any arbitrary length operands in constant time. Using the TSD addition and multiplication modules as the basic building blocks, we develop an efficient algorithm for performing parallel TSD division in constant time. The proposed division technique uses one TSD subtraction and two TSD multiplication steps. An optoelectronic correlator based architecture is suggested for implementation of the proposed TSD division algorithm, which fully exploits the parallelism and high processing speed of optics. An efficient spatial encoding scheme is used to ensure better utilization of space bandwidth product of the spatial light modulators used in the optoelectronic implementation.
Hierarchical Parallelism in Finite Difference Analysis of Heat Conduction
NASA Technical Reports Server (NTRS)
Padovan, Joseph; Krishna, Lala; Gute, Douglas
1997-01-01
Based on the concept of hierarchical parallelism, this research effort resulted in highly efficient parallel solution strategies for very large scale heat conduction problems. Overall, the method of hierarchical parallelism involves the partitioning of thermal models into several substructured levels wherein an optimal balance into various associated bandwidths is achieved. The details are described in this report. Overall, the report is organized into two parts. Part 1 describes the parallel modelling methodology and associated multilevel direct, iterative and mixed solution schemes. Part 2 establishes both the formal and computational properties of the scheme.
A parallel-pipelined architecture for a multi carrier demodulator
NASA Astrophysics Data System (ADS)
Kwatra, S. C.; Jamali, M. M.; Eugene, Linus P.
1991-03-01
Analog devices have been used for processing the information on board the satellites. Presently, digital devices are being used because they are economical and flexible as compared to their analog counterparts. Several schemes of digital transmission can be used depending on the data rate requirement of the user. An economical scheme of transmission for small earth stations uses single channel per carrier/frequency division multiple access (SCPC/FDMA) on the uplink and time division multiplexing (TDM) on the downlink. This is a typical communication service offered to low data rate users in commercial mass market. These channels usually pertain to either voice or data transmission. An efficient digital demodulator architecture is provided for a large number of law data rate users. A demodulator primarily consists of carrier, clock, and data recovery modules. This design uses principles of parallel processing, pipelining, and time sharing schemes to process large numbers of voice or data channels. It maintains the optimum throughput which is derived from the designed architecture and from the use of high speed components. The design is optimized for reduced power and area requirements. This is essential for satellite applications. The design is also flexible in processing a group of a varying number of channels. The algorithms that are used are verified by the use of a computer aided software engineering (CASE) tool called the Block Oriented System Simulator. The data flow, control circuitry, and interface of the hardware design is simulated in C language. Also, a multiprocessor approach is provided to map, model, and simulate the demodulation algorithms mainly from a speed view point. A hypercude based architecture implementation is provided for such a scheme of operation. The hypercube structure and the demodulation models on hypercubes are simulated in Ada.
NASA Technical Reports Server (NTRS)
Kwatra, S. C.; Jamali, M. M.; Eugene, Linus P.
1991-01-01
Analog devices have been used for processing the information on board the satellites. Presently, digital devices are being used because they are economical and flexible as compared to their analog counterparts. Several schemes of digital transmission can be used depending on the data rate requirement of the user. An economical scheme of transmission for small earth stations uses single channel per carrier/frequency division multiple access (SCPC/FDMA) on the uplink and time division multiplexing (TDM) on the downlink. This is a typical communication service offered to low data rate users in commercial mass market. These channels usually pertain to either voice or data transmission. An efficient digital demodulator architecture is provided for a large number of law data rate users. A demodulator primarily consists of carrier, clock, and data recovery modules. This design uses principles of parallel processing, pipelining, and time sharing schemes to process large numbers of voice or data channels. It maintains the optimum throughput which is derived from the designed architecture and from the use of high speed components. The design is optimized for reduced power and area requirements. This is essential for satellite applications. The design is also flexible in processing a group of a varying number of channels. The algorithms that are used are verified by the use of a computer aided software engineering (CASE) tool called the Block Oriented System Simulator. The data flow, control circuitry, and interface of the hardware design is simulated in C language. Also, a multiprocessor approach is provided to map, model, and simulate the demodulation algorithms mainly from a speed view point. A hypercude based architecture implementation is provided for such a scheme of operation. The hypercube structure and the demodulation models on hypercubes are simulated in Ada.
Box schemes and their implementation on the iPSC/860
NASA Technical Reports Server (NTRS)
Chattot, J. J.; Merriam, M. L.
1991-01-01
Research on algoriths for efficiently solving fluid flow problems on massively parallel computers is continued in the present paper. Attention is given to the implementation of a box scheme on the iPSC/860, a massively parallel computer with a peak speed of 10 Gflops and a memory of 128 Mwords. A domain decomposition approach to parallelism is used.
1987-09-01
Later, when these alloca- :t)il t rate(-ies beconme a p)erfornmance concern, the schieduler can be inolded 4 toN fit the p~ articular appllicationi... distracts attention from the more important points that this example is intended to demonstrate. The implementation, therefore, is described separately in...for the benefit of outsiders. From the object’s point of view the pipeline is nothing but a list of messages that tell it how to mutate its own state
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF
NASA Astrophysics Data System (ADS)
Mielikainen, J.; Huang, B.; Huang, A.
2011-12-01
The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size fall speeds, distributions and densities. Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language offering programming GPU's directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of CPU. CUDA takes a bottom-up point of view of parallelism is which thread is an atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. Each grid point has multiple levels, which correspond to various vertical heights in the atmosphere. The size of the CONUS 12 km domain is 433 x 308 horizontal grid points with 35 vertical levels. First, the entire SBU-YLIN Fortran code was rewritten in C in preparation of GPU accelerated version. After that, C code was verified against Fortran code for identical outputs. Default compiler options from WRF were used for gfortran and gcc compilers. The processing time for the original Fortran code is 12274 ms and 12893 ms for C version. The processing times for GPU implementation of SBU-YLIN microphysics scheme with I/O are 57.7 ms and 37.2 ms for 1 and 2 GPUs, respectively. The corresponding speedups are 213x and 330x compared to a Fortran implementation. Without I/O the speedup is 896x on 1 GPU. Obviously, ignoring I/O time speedup scales linearly with GPUs. Thus, 2 GPUs have a speedup of 1788x without I/O. Microphysics computation is just a small part of the whole WRF model. After having completely implemented WRF on GPU, the inputs for SBU-YLIN do not have to be transferred from CPU. Instead they are results of previous WRF modules. Therefore, the role of I/O is greatly diminished once all of WRF have been converted to run on GPUs. In the near future, we expect to have a WRF running completely on GPUs for a superior performance.
A massively parallel computational approach to coupled thermoelastic/porous gas flow problems
NASA Technical Reports Server (NTRS)
Shia, David; Mcmanus, Hugh L.
1995-01-01
A new computational scheme for coupled thermoelastic/porous gas flow problems is presented. Heat transfer, gas flow, and dynamic thermoelastic governing equations are expressed in fully explicit form, and solved on a massively parallel computer. The transpiration cooling problem is used as an example problem. The numerical solutions have been verified by comparison to available analytical solutions. Transient temperature, pressure, and stress distributions have been obtained. Small spatial oscillations in pressure and stress have been observed, which would be impractical to predict with previously available schemes. Comparisons between serial and massively parallel versions of the scheme have also been made. The results indicate that for small scale problems the serial and parallel versions use practically the same amount of CPU time. However, as the problem size increases the parallel version becomes more efficient than the serial version.
PIXIE3D: A Parallel, Implicit, eXtended MHD 3D Code.
NASA Astrophysics Data System (ADS)
Chacon, L.; Knoll, D. A.
2004-11-01
We report on the development of PIXIE3D, a 3D parallel, fully implicit Newton-Krylov extended primitive-variable MHD code in general curvilinear geometry. PIXIE3D employs a second-order, finite-volume-based spatial discretization that satisfies remarkable properties such as being conservative, solenoidal in the magnetic field, non-dissipative, and stable in the absence of physical dissipation.(L. Chacón , phComput. Phys. Comm.) submitted (2004) PIXIE3D employs fully-implicit Newton-Krylov methods for the time advance. Currently, first and second-order implicit schemes are available, although higher-order temporal implicit schemes can be effortlessly implemented within the Newton-Krylov framework. A successful, scalable, MG physics-based preconditioning strategy, similar in concept to previous 2D MHD efforts,(L. Chacón et al., phJ. Comput. Phys). 178 (1), 15- 36 (2002); phJ. Comput. Phys., 188 (2), 573-592 (2003) has been developed. We are currently in the process of parallelizing the code using the PETSc library, and a Newton-Krylov-Schwarz approach for the parallel treatment of the preconditioner. In this poster, we will report on both the serial and parallel performance of PIXIE3D, focusing primarily on scalability and CPU speedup vs. an explicit approach.
NASA Astrophysics Data System (ADS)
Abramov, E. Y.; Sopov, V. I.
2017-10-01
In a given research using the example of traction network area with high asymmetry of power supply parameters, the sequence of comparative assessment of power losses in DC traction network with parallel and traditional separated operating modes of traction substation feeders was shown. Experimental measurements were carried out under these modes of operation. The calculation data results based on statistic processing showed the power losses decrease in contact network and the increase in feeders. The changes proved to be critical ones and this demonstrates the significance of potential effects when converting traction network areas into parallel feeder operation. An analytical method of calculation the average power losses for different feed schemes of the traction network was developed. On its basis, the dependences of the relative losses were obtained by varying the difference in feeder voltages. The calculation results showed unreasonableness transition to a two-sided feed scheme for the considered traction network area. A larger reduction in the total power loss can be obtained with a smaller difference of the feeders’ resistance and / or a more symmetrical sectioning scheme of contact network.
NASA Astrophysics Data System (ADS)
Lee, J.; Kim, K.
A Very Large Scale Integration (VLSI) architecture for robot direct kinematic computation suitable for industrial robot manipulators was investigated. The Denavit-Hartenberg transformations are reviewed to exploit a proper processing element, namely an augmented CORDIC. Specifically, two distinct implementations are elaborated on, such as the bit-serial and parallel. Performance of each scheme is analyzed with respect to the time to compute one location of the end-effector of a 6-links manipulator, and the number of transistors required.
NASA Technical Reports Server (NTRS)
Lee, J.; Kim, K.
1991-01-01
A Very Large Scale Integration (VLSI) architecture for robot direct kinematic computation suitable for industrial robot manipulators was investigated. The Denavit-Hartenberg transformations are reviewed to exploit a proper processing element, namely an augmented CORDIC. Specifically, two distinct implementations are elaborated on, such as the bit-serial and parallel. Performance of each scheme is analyzed with respect to the time to compute one location of the end-effector of a 6-links manipulator, and the number of transistors required.
NASA Astrophysics Data System (ADS)
Nurhasanah, F.; Kusumah, Y. S.; Sabandar, J.; Suryadi, D.
2018-05-01
As one of the non-conventional mathematics concepts, Parallel Coordinates is potential to be learned by pre-service mathematics teachers in order to give them experiences in constructing richer schemes and doing abstraction process. Unfortunately, the study related to this issue is still limited. This study wants to answer a research question “to what extent the abstraction process of pre-service mathematics teachers in learning concept of Parallel Coordinates could indicate their performance in learning Analytic Geometry”. This is a case study that part of a larger study in examining mathematical abstraction of pre-service mathematics teachers in learning non-conventional mathematics concept. Descriptive statistics method is used in this study to analyze the scores from three different tests: Cartesian Coordinate, Parallel Coordinates, and Analytic Geometry. The participants in this study consist of 45 pre-service mathematics teachers. The result shows that there is a linear association between the score on Cartesian Coordinate and Parallel Coordinates. There also found that the higher levels of the abstraction process in learning Parallel Coordinates are linearly associated with higher student achievement in Analytic Geometry. The result of this study shows that the concept of Parallel Coordinates has a significant role for pre-service mathematics teachers in learning Analytic Geometry.
NASA Astrophysics Data System (ADS)
Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng
2018-02-01
De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significant reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics
NASA Astrophysics Data System (ADS)
Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.
2017-05-01
We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
Efficient parallel resolution of the simplified transport equations in mixed-dual formulation
NASA Astrophysics Data System (ADS)
Barrault, M.; Lathuilière, B.; Ramet, P.; Roman, J.
2011-03-01
A reactivity computation consists of computing the highest eigenvalue of a generalized eigenvalue problem, for which an inverse power algorithm is commonly used. Very fine modelizations are difficult to treat for our sequential solver, based on the simplified transport equations, in terms of memory consumption and computational time. A first implementation of a Lagrangian based domain decomposition method brings to a poor parallel efficiency because of an increase in the power iterations [1]. In order to obtain a high parallel efficiency, we improve the parallelization scheme by changing the location of the loop over the subdomains in the overall algorithm and by benefiting from the characteristics of the Raviart-Thomas finite element. The new parallel algorithm still allows us to locally adapt the numerical scheme (mesh, finite element order). However, it can be significantly optimized for the matching grid case. The good behavior of the new parallelization scheme is demonstrated for the matching grid case on several hundreds of nodes for computations based on a pin-by-pin discretization.
NASA Technical Reports Server (NTRS)
Mulac, Richard A.; Celestina, Mark L.; Adamczyk, John J.; Misegades, Kent P.; Dawson, Jef M.
1987-01-01
A procedure is outlined which utilizes parallel processing to solve the inviscid form of the average-passage equation system for multistage turbomachinery along with a description of its implementation in a FORTRAN computer code, MSTAGE. A scheme to reduce the central memory requirements of the program is also detailed. Both the multitasking and I/O routines referred to in this paper are specific to the Cray X-MP line of computers and its associated SSD (Solid-state Storage Device). Results are presented for a simulation of a two-stage rocket engine fuel pump turbine.
Information criteria for quantifying loss of reversibility in parallelized KMC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gourgoulias, Konstantinos, E-mail: gourgoul@math.umass.edu; Katsoulakis, Markos A., E-mail: markos@math.umass.edu; Rey-Bellet, Luc, E-mail: luc@math.umass.edu
Parallel Kinetic Monte Carlo (KMC) is a potent tool to simulate stochastic particle systems efficiently. However, despite literature on quantifying domain decomposition errors of the particle system for this class of algorithms in the short and in the long time regime, no study yet explores and quantifies the loss of time-reversibility in Parallel KMC. Inspired by concepts from non-equilibrium statistical mechanics, we propose the entropy production per unit time, or entropy production rate, given in terms of an observable and a corresponding estimator, as a metric that quantifies the loss of reversibility. Typically, this is a quantity that cannot bemore » computed explicitly for Parallel KMC, which is why we develop a posteriori estimators that have good scaling properties with respect to the size of the system. Through these estimators, we can connect the different parameters of the scheme, such as the communication time step of the parallelization, the choice of the domain decomposition, and the computational schedule, with its performance in controlling the loss of reversibility. From this point of view, the entropy production rate can be seen both as an information criterion to compare the reversibility of different parallel schemes and as a tool to diagnose reversibility issues with a particular scheme. As a demonstration, we use Sandia Lab's SPPARKS software to compare different parallelization schemes and different domain (lattice) decompositions.« less
Li, Chuan; Petukh, Marharyta; Li, Lin; Alexov, Emil
2013-08-15
Due to the enormous importance of electrostatics in molecular biology, calculating the electrostatic potential and corresponding energies has become a standard computational approach for the study of biomolecules and nano-objects immersed in water and salt phase or other media. However, the electrostatics of large macromolecules and macromolecular complexes, including nano-objects, may not be obtainable via explicit methods and even the standard continuum electrostatics methods may not be applicable due to high computational time and memory requirements. Here, we report further development of the parallelization scheme reported in our previous work (Li, et al., J. Comput. Chem. 2012, 33, 1960) to include parallelization of the molecular surface and energy calculations components of the algorithm. The parallelization scheme utilizes different approaches such as space domain parallelization, algorithmic parallelization, multithreading, and task scheduling, depending on the quantity being calculated. This allows for efficient use of the computing resources of the corresponding computer cluster. The parallelization scheme is implemented in the popular software DelPhi and results in speedup of several folds. As a demonstration of the efficiency and capability of this methodology, the electrostatic potential, and electric field distributions are calculated for the bovine mitochondrial supercomplex illustrating their complex topology, which cannot be obtained by modeling the supercomplex components alone. Copyright © 2013 Wiley Periodicals, Inc.
Information criteria for quantifying loss of reversibility in parallelized KMC
NASA Astrophysics Data System (ADS)
Gourgoulias, Konstantinos; Katsoulakis, Markos A.; Rey-Bellet, Luc
2017-01-01
Parallel Kinetic Monte Carlo (KMC) is a potent tool to simulate stochastic particle systems efficiently. However, despite literature on quantifying domain decomposition errors of the particle system for this class of algorithms in the short and in the long time regime, no study yet explores and quantifies the loss of time-reversibility in Parallel KMC. Inspired by concepts from non-equilibrium statistical mechanics, we propose the entropy production per unit time, or entropy production rate, given in terms of an observable and a corresponding estimator, as a metric that quantifies the loss of reversibility. Typically, this is a quantity that cannot be computed explicitly for Parallel KMC, which is why we develop a posteriori estimators that have good scaling properties with respect to the size of the system. Through these estimators, we can connect the different parameters of the scheme, such as the communication time step of the parallelization, the choice of the domain decomposition, and the computational schedule, with its performance in controlling the loss of reversibility. From this point of view, the entropy production rate can be seen both as an information criterion to compare the reversibility of different parallel schemes and as a tool to diagnose reversibility issues with a particular scheme. As a demonstration, we use Sandia Lab's SPPARKS software to compare different parallelization schemes and different domain (lattice) decompositions.
Examining the architecture of cellular computing through a comparative study with a computer
Wang, Degeng; Gribskov, Michael
2005-01-01
The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software–hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's ‘hardware’ equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the ‘bandwidth’ of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed. PMID:16849179
Examining the architecture of cellular computing through a comparative study with a computer.
Wang, Degeng; Gribskov, Michael
2005-06-22
The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software-hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's "hardware" equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the "bandwidth" of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed.
Parallel Processable Cryptographic Methods with Unbounded Practical Security.
ERIC Educational Resources Information Center
Rothstein, Jerome
Addressing the problem of protecting confidential information and data stored in computer databases from access by unauthorized parties, this paper details coding schemes which present such astronomical work factors to potential code breakers that security breaches are hopeless in any practical sense. Two procedures which can be used to encode for…
Trinary signed-digit arithmetic using an efficient encoding scheme
NASA Astrophysics Data System (ADS)
Salim, W. Y.; Alam, M. S.; Fyath, R. S.; Ali, S. A.
2000-09-01
The trinary signed-digit (TSD) number system is of interest for ultrafast optoelectronic computing systems since it permits parallel carry-free addition and borrow-free subtraction of two arbitrary length numbers in constant time. In this paper, a simple coding scheme is proposed to encode the decimal number directly into the TSD form. The coding scheme enables one to perform parallel one-step TSD arithmetic operation. The proposed coding scheme uses only a 5-combination coding table instead of the 625-combination table reported recently for recoded TSD arithmetic technique.
One-step trinary signed-digit arithmetic using an efficient encoding scheme
NASA Astrophysics Data System (ADS)
Salim, W. Y.; Fyath, R. S.; Ali, S. A.; Alam, Mohammad S.
2000-11-01
The trinary signed-digit (TSD) number system is of interest for ultra fast optoelectronic computing systems since it permits parallel carry-free addition and borrow-free subtraction of two arbitrary length numbers in constant time. In this paper, a simple coding scheme is proposed to encode the decimal number directly into the TSD form. The coding scheme enables one to perform parallel one-step TSD arithmetic operation. The proposed coding scheme uses only a 5-combination coding table instead of the 625-combination table reported recently for recoded TSD arithmetic technique.
Parallel/distributed direct method for solving linear systems
NASA Technical Reports Server (NTRS)
Lin, Avi
1990-01-01
A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit a near optimal performance and enjoy several important features: (1) For large enough linear systems, the design of the appropriate paralleled algorithm is insensitive to the number of processors as its performance grows monotonically with them; (2) It is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) It can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) This set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmical changes.
Vectorization for Molecular Dynamics on Intel Xeon Phi Corpocessors
NASA Astrophysics Data System (ADS)
Yi, Hongsuk
2014-03-01
Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512 bit vector registers for the high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Terfoff potentials for covalent bond solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism computing. We combine thread-level parallelism using a tightly coupled thread-level and task-level parallelism with 512-bit vector register. The simulation results show that the parallel performance of SIMD implementations on Xeon Phi is apparently superior to their x86 CPU architecture.
NASA Astrophysics Data System (ADS)
Han, Qiguo; Zhu, Kai; Shi, Wenming; Wu, Kuayu; Chen, Kai
2018-02-01
In order to solve the problem of low voltage ride through(LVRT) of the major auxiliary equipment’s variable-frequency drive (VFD) in thermal power plant, the scheme of supercapacitor paralleled in the DC link of VFD is put forward, furthermore, two solutions of direct parallel support and voltage boost parallel support of supercapacitor are proposed. The capacitor values for the relevant motor loads are calculated according to the law of energy conservation, and they are verified by Matlab simulation. At last, a set of test prototype is set up, and the test results prove the feasibility of the proposed schemes.
Design and DSP implementation of star image acquisition and star point fast acquiring and tracking
NASA Astrophysics Data System (ADS)
Zhou, Guohui; Wang, Xiaodong; Hao, Zhihang
2006-02-01
Star sensor is a special high accuracy photoelectric sensor. Attitude acquisition time is an important function index of star sensor. In this paper, the design target is to acquire 10 samples per second dynamic performance. On the basis of analyzing CCD signals timing and star image processing, a new design and a special parallel architecture for improving star image processing are presented in this paper. In the design, the operation moving the data in expanded windows including the star to the on-chip memory of DSP is arranged in the invalid period of CCD frame signal. During the CCD saving the star image to memory, DSP processes the data in the on-chip memory. This parallelism greatly improves the efficiency of processing. The scheme proposed here results in enormous savings of memory normally required. In the scheme, DSP HOLD mode and CPLD technology are used to make a shared memory between CCD and DSP. The efficiency of processing is discussed in numerical tests. Only in 3.5ms is acquired the five lightest stars in the star acquisition stage. In 43us, the data in five expanded windows including stars are moved into the internal memory of DSP, and in 1.6ms, five star coordinates are achieved in the star tracking stage.
NASA Astrophysics Data System (ADS)
Yunguo, Gao
1996-12-01
This scheme structure is for positioning 4000 optical fibres of LAMOST telescope. It adopts the swing rods adjusted parallel and simultaneously by many small tables. The problems, for example, positioning accuracy of the optical fibers, the time to readjust all the 4000 optical fibres and error correction, etc. have been considered in the scheme. The structure has no blind area.
NASA Technical Reports Server (NTRS)
Kasahara, Hironori; Honda, Hiroki; Narita, Seinosuke
1989-01-01
Parallel processing of real-time dynamic systems simulation on a multiprocessor system named OSCAR is presented. In the simulation of dynamic systems, generally, the same calculation are repeated every time step. However, we cannot apply to Do-all or the Do-across techniques for parallel processing of the simulation since there exist data dependencies from the end of an iteration to the beginning of the next iteration and furthermore data-input and data-output are required every sampling time period. Therefore, parallelism inside the calculation required for a single time step, or a large basic block which consists of arithmetic assignment statements, must be used. In the proposed method, near fine grain tasks, each of which consists of one or more floating point operations, are generated to extract the parallelism from the calculation and assigned to processors by using optimal static scheduling at compile time in order to reduce large run time overhead caused by the use of near fine grain tasks. The practicality of the scheme is demonstrated on OSCAR (Optimally SCheduled Advanced multiprocessoR) which has been developed to extract advantageous features of static scheduling algorithms to the maximum extent.
ADAPTIVE TETRAHEDRAL GRID REFINEMENT AND COARSENING IN MESSAGE-PASSING ENVIRONMENTS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hallberg, J.; Stagg, A.
2000-10-01
A grid refinement and coarsening scheme has been developed for tetrahedral and triangular grid-based calculations in message-passing environments. The element adaption scheme is based on an edge bisection of elements marked for refinement by an appropriate error indicator. Hash-table/linked-list data structures are used to store nodal and element formation. The grid along inter-processor boundaries is refined and coarsened consistently with the update of these data structures via MPI calls. The parallel adaption scheme has been applied to the solution of a transient, three-dimensional, nonlinear, groundwater flow problem. Timings indicate efficiency of the grid refinement process relative to the flow solvermore » calculations.« less
NASA Technical Reports Server (NTRS)
Aftosmis, M. J.; Berger, M. J.; Adomavicius, G.
2000-01-01
Preliminary verification and validation of an efficient Euler solver for adaptively refined Cartesian meshes with embedded boundaries is presented. The parallel, multilevel method makes use of a new on-the-fly parallel domain decomposition strategy based upon the use of space-filling curves, and automatically generates a sequence of coarse meshes for processing by the multigrid smoother. The coarse mesh generation algorithm produces grids which completely cover the computational domain at every level in the mesh hierarchy. A series of examples on realistically complex three-dimensional configurations demonstrate that this new coarsening algorithm reliably achieves mesh coarsening ratios in excess of 7 on adaptively refined meshes. Numerical investigations of the scheme's local truncation error demonstrate an achieved order of accuracy between 1.82 and 1.88. Convergence results for the multigrid scheme are presented for both subsonic and transonic test cases and demonstrate W-cycle multigrid convergence rates between 0.84 and 0.94. Preliminary parallel scalability tests on both simple wing and complex complete aircraft geometries shows a computational speedup of 52 on 64 processors using the run-time mesh partitioner.
ERIC Educational Resources Information Center
La Brecque, Mort
1984-01-01
To break the bottleneck inherent in today's linear computer architectures, parallel schemes (which allow computers to perform multiple tasks at one time) are being devised. Several of these schemes are described. Dataflow devices, parallel number-crunchers, programing languages, and a device based on a neurological model are among the areas…
Li, Haiou; Lu, Liyao; Chen, Rong; Quan, Lijun; Xia, Xiaoyan; Lü, Qiang
2014-01-01
Structural information related to protein-peptide complexes can be very useful for novel drug discovery and design. The computational docking of protein and peptide can supplement the structural information available on protein-peptide interactions explored by experimental ways. Protein-peptide docking of this paper can be described as three processes that occur in parallel: ab-initio peptide folding, peptide docking with its receptor, and refinement of some flexible areas of the receptor as the peptide is approaching. Several existing methods have been used to sample the degrees of freedom in the three processes, which are usually triggered in an organized sequential scheme. In this paper, we proposed a parallel approach that combines all the three processes during the docking of a folding peptide with a flexible receptor. This approach mimics the actual protein-peptide docking process in parallel way, and is expected to deliver better performance than sequential approaches. We used 22 unbound protein-peptide docking examples to evaluate our method. Our analysis of the results showed that the explicit refinement of the flexible areas of the receptor facilitated more accurate modeling of the interfaces of the complexes, while combining all of the moves in parallel helped the constructing of energy funnels for predictions.
Optimized Laplacian image sharpening algorithm based on graphic processing unit
NASA Astrophysics Data System (ADS)
Ma, Tinghuai; Li, Lu; Ji, Sai; Wang, Xin; Tian, Yuan; Al-Dhelaan, Abdullah; Al-Rodhaan, Mznah
2014-12-01
In classical Laplacian image sharpening, all pixels are processed one by one, which leads to large amount of computation. Traditional Laplacian sharpening processed on CPU is considerably time-consuming especially for those large pictures. In this paper, we propose a parallel implementation of Laplacian sharpening based on Compute Unified Device Architecture (CUDA), which is a computing platform of Graphic Processing Units (GPU), and analyze the impact of picture size on performance and the relationship between the processing time of between data transfer time and parallel computing time. Further, according to different features of different memory, an improved scheme of our method is developed, which exploits shared memory in GPU instead of global memory and further increases the efficiency. Experimental results prove that two novel algorithms outperform traditional consequentially method based on OpenCV in the aspect of computing speed.
NASA Astrophysics Data System (ADS)
Huang, Melin; Huang, Bormin; Huang, Allen H.
2014-10-01
The Weather Research and Forecasting (WRF) model provided operational services worldwide in many areas and has linked to our daily activity, in particular during severe weather events. The scheme of Yonsei University (YSU) is one of planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column, determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provide atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation process of the YSU scheme, we employ Intel Many Integrated Core (MIC) Architecture as it is a multiprocessor computer structure with merits of efficient parallelization and vectorization essentials. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Volkov, M V; Garanin, S G; Dolgopolov, Yu V
2014-11-30
A seven-channel fibre laser system operated by the master oscillator – multichannel power amplifier scheme is the phase locked using a stochastic parallel gradient algorithm. The phase modulators on lithium niobate crystals are controlled by a multichannel electronic unit with the microcontroller processing signals in real time. The dynamic phase locking of the laser system with the bandwidth of 14 kHz is demonstrated, the time of phasing is 3 – 4 ms. (fibre and integrated-optical structures)
Full Parallel Implementation of an All-Electron Four-Component Dirac-Kohn-Sham Program.
Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Storchi, Loriano
2014-09-09
A full distributed-memory implementation of the Dirac-Kohn-Sham (DKS) module of the program BERTHA (Belpassi et al., Phys. Chem. Chem. Phys. 2011, 13, 12368-12394) is presented, where the self-consistent field (SCF) procedure is replicated on all the parallel processes, each process working on subsets of the global matrices. The key feature of the implementation is an efficient procedure for switching between two matrix distribution schemes, one (integral-driven) optimal for the parallel computation of the matrix elements and another (block-cyclic) optimal for the parallel linear algebra operations. This approach, making both CPU-time and memory scalable with the number of processors used, virtually overcomes at once both time and memory barriers associated with DKS calculations. Performance, portability, and numerical stability of the code are illustrated on the basis of test calculations on three gold clusters of increasing size, an organometallic compound, and a perovskite model. The calculations are performed on a Beowulf and a BlueGene/Q system.
High-order asynchrony-tolerant finite difference schemes for partial differential equations
NASA Astrophysics Data System (ADS)
Aditya, Konduri; Donzis, Diego A.
2017-12-01
Synchronizations of processing elements (PEs) in massively parallel simulations, which arise due to communication or load imbalances between PEs, significantly affect the scalability of scientific applications. We have recently proposed a method based on finite-difference schemes to solve partial differential equations in an asynchronous fashion - synchronization between PEs is relaxed at a mathematical level. While standard schemes can maintain their stability in the presence of asynchrony, their accuracy is drastically affected. In this work, we present a general methodology to derive asynchrony-tolerant (AT) finite difference schemes of arbitrary order of accuracy, which can maintain their accuracy when synchronizations are relaxed. We show that there are several choices available in selecting a stencil to derive these schemes and discuss their effect on numerical and computational performance. We provide a simple classification of schemes based on the stencil and derive schemes that are representative of different classes. Their numerical error is rigorously analyzed within a statistical framework to obtain the overall accuracy of the solution. Results from numerical experiments are used to validate the performance of the schemes.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU.
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
Parallel protein secondary structure prediction based on neural networks.
Zhong, Wei; Altun, Gulsah; Tian, Xinmin; Harrison, Robert; Tai, Phang C; Pan, Yi
2004-01-01
Protein secondary structure prediction has a fundamental influence on today's bioinformatics research. In this work, binary and tertiary classifiers of protein secondary structure prediction are implemented on Denoeux belief neural network (DBNN) architecture. Hydrophobicity matrix, orthogonal matrix, BLOSUM62 and PSSM (position specific scoring matrix) are experimented separately as the encoding schemes for DBNN. The experimental results contribute to the design of new encoding schemes. New binary classifier for Helix versus not Helix ( approximately H) for DBNN produces prediction accuracy of 87% when PSSM is used for the input profile. The performance of DBNN binary classifier is comparable to other best prediction methods. The good test results for binary classifiers open a new approach for protein structure prediction with neural networks. Due to the time consuming task of training the neural networks, Pthread and OpenMP are employed to parallelize DBNN in the hyperthreading enabled Intel architecture. Speedup for 16 Pthreads is 4.9 and speedup for 16 OpenMP threads is 4 in the 4 processors shared memory architecture. Both speedup performance of OpenMP and Pthread is superior to that of other research. With the new parallel training algorithm, thousands of amino acids can be processed in reasonable amount of time. Our research also shows that hyperthreading technology for Intel architecture is efficient for parallel biological algorithms.
Research on moving object detection based on frog's eyes
NASA Astrophysics Data System (ADS)
Fu, Hongwei; Li, Dongguang; Zhang, Xinyuan
2008-12-01
On the basis of object's information processing mechanism with frog's eyes, this paper discussed a bionic detection technology which suitable for object's information processing based on frog's vision. First, the bionics detection theory by imitating frog vision is established, it is an parallel processing mechanism which including pick-up and pretreatment of object's information, parallel separating of digital image, parallel processing, and information synthesis. The computer vision detection system is described to detect moving objects which has special color, special shape, the experiment indicates that it can scheme out the detecting result in the certain interfered background can be detected. A moving objects detection electro-model by imitating biologic vision based on frog's eyes is established, the video simulative signal is digital firstly in this system, then the digital signal is parallel separated by FPGA. IN the parallel processing, the video information can be caught, processed and displayed in the same time, the information fusion is taken by DSP HPI ports, in order to transmit the data which processed by DSP. This system can watch the bigger visual field and get higher image resolution than ordinary monitor systems. In summary, simulative experiments for edge detection of moving object with canny algorithm based on this system indicate that this system can detect the edge of moving objects in real time, the feasibility of bionic model was fully demonstrated in the engineering system, and it laid a solid foundation for the future study of detection technology by imitating biologic vision.
NASA Technical Reports Server (NTRS)
Nguyen, D. T.; Watson, Willie R. (Technical Monitor)
2005-01-01
The overall objectives of this research work are to formulate and validate efficient parallel algorithms, and to efficiently design/implement computer software for solving large-scale acoustic problems, arised from the unified frameworks of the finite element procedures. The adopted parallel Finite Element (FE) Domain Decomposition (DD) procedures should fully take advantages of multiple processing capabilities offered by most modern high performance computing platforms for efficient parallel computation. To achieve this objective. the formulation needs to integrate efficient sparse (and dense) assembly techniques, hybrid (or mixed) direct and iterative equation solvers, proper pre-conditioned strategies, unrolling strategies, and effective processors' communicating schemes. Finally, the numerical performance of the developed parallel finite element procedures will be evaluated by solving series of structural, and acoustic (symmetrical and un-symmetrical) problems (in different computing platforms). Comparisons with existing "commercialized" and/or "public domain" software are also included, whenever possible.
A 3D staggered-grid finite difference scheme for poroelastic wave equation
NASA Astrophysics Data System (ADS)
Zhang, Yijie; Gao, Jinghuai
2014-10-01
Three dimensional numerical modeling has been a viable tool for understanding wave propagation in real media. The poroelastic media can better describe the phenomena of hydrocarbon reservoirs than acoustic and elastic media. However, the numerical modeling in 3D poroelastic media demands significantly more computational capacity, including both computational time and memory. In this paper, we present a 3D poroelastic staggered-grid finite difference (SFD) scheme. During the procedure, parallel computing is implemented to reduce the computational time. Parallelization is based on domain decomposition, and communication between processors is performed using message passing interface (MPI). Parallel analysis shows that the parallelized SFD scheme significantly improves the simulation efficiency and 3D decomposition in domain is the most efficient. We also analyze the numerical dispersion and stability condition of the 3D poroelastic SFD method. Numerical results show that the 3D numerical simulation can provide a real description of wave propagation.
A privacy-preserving parallel and homomorphic encryption scheme
NASA Astrophysics Data System (ADS)
Min, Zhaoe; Yang, Geng; Shi, Jingqi
2017-04-01
In order to protect data privacy whilst allowing efficient access to data in multi-nodes cloud environments, a parallel homomorphic encryption (PHE) scheme is proposed based on the additive homomorphism of the Paillier encryption algorithm. In this paper we propose a PHE algorithm, in which plaintext is divided into several blocks and blocks are encrypted with a parallel mode. Experiment results demonstrate that the encryption algorithm can reach a speed-up ratio at about 7.1 in the MapReduce environment with 16 cores and 4 nodes.
Organizing Compression of Hyperspectral Imagery to Allow Efficient Parallel Decompression
NASA Technical Reports Server (NTRS)
Klimesh, Matthew A.; Kiely, Aaron B.
2014-01-01
family of schemes has been devised for organizing the output of an algorithm for predictive data compression of hyperspectral imagery so as to allow efficient parallelization in both the compressor and decompressor. In these schemes, the compressor performs a number of iterations, during each of which a portion of the data is compressed via parallel threads operating on independent portions of the data. The general idea is that for each iteration it is predetermined how much compressed data will be produced from each thread.
Newton-like methods for Navier-Stokes solution
NASA Astrophysics Data System (ADS)
Qin, N.; Xu, X.; Richards, B. E.
1992-12-01
The paper reports on Newton-like methods called SFDN-alpha-GMRES and SQN-alpha-GMRES methods that have been devised and proven as powerful schemes for large nonlinear problems typical of viscous compressible Navier-Stokes solutions. They can be applied using a partially converged solution from a conventional explicit or approximate implicit method. Developments have included the efficient parallelization of the schemes on a distributed memory parallel computer. The methods are illustrated using a RISC workstation and a transputer parallel system respectively to solve a hypersonic vortical flow.
Particle-in-cell simulations with charge-conserving current deposition on graphic processing units
NASA Astrophysics Data System (ADS)
Ren, Chuang; Kong, Xianglong; Huang, Michael; Decyk, Viktor; Mori, Warren
2011-10-01
Recently using CUDA, we have developed an electromagnetic Particle-in-Cell (PIC) code with charge-conserving current deposition for Nvidia graphic processing units (GPU's) (Kong et al., Journal of Computational Physics 230, 1676 (2011). On a Tesla M2050 (Fermi) card, the GPU PIC code can achieve a one-particle-step process time of 1.2 - 3.2 ns in 2D and 2.3 - 7.2 ns in 3D, depending on plasma temperatures. In this talk we will discuss novel algorithms for GPU-PIC including charge-conserving current deposition scheme with few branching and parallel particle sorting. These algorithms have made efficient use of the GPU shared memory. We will also discuss how to replace the computation kernels of existing parallel CPU codes while keeping their parallel structures. This work was supported by U.S. Department of Energy under Grant Nos. DE-FG02-06ER54879 and DE-FC02-04ER54789 and by NSF under Grant Nos. PHY-0903797 and CCF-0747324.
Comparative Implementation of High Performance Computing for Power System Dynamic Simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Shuangshuang; Huang, Zhenyu; Diao, Ruisheng
Dynamic simulation for transient stability assessment is one of the most important, but intensive, computations for power system planning and operation. Present commercial software is mainly designed for sequential computation to run a single simulation, which is very time consuming with a single processer. The application of High Performance Computing (HPC) to dynamic simulations is very promising in accelerating the computing process by parallelizing its kernel algorithms while maintaining the same level of computation accuracy. This paper describes the comparative implementation of four parallel dynamic simulation schemes in two state-of-the-art HPC environments: Message Passing Interface (MPI) and Open Multi-Processing (OpenMP).more » These implementations serve to match the application with dedicated multi-processor computing hardware and maximize the utilization and benefits of HPC during the development process.« less
Cache-Oblivious parallel SIMD Viterbi decoding for sequence search in HMMER.
Ferreira, Miguel; Roma, Nuno; Russo, Luis M S
2014-05-30
HMMER is a commonly used bioinformatics tool based on Hidden Markov Models (HMMs) to analyze and process biological sequences. One of its main homology engines is based on the Viterbi decoding algorithm, which was already highly parallelized and optimized using Farrar's striped processing pattern with Intel SSE2 instruction set extension. A new SIMD vectorization of the Viterbi decoding algorithm is proposed, based on an SSE2 inter-task parallelization approach similar to the DNA alignment algorithm proposed by Rognes. Besides this alternative vectorization scheme, the proposed implementation also introduces a new partitioning of the Markov model that allows a significantly more efficient exploitation of the cache locality. Such optimization, together with an improved loading of the emission scores, allows the achievement of a constant processing throughput, regardless of the innermost-cache size and of the dimension of the considered model. The proposed optimized vectorization of the Viterbi decoding algorithm was extensively evaluated and compared with the HMMER3 decoder to process DNA and protein datasets, proving to be a rather competitive alternative implementation. Being always faster than the already highly optimized ViterbiFilter implementation of HMMER3, the proposed Cache-Oblivious Parallel SIMD Viterbi (COPS) implementation provides a constant throughput and offers a processing speedup as high as two times faster, depending on the model's size.
A parallel adaptive mesh refinement algorithm
NASA Technical Reports Server (NTRS)
Quirk, James J.; Hanebutte, Ulf R.
1993-01-01
Over recent years, Adaptive Mesh Refinement (AMR) algorithms which dynamically match the local resolution of the computational grid to the numerical solution being sought have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. This method is built upon just a few simple message passing routines and so it may be implemented across a broad class of MIMD machines. Moreover, the method of parallelization is such that the original serial code is left virtually intact, and so we are left with just a single product to support. The importance of this fact should not be underestimated given the size and complexity of the original algorithm.
Leap Frog and Time Step Sub-Cycle Scheme for Coupled Neutronics and Thermal-Hydraulic Codes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lu, S.
2002-07-01
As the result of the advancing TCP/IP based inter-process communication technology, more and more legacy thermal-hydraulic codes have been coupled with neutronics codes to provide best-estimate capabilities for reactivity related reactor transient analysis. Most of the coupling schemes are based on closely coupled serial or parallel approaches. Therefore, the execution of the coupled codes usually requires significant CPU time, when a complicated system is analyzed. Leap Frog scheme has been used to reduce the run time. The extent of the decoupling is usually determined based on a trial and error process for a specific analysis. It is the intent ofmore » this paper to develop a set of general criteria, which can be used to invoke the automatic Leap Frog algorithm. The algorithm will not only provide the run time reduction but also preserve the accuracy. The criteria will also serve as the base of an automatic time step sub-cycle scheme when a sudden reactivity change is introduced and the thermal-hydraulic code is marching with a relatively large time step. (authors)« less
Flexible All-Digital Receiver for Bandwidth Efficient Modulations
NASA Technical Reports Server (NTRS)
Gray, Andrew; Srinivasan, Meera; Simon, Marvin; Yan, Tsun-Yee
2000-01-01
An all-digital high data rate parallel receiver architecture developed jointly by Goddard Space Flight Center and the Jet Propulsion Laboratory is presented. This receiver utilizes only a small number of high speed components along with a majority of lower speed components operating in a parallel frequency domain structure implementable in CMOS, and can currently process up to 600 Mbps with standard QPSK modulation. Performance results for this receiver for bandwidth efficient QPSK modulation schemes such as square-root raised cosine pulse shaped QPSK and Feher's patented QPSK are presented, demonstrating the flexibility of the receiver architecture.
Holographic Associative Memory Employing Phase Conjugation
NASA Astrophysics Data System (ADS)
Soffer, B. H.; Marom, E.; Owechko, Y.; Dunning, G.
1986-12-01
The principle of information retrieval by association has been suggested as a basis for parallel computing and as the process by which human memory functions.1 Various associative processors have been proposed that use electronic or optical means. Optical schemes,2-7 in particular, those based on holographic principles,8'8' are well suited to associative processing because of their high parallelism and information throughput. Previous workers8 demonstrated that holographically stored images can be recalled by using relatively complicated reference images but did not utilize nonlinear feedback to reduce the large cross talk that results when multiple objects are stored and a partial or distorted input is used for retrieval. These earlier approaches were limited in their ability to reconstruct the output object faithfully from a partial input.
Jeong, Kyeong-Min; Kim, Hee-Seung; Hong, Sung-In; Lee, Sung-Keun; Jo, Na-Young; Kim, Yong-Soo; Lim, Hong-Gi; Park, Jae-Hyeung
2012-10-08
Speed enhancement of integral imaging based incoherent Fourier hologram capture using a graphic processing unit is reported. Integral imaging based method enables exact hologram capture of real-existing three-dimensional objects under regular incoherent illumination. In our implementation, we apply parallel computation scheme using the graphic processing unit, accelerating the processing speed. Using enhanced speed of hologram capture, we also implement a pseudo real-time hologram capture and optical reconstruction system. The overall operation speed is measured to be 1 frame per second.
NASA Astrophysics Data System (ADS)
Konduri, Aditya
Many natural and engineering systems are governed by nonlinear partial differential equations (PDEs) which result in a multiscale phenomena, e.g. turbulent flows. Numerical simulations of these problems are computationally very expensive and demand for extreme levels of parallelism. At realistic conditions, simulations are being carried out on massively parallel computers with hundreds of thousands of processing elements (PEs). It has been observed that communication between PEs as well as their synchronization at these extreme scales take up a significant portion of the total simulation time and result in poor scalability of codes. This issue is likely to pose a bottleneck in scalability of codes on future Exascale systems. In this work, we propose an asynchronous computing algorithm based on widely used finite difference methods to solve PDEs in which synchronization between PEs due to communication is relaxed at a mathematical level. We show that while stability is conserved when schemes are used asynchronously, accuracy is greatly degraded. Since message arrivals at PEs are random processes, so is the behavior of the error. We propose a new statistical framework in which we show that average errors drop always to first-order regardless of the original scheme. We propose new asynchrony-tolerant schemes that maintain accuracy when synchronization is relaxed. The quality of the solution is shown to depend, not only on the physical phenomena and numerical schemes, but also on the characteristics of the computing machine. A novel algorithm using remote memory access communications has been developed to demonstrate excellent scalability of the method for large-scale computing. Finally, we present a path to extend this method in solving complex multi-scale problems on Exascale machines.
Computational aspects of helicopter trim analysis and damping levels from Floquet theory
NASA Technical Reports Server (NTRS)
Gaonkar, Gopal H.; Achar, N. S.
1992-01-01
Helicopter trim settings of periodic initial state and control inputs are investigated for convergence of Newton iteration in computing the settings sequentially and in parallel. The trim analysis uses a shooting method and a weak version of two temporal finite element methods with displacement formulation and with mixed formulation of displacements and momenta. These three methods broadly represent two main approaches of trim analysis: adaptation of initial-value and finite element boundary-value codes to periodic boundary conditions, particularly for unstable and marginally stable systems. In each method, both the sequential and in-parallel schemes are used and the resulting nonlinear algebraic equations are solved by damped Newton iteration with an optimally selected damping parameter. The impact of damped Newton iteration, including earlier-observed divergence problems in trim analysis, is demonstrated by the maximum condition number of the Jacobian matrices of the iterative scheme and by virtual elimination of divergence. The advantages of the in-parallel scheme over the conventional sequential scheme are also demonstrated.
Computer Science Techniques Applied to Parallel Atomistic Simulation
NASA Astrophysics Data System (ADS)
Nakano, Aiichiro
1998-03-01
Recent developments in parallel processing technology and multiresolution numerical algorithms have established large-scale molecular dynamics (MD) simulations as a new research mode for studying materials phenomena such as fracture. However, this requires large system sizes and long simulated times. We have developed: i) Space-time multiresolution schemes; ii) fuzzy-clustering approach to hierarchical dynamics; iii) wavelet-based adaptive curvilinear-coordinate load balancing; iv) multilevel preconditioned conjugate gradient method; and v) spacefilling-curve-based data compression for parallel I/O. Using these techniques, million-atom parallel MD simulations are performed for the oxidation dynamics of nanocrystalline Al. The simulations take into account the effect of dynamic charge transfer between Al and O using the electronegativity equalization scheme. The resulting long-range Coulomb interaction is calculated efficiently with the fast multipole method. Results for temperature and charge distributions, residual stresses, bond lengths and bond angles, and diffusivities of Al and O will be presented. The oxidation of nanocrystalline Al is elucidated through immersive visualization in virtual environments. A unique dual-degree education program at Louisiana State University will also be discussed in which students can obtain a Ph.D. in Physics & Astronomy and a M.S. from the Department of Computer Science in five years. This program fosters interdisciplinary research activities for interfacing High Performance Computing and Communications with large-scale atomistic simulations of advanced materials. This work was supported by NSF (CAREER Program), ARO, PRF, and Louisiana LEQSF.
Feasibility study: Liquid hydrogen plant, 30 tons per day
NASA Technical Reports Server (NTRS)
1975-01-01
The design considerations of the plant are discussed in detail along with management planning, objective schedules, and cost estimates. The processing scheme is aimed at ultimate use of coal as the basic raw material. For back-up, and to provide assurance of a dependable and steady supply of hydrogen, a parallel and redundant facility for gasifying heavy residual oil will be installed. Both the coal and residual oil gasifiers will use the partial oxidation process.
NASA Technical Reports Server (NTRS)
Liu, Kuojuey Ray
1990-01-01
Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in details. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on triangular array and another based on rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-pase operations. Performance issues are also considered.
Integrated optical 3D digital imaging based on DSP scheme
NASA Astrophysics Data System (ADS)
Wang, Xiaodong; Peng, Xiang; Gao, Bruce Z.
2008-03-01
We present a scheme of integrated optical 3-D digital imaging (IO3DI) based on digital signal processor (DSP), which can acquire range images independently without PC support. This scheme is based on a parallel hardware structure with aid of DSP and field programmable gate array (FPGA) to realize 3-D imaging. In this integrated scheme of 3-D imaging, the phase measurement profilometry is adopted. To realize the pipeline processing of the fringe projection, image acquisition and fringe pattern analysis, we present a multi-threads application program that is developed under the environment of DSP/BIOS RTOS (real-time operating system). Since RTOS provides a preemptive kernel and powerful configuration tool, with which we are able to achieve a real-time scheduling and synchronization. To accelerate automatic fringe analysis and phase unwrapping, we make use of the technique of software optimization. The proposed scheme can reach a performance of 39.5 f/s (frames per second), so it may well fit into real-time fringe-pattern analysis and can implement fast 3-D imaging. Experiment results are also presented to show the validity of proposed scheme.
Sixth- and eighth-order Hermite integrator for N-body simulations
NASA Astrophysics Data System (ADS)
Nitadori, Keigo; Makino, Junichiro
2008-10-01
We present sixth- and eighth-order Hermite integrators for astrophysical N-body simulations, which use the derivatives of accelerations up to second-order ( snap) and third-order ( crackle). These schemes do not require previous values for the corrector, and require only one previous value to construct the predictor. Thus, they are fairly easy to implement. The additional cost of the calculation of the higher-order derivatives is not very high. Even for the eighth-order scheme, the number of floating-point operations for force calculation is only about two times larger than that for traditional fourth-order Hermite scheme. The sixth-order scheme is better than the traditional fourth-order scheme for most cases. When the required accuracy is very high, the eighth-order one is the best. These high-order schemes have several practical advantages. For example, they allow a larger number of particles to be integrated in parallel than the fourth-order scheme does, resulting in higher execution efficiency in both general-purpose parallel computers and GRAPE systems.
NASA Astrophysics Data System (ADS)
Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel
2018-01-01
This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
On nonlinear finite element analysis in single-, multi- and parallel-processors
NASA Technical Reports Server (NTRS)
Utku, S.; Melosh, R.; Islam, M.; Salama, M.
1982-01-01
Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.
Essential issues in multiprocessor systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gajski, D.D.; Peir, J.K.
1985-06-01
During the past several years, a great number of proposals have been made with the objective to increase supercomputer performance by an order of magnitude on the basis of a utilization of new computer architectures. The present paper is concerned with a suitable classification scheme for comparing these architectures. It is pointed out that there are basically four schools of thought as to the most important factor for an enhancement of computer performance. According to one school, the development of faster circuits will make it possible to retain present architectures, except, possibly, for a mechanism providing synchronization of parallel processes.more » A second school assigns priority to the optimization and vectorization of compilers, which will detect parallelism and help users to write better parallel programs. A third school believes in the predominant importance of new parallel algorithms, while the fourth school supports new models of computation. The merits of the four approaches are critically evaluated. 50 references.« less
Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun
2015-01-01
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
NASA Technical Reports Server (NTRS)
Wang, P.; Li, P.
1998-01-01
A high-resolution numerical study on parallel systems is reported on three-dimensional, time-dependent, thermal convective flows. A parallel implentation on the finite volume method with a multigrid scheme is discussed, and a parallel visualization systemm is developed on distributed systems for visualizing the flow.
Communication library for run-time visualization of distributed, asynchronous data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rowlan, J.; Wightman, B.T.
1994-04-01
In this paper we present a method for collecting and visualizing data generated by a parallel computational simulation during run time. Data distributed across multiple processes is sent across parallel communication lines to a remote workstation, which sorts and queues the data for visualization. We have implemented our method in a set of tools called PORTAL (for Parallel aRchitecture data-TrAnsfer Library). The tools comprise generic routines for sending data from a parallel program (callable from either C or FORTRAN), a semi-parallel communication scheme currently built upon Unix Sockets, and a real-time connection to the scientific visualization program AVS. Our methodmore » is most valuable when used to examine large datasets that can be efficiently generated and do not need to be stored on disk. The PORTAL source libraries, detailed documentation, and a working example can be obtained by anonymous ftp from info.mcs.anl.gov from the file portal.tar.Z from the directory pub/portal.« less
Cache-Oblivious parallel SIMD Viterbi decoding for sequence search in HMMER
2014-01-01
Background HMMER is a commonly used bioinformatics tool based on Hidden Markov Models (HMMs) to analyze and process biological sequences. One of its main homology engines is based on the Viterbi decoding algorithm, which was already highly parallelized and optimized using Farrar’s striped processing pattern with Intel SSE2 instruction set extension. Results A new SIMD vectorization of the Viterbi decoding algorithm is proposed, based on an SSE2 inter-task parallelization approach similar to the DNA alignment algorithm proposed by Rognes. Besides this alternative vectorization scheme, the proposed implementation also introduces a new partitioning of the Markov model that allows a significantly more efficient exploitation of the cache locality. Such optimization, together with an improved loading of the emission scores, allows the achievement of a constant processing throughput, regardless of the innermost-cache size and of the dimension of the considered model. Conclusions The proposed optimized vectorization of the Viterbi decoding algorithm was extensively evaluated and compared with the HMMER3 decoder to process DNA and protein datasets, proving to be a rather competitive alternative implementation. Being always faster than the already highly optimized ViterbiFilter implementation of HMMER3, the proposed Cache-Oblivious Parallel SIMD Viterbi (COPS) implementation provides a constant throughput and offers a processing speedup as high as two times faster, depending on the model’s size. PMID:24884826
NASA Astrophysics Data System (ADS)
Li, Gen; Tang, Chun-An; Liang, Zheng-Zhao
2017-01-01
Multi-scale high-resolution modeling of rock failure process is a powerful means in modern rock mechanics studies to reveal the complex failure mechanism and to evaluate engineering risks. However, multi-scale continuous modeling of rock, from deformation, damage to failure, has raised high requirements on the design, implementation scheme and computation capacity of the numerical software system. This study is aimed at developing the parallel finite element procedure, a parallel rock failure process analysis (RFPA) simulator that is capable of modeling the whole trans-scale failure process of rock. Based on the statistical meso-damage mechanical method, the RFPA simulator is able to construct heterogeneous rock models with multiple mechanical properties, deal with and represent the trans-scale propagation of cracks, in which the stress and strain fields are solved for the damage evolution analysis of representative volume element by the parallel finite element method (FEM) solver. This paper describes the theoretical basis of the approach and provides the details of the parallel implementation on a Windows - Linux interactive platform. A numerical model is built to test the parallel performance of FEM solver. Numerical simulations are then carried out on a laboratory-scale uniaxial compression test, and field-scale net fracture spacing and engineering-scale rock slope examples, respectively. The simulation results indicate that relatively high speedup and computation efficiency can be achieved by the parallel FEM solver with a reasonable boot process. In laboratory-scale simulation, the well-known physical phenomena, such as the macroscopic fracture pattern and stress-strain responses, can be reproduced. In field-scale simulation, the formation process of net fracture spacing from initiation, propagation to saturation can be revealed completely. In engineering-scale simulation, the whole progressive failure process of the rock slope can be well modeled. It is shown that the parallel FE simulator developed in this study is an efficient tool for modeling the whole trans-scale failure process of rock from meso- to engineering-scale.
Big-BOE: Fusing Spanish Official Gazette with Big Data Technology.
Basanta-Val, Pablo; Sánchez-Fernández, Luis
2018-06-01
The proliferation of new data sources, stemmed from the adoption of open-data schemes, in combination with an increasing computing capacity causes the inception of new type of analytics that process Internet of things with low-cost engines to speed up data processing using parallel computing. In this context, the article presents an initiative, called BIG-Boletín Oficial del Estado (BOE), designed to process the Spanish official government gazette (BOE) with state-of-the-art processing engines, to reduce computation time and to offer additional speed up for big data analysts. The goal of including a big data infrastructure is to be able to process different BOE documents in parallel with specific analytics, to search for several issues in different documents. The application infrastructure processing engine is described from an architectural perspective and from performance, showing evidence on how this type of infrastructure improves the performance of different types of simple analytics as several machines cooperate.
Fast quantum Monte Carlo on a GPU
NASA Astrophysics Data System (ADS)
Lutsyshyn, Y.
2015-02-01
We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.
Comparing the Performance of Two Dynamic Load Distribution Methods
NASA Technical Reports Server (NTRS)
Kale, L. V.
1987-01-01
Parallel processing of symbolic computations on a message-passing multi-processor presents one challenge: To effectively utilize the available processors, the load must be distributed uniformly to all the processors. However, the structure of these computations cannot be predicted in advance. go, static scheduling methods are not applicable. In this paper, we compare the performance of two dynamic, distributed load balancing methods with extensive simulation studies. The two schemes are: the Contracting Within a Neighborhood (CWN) scheme proposed by us, and the Gradient Model proposed by Lin and Keller. We conclude that although simpler, the CWN is significantly more effective at distributing the work than the Gradient model.
NASA Astrophysics Data System (ADS)
Singh, Santosh Kumar; Ghatak Choudhuri, Sumit
2018-05-01
Parallel connection of UPS inverters to enhance power rating is a widely accepted practice. Inter-modular circulating currents appear when multiple inverter modules are connected in parallel to supply variable critical load. Interfacing of modules henceforth requires an intensive design, using proper control strategy. The potentiality of human intuitive Fuzzy Logic (FL) control with imprecise system model is well known and thus can be utilised in parallel-connected UPS systems. Conventional FL controller is computational intensive, especially with higher number of input variables. This paper proposes application of Hierarchical-Fuzzy Logic control for parallel connected Multi-modular inverters system for reduced computational burden on the processor for a given switching frequency. Simulated results in MATLAB environment and experimental verification using Texas TMS320F2812 DSP are included to demonstrate feasibility of the proposed control scheme.
Advanced Numerical Techniques of Performance Evaluation. Volume 1
1990-06-01
system scheduling3thread. The scheduling thread then runs any other ready thread that can be found. A thread can only sleep or switch out on itself...Polychronopoulos and D.J. Kuck. Guided Self- Scheduling : A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Transactions on Computers C...Kuck 1987] C.D. Polychronopoulos and D.J. Kuck. Guided Self- Scheduling : A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Trans. on Comp
Associative Memory In A Phase Conjugate Resonator Cavity Utilizing A Hologram
NASA Astrophysics Data System (ADS)
Owechko, Y.; Marom, E.; Soffer, B. H.; Dunning, G.
1987-01-01
The principle of information retrieval by association has been suggested as a basis for parallel computing and as the process by which human memory functions.1 Various associative processors have been proposed that use electronic or optical means. Optical schemes,2-7 in particular, those based on holographic principles,3,6,7 are well suited to associative processing because of their high parallelism and information throughput. Previous workers8 demonstrated that holographically stored images can be recalled by using relatively complicated reference images but did not utilize nonlinear feedback to reduce the large cross talk that results when multiple objects are stored and a partial or distorted input is used for retrieval. These earlier approaches were limited in their ability to reconstruct the output object faithfully from a partial input.
Scalable isosurface visualization of massive datasets on commodity off-the-shelf clusters
Bajaj, Chandrajit
2009-01-01
Tomographic imaging and computer simulations are increasingly yielding massive datasets. Interactive and exploratory visualizations have rapidly become indispensable tools to study large volumetric imaging and simulation data. Our scalable isosurface visualization framework on commodity off-the-shelf clusters is an end-to-end parallel and progressive platform, from initial data access to the final display. Interactive browsing of extracted isosurfaces is made possible by using parallel isosurface extraction, and rendering in conjunction with a new specialized piece of image compositing hardware called Metabuffer. In this paper, we focus on the back end scalability by introducing a fully parallel and out-of-core isosurface extraction algorithm. It achieves scalability by using both parallel and out-of-core processing and parallel disks. It statically partitions the volume data to parallel disks with a balanced workload spectrum, and builds I/O-optimal external interval trees to minimize the number of I/O operations of loading large data from disk. We also describe an isosurface compression scheme that is efficient for progress extraction, transmission and storage of isosurfaces. PMID:19756231
Quantum supercharger library: hyper-parallelism of the Hartree-Fock method.
Fernandes, Kyle D; Renison, C Alicia; Naidoo, Kevin J
2015-07-05
We present here a set of algorithms that completely rewrites the Hartree-Fock (HF) computations common to many legacy electronic structure packages (such as GAMESS-US, GAMESS-UK, and NWChem) into a massively parallel compute scheme that takes advantage of hardware accelerators such as Graphical Processing Units (GPUs). The HF compute algorithm is core to a library of routines that we name the Quantum Supercharger Library (QSL). We briefly evaluate the QSL's performance and report that it accelerates a HF 6-31G Self-Consistent Field (SCF) computation by up to 20 times for medium sized molecules (such as a buckyball) when compared with mature Central Processing Unit algorithms available in the legacy codes in regular use by researchers. It achieves this acceleration by massive parallelization of the one- and two-electron integrals and optimization of the SCF and Direct Inversion in the Iterative Subspace routines through the use of GPU linear algebra libraries. © 2015 Wiley Periodicals, Inc. © 2015 Wiley Periodicals, Inc.
Particle-in-cell simulations on graphic processing units
NASA Astrophysics Data System (ADS)
Ren, C.; Zhou, X.; Li, J.; Huang, M. C.; Zhao, Y.
2014-10-01
We will show our recent progress in using GPU's to accelerate the PIC code OSIRIS [Fonseca et al. LNCS 2331, 342 (2002)]. The OISRIS parallel structure is retained and the computation-intensive kernels are shipped to GPU's. Algorithms for the kernels are adapted for the GPU, including high-order charge-conserving current deposition schemes with few branching and parallel particle sorting [Kong et al., JCP 230, 1676 (2011)]. These algorithms make efficient use of the GPU shared memory. This work was supported by U.S. Department of Energy under Grant No. DE-FC02-04ER54789 and by NSF under Grant No. PHY-1314734.
A survey of GPU-based acceleration techniques in MRI reconstructions
Wang, Haifeng; Peng, Hanchuan; Chang, Yuchou
2018-01-01
Image reconstruction in magnetic resonance imaging (MRI) clinical applications has become increasingly more complicated. However, diagnostic and treatment require very fast computational procedure. Modern competitive platforms of graphics processing unit (GPU) have been used to make high-performance parallel computations available, and attractive to common consumers for computing massively parallel reconstruction problems at commodity price. GPUs have also become more and more important for reconstruction computations, especially when deep learning starts to be applied into MRI reconstruction. The motivation of this survey is to review the image reconstruction schemes of GPU computing for MRI applications and provide a summary reference for researchers in MRI community. PMID:29675361
A survey of GPU-based acceleration techniques in MRI reconstructions.
Wang, Haifeng; Peng, Hanchuan; Chang, Yuchou; Liang, Dong
2018-03-01
Image reconstruction in magnetic resonance imaging (MRI) clinical applications has become increasingly more complicated. However, diagnostic and treatment require very fast computational procedure. Modern competitive platforms of graphics processing unit (GPU) have been used to make high-performance parallel computations available, and attractive to common consumers for computing massively parallel reconstruction problems at commodity price. GPUs have also become more and more important for reconstruction computations, especially when deep learning starts to be applied into MRI reconstruction. The motivation of this survey is to review the image reconstruction schemes of GPU computing for MRI applications and provide a summary reference for researchers in MRI community.
NASA Technical Reports Server (NTRS)
Long, Junsheng
1994-01-01
This thesis studies a forward recovery strategy using checkpointing and optimistic execution in parallel and distributed systems. The approach uses replicated tasks executing on different processors for forwared recovery and checkpoint comparison for error detection. To reduce overall redundancy, this approach employs a lower static redundancy in the common error-free situation to detect error than the standard N Module Redundancy scheme (NMR) does to mask off errors. For the rare occurrence of an error, this approach uses some extra redundancy for recovery. To reduce the run-time recovery overhead, look-ahead processes are used to advance computation speculatively and a rollback process is used to produce a diagnosis for correct look-ahead processes without rollback of the whole system. Both analytical and experimental evaluation have shown that this strategy can provide a nearly error-free execution time even under faults with a lower average redundancy than NMR.
NASA Astrophysics Data System (ADS)
Sarkar, Debdeep; Srivastava, Kumar Vaibhav
2017-02-01
In this paper, the concept of cross-correlation Green's functions (CGF) is used in conjunction with the finite difference time domain (FDTD) technique for calculation of envelope correlation coefficient (ECC) of any arbitrary MIMO antenna system over wide frequency band. Both frequency-domain (FD) and time-domain (TD) post-processing techniques are proposed for possible application with this FDTD-CGF scheme. The FDTD-CGF time-domain (FDTD-CGF-TD) scheme utilizes time-domain signal processing methods and exhibits significant reduction in ECC computation time as compared to the FDTD-CGF frequency domain (FDTD-CGF-FD) scheme, for high frequency-resolution requirements. The proposed FDTD-CGF based schemes can be applied for accurate and fast prediction of wideband ECC response, instead of the conventional scattering parameter based techniques which have several limitations. Numerical examples of the proposed FDTD-CGF techniques are provided for two-element MIMO systems involving thin-wire half-wavelength dipoles in parallel side-by-side as well as orthogonal arrangements. The results obtained from the FDTD-CGF techniques are compared with results from commercial electromagnetic solver Ansys HFSS, to verify the validity of proposed approach.
On the equivalence of Gaussian elimination and Gauss-Jordan reduction in solving linear equations
NASA Technical Reports Server (NTRS)
Tsao, Nai-Kuan
1989-01-01
A novel general approach to round-off error analysis using the error complexity concepts is described. This is applied to the analysis of the Gaussian Elimination and Gauss-Jordan scheme for solving linear equations. The results show that the two algorithms are equivalent in terms of our error complexity measures. Thus the inherently parallel Gauss-Jordan scheme can be implemented with confidence if parallel computers are available.
Optical measurement of sound using time-varying laser speckle patterns
NASA Astrophysics Data System (ADS)
Leung, Terence S.; Jiang, Shihong; Hebden, Jeremy
2011-02-01
In this work, we introduce an optical technique to measure sound. The technique involves pointing a coherent pulsed laser beam on the surface of the measurement site and capturing the time-varying speckle patterns using a CCD camera. Sound manifests itself as vibrations on the surface which induce a periodic translation of the speckle pattern over time. Using a parallel speckle detection scheme, the dynamics of the time-varying speckle patterns can be captured and processed to produce spectral information of the sound. One potential clinical application is to measure pathological sounds from the brain as a screening test. We performed experiments to demonstrate the principle of the detection scheme using head phantoms. The results show that the detection scheme can measure the spectra of single frequency sounds between 100 and 2000 Hz. The detection scheme worked equally well in both a flat geometry and an anatomical head geometry. However, the current detection scheme is too slow for use in living biological tissues which has a decorrelation time of a few milliseconds. Further improvements have been suggested.
A conservative scheme for electromagnetic simulation of magnetized plasmas with kinetic electrons
NASA Astrophysics Data System (ADS)
Bao, J.; Lin, Z.; Lu, Z. X.
2018-02-01
A conservative scheme has been formulated and verified for gyrokinetic particle simulations of electromagnetic waves and instabilities in magnetized plasmas. An electron continuity equation derived from the drift kinetic equation is used to time advance the electron density perturbation by using the perturbed mechanical flow calculated from the parallel vector potential, and the parallel vector potential is solved by using the perturbed canonical flow from the perturbed distribution function. In gyrokinetic particle simulations using this new scheme, the shear Alfvén wave dispersion relation in the shearless slab and continuum damping in the sheared cylinder have been recovered. The new scheme overcomes the stringent requirement in the conventional perturbative simulation method that perpendicular grid size needs to be as small as electron collisionless skin depth even for the long wavelength Alfvén waves. The new scheme also avoids the problem in the conventional method that an unphysically large parallel electric field arises due to the inconsistency between electrostatic potential calculated from the perturbed density and vector potential calculated from the perturbed canonical flow. Finally, the gyrokinetic particle simulations of the Alfvén waves in sheared cylinder have superior numerical properties compared with the fluid simulations, which suffer from numerical difficulties associated with singular mode structures.
Weak beacon detection for air-to-ground optical wireless link establishment.
Han, Yaoqiang; Dang, Anhong; Tang, Junxiong; Guo, Hong
2010-02-01
In an air-to-ground free-space optical communication system, strong background interference seriously affects the beacon detection, which makes it difficult to establish the optical link. In this paper, we propose a correlation beacon detection scheme under strong background interference conditions. As opposed to traditional beacon detection schemes, the beacon is modulated by an m-sequence at the transmitting terminal with a digital differential matched filter (DDMF) array introduced at the receiving end to detect the modulated beacon. This scheme is capable of suppressing both strong interference and noise by correlation reception of the received image sequence. In addition, the DDMF array enables each pixel of the image sensor to have its own DDMF of the same structure to process its received image sequence in parallel, thus it makes fast beacon detection possible. Theoretical analysis and an outdoor experiment have been demonstrated and show that the proposed scheme can realize fast and effective beacon detection under strong background interference conditions. Consequently, the required beacon transmission power can also be reduced dramatically.
Adaptive independent joint control of manipulators - Theory and experiment
NASA Technical Reports Server (NTRS)
Seraji, H.
1988-01-01
The author presents a simple decentralized adaptive control scheme for multijoint robot manipulators based on the independent joint control concept. The proposed control scheme for each joint consists of a PID (proportional integral and differential) feedback controller and a position-velocity-acceleration feedforward controller, both with adjustable gains. The static and dynamic couplings that exist between the joint motions are compensated by the adaptive independent joint controllers while ensuring trajectory tracking. The proposed scheme is implemented on a MicroVAX II computer for motion control of the first three joints of a PUMA 560 arm. Experimental results are presented to demonstrate that trajectory tracking is achieved despite strongly coupled, highly nonlinear joint dynamics. The results confirm that the proposed decentralized adaptive control of manipulators is feasible, in spite of strong interactions between joint motions. The control scheme presented is computationally very fast and is amenable to parallel processing implementation within a distributed computing architecture, where each joint is controlled independently by a simple algorithm on a dedicated microprocessor.
NASA Technical Reports Server (NTRS)
Schunk, Richard Gregory; Chung, T. J.
2001-01-01
A parallelized version of the Flowfield Dependent Variation (FDV) Method is developed to analyze a problem of current research interest, the flowfield resulting from a triple shock/boundary layer interaction. Such flowfields are often encountered in the inlets of high speed air-breathing vehicles including the NASA Hyper-X research vehicle. In order to resolve the complex shock structure and to provide adequate resolution for boundary layer computations of the convective heat transfer from surfaces inside the inlet, models containing over 500,000 nodes are needed. Efficient parallelization of the computation is essential to achieving results in a timely manner. Results from a parallelization scheme, based upon multi-threading, as implemented on multiple processor supercomputers and workstations is presented.
CFD Analysis and Design Optimization Using Parallel Computers
NASA Technical Reports Server (NTRS)
Martinelli, Luigi; Alonso, Juan Jose; Jameson, Antony; Reuther, James
1997-01-01
A versatile and efficient multi-block method is presented for the simulation of both steady and unsteady flow, as well as aerodynamic design optimization of complete aircraft configurations. The compressible Euler and Reynolds Averaged Navier-Stokes (RANS) equations are discretized using a high resolution scheme on body-fitted structured meshes. An efficient multigrid implicit scheme is implemented for time-accurate flow calculations. Optimum aerodynamic shape design is achieved at very low cost using an adjoint formulation. The method is implemented on parallel computing systems using the MPI message passing interface standard to ensure portability. The results demonstrate that, by combining highly efficient algorithms with parallel computing, it is possible to perform detailed steady and unsteady analysis as well as automatic design for complex configurations using the present generation of parallel computers.
Parallel grid generation algorithm for distributed memory computers
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Moitra, Anutosh
1994-01-01
A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
NASA Astrophysics Data System (ADS)
Huang, Melin; Huang, Bormin; Huang, Allen H.-L.
2015-10-01
The schemes of cumulus parameterization are responsible for the sub-grid-scale effects of convective and/or shallow clouds, and intended to represent vertical fluxes due to unresolved updrafts and downdrafts and compensating motion outside the clouds. Some schemes additionally provide cloud and precipitation field tendencies in the convective column, and momentum tendencies due to convective transport of momentum. The schemes all provide the convective component of surface rainfall. Betts-Miller-Janjic (BMJ) is one scheme to fulfill such purposes in the weather research and forecast (WRF) model. National Centers for Environmental Prediction (NCEP) has tried to optimize the BMJ scheme for operational application. As there are no interactions among horizontal grid points, this scheme is very suitable for parallel computation. With the advantage of Intel Xeon Phi Many Integrated Core (MIC) architecture, efficient parallelization and vectorization essentials, it allows us to optimize the BMJ scheme. If compared to the original code respectively running on one CPU socket (eight cores) and on one CPU core with Intel Xeon E5-2670, the MIC-based optimization of this scheme running on Xeon Phi coprocessor 7120P improves the performance by 2.4x and 17.0x, respectively.
A survey of packages for large linear systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Kesheng; Milne, Brent
2000-02-11
This paper evaluates portable software packages for the iterative solution of very large sparse linear systems on parallel architectures. While we cannot hope to tell individual users which package will best suit their needs, we do hope that our systematic evaluation provides essential unbiased information about the packages and the evaluation process may serve as an example on how to evaluate these packages. The information contained here include feature comparisons, usability evaluations and performance characterizations. This review is primarily focused on self-contained packages that can be easily integrated into an existing program and are capable of computing solutions to verymore » large sparse linear systems of equations. More specifically, it concentrates on portable parallel linear system solution packages that provide iterative solution schemes and related preconditioning schemes because iterative methods are more frequently used than competing schemes such as direct methods. The eight packages evaluated are: Aztec, BlockSolve,ISIS++, LINSOL, P-SPARSLIB, PARASOL, PETSc, and PINEAPL. Among the eight portable parallel iterative linear system solvers reviewed, we recommend PETSc and Aztec for most application programmers because they have well designed user interface, extensive documentation and very responsive user support. Both PETSc and Aztec are written in the C language and are callable from Fortran. For those users interested in using Fortran 90, PARASOL is a good alternative. ISIS++is a good alternative for those who prefer the C++ language. Both PARASOL and ISIS++ are relatively new and are continuously evolving. Thus their user interface may change. In general, those packages written in Fortran 77 are more cumbersome to use because the user may need to directly deal with a number of arrays of varying sizes. Languages like C++ and Fortran 90 offer more convenient data encapsulation mechanisms which make it easier to implement a clean and intuitive user interface. In addition to reviewing these portable parallel iterative solver packages, we also provide a more cursory assessment of a range of related packages, from specialized parallel preconditioners to direct methods for sparse linear systems.« less
Quantum image pseudocolor coding based on the density-stratified method
NASA Astrophysics Data System (ADS)
Jiang, Nan; Wu, Wenya; Wang, Luo; Zhao, Na
2015-05-01
Pseudocolor processing is a branch of image enhancement. It dyes grayscale images to color images to make the images more beautiful or to highlight some parts on the images. This paper proposes a quantum image pseudocolor coding scheme based on the density-stratified method which defines a colormap and changes the density value from gray to color parallel according to the colormap. Firstly, two data structures: quantum image GQIR and quantum colormap QCR are reviewed or proposed. Then, the quantum density-stratified algorithm is presented. Based on them, the quantum realization in the form of circuits is given. The main advantages of the quantum version for pseudocolor processing over the classical approach are that it needs less memory and can speed up the computation. Two kinds of examples help us to describe the scheme further. Finally, the future work are analyzed.
Andrade, Xavier; Aspuru-Guzik, Alán
2013-10-08
We discuss the application of graphical processing units (GPUs) to accelerate real-space density functional theory (DFT) calculations. To make our implementation efficient, we have developed a scheme to expose the data parallelism available in the DFT approach; this is applied to the different procedures required for a real-space DFT calculation. We present results for current-generation GPUs from AMD and Nvidia, which show that our scheme, implemented in the free code Octopus, can reach a sustained performance of up to 90 GFlops for a single GPU, representing a significant speed-up when compared to the CPU version of the code. Moreover, for some systems, our implementation can outperform a GPU Gaussian basis set code, showing that the real-space approach is a competitive alternative for DFT simulations on GPUs.
NASA Astrophysics Data System (ADS)
Zerr, Robert Joseph
2011-12-01
The integral transport matrix method (ITMM) has been used as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells and between the cells and boundary surfaces. The main goals of this work were to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and performance of the developed methods for increasing number of processes. This project compares the effectiveness of the ITMM with the SI scheme parallelized with the Koch-Baker-Alcouffe (KBA) method. The primary parallel solution method involves a decomposition of the domain into smaller spatial sub-domains, each with their own transport matrices, and coupled together via interface boundary angular fluxes. Each sub-domain has its own set of ITMM operators and represents an independent transport problem. Multiple iterative parallel solution methods have investigated, including parallel block Jacobi (PBJ), parallel red/black Gauss-Seidel (PGS), and parallel GMRES (PGMRES). The fastest observed parallel solution method, PGS, was used in a weak scaling comparison with the PARTISN code. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method without acceleration/preconditioning is not competitive for any problem parameters considered. The best comparisons occur for problems that are difficult for SI DSA, namely highly scattering and optically thick. SI DSA execution time curves are generally steeper than the PGS ones. However, until further testing is performed it cannot be concluded that SI DSA does not outperform the ITMM with PGS even on several thousand or tens of thousands of processors. The PGS method does outperform SI DSA for the periodic heterogeneous layers (PHL) configuration problems. Although this demonstrates a relative strength/weakness between the two methods, the practicality of these problems is much less, further limiting instances where it would be beneficial to select ITMM over SI DSA. The results strongly indicate a need for a robust, stable, and efficient acceleration method (or preconditioner for PGMRES). The spatial multigrid (SMG) method is currently incomplete in that it does not work for all cases considered and does not effectively improve the convergence rate for all values of scattering ratio c or cell dimension h. Nevertheless, it does display the desired trend for highly scattering, optically thin problems. That is, it tends to lower the rate of growth of number of iterations with increasing number of processes, P, while not increasing the number of additional operations per iteration to the extent that the total execution time of the rapidly converging accelerated iterations exceeds that of the slower unaccelerated iterations. A predictive parallel performance model has been developed for the PBJ method. Timing tests were performed such that trend lines could be fitted to the data for the different components and used to estimate the execution times. Applied to the weak scaling results, the model notably underestimates construction time, but combined with a slight overestimation in iterative solution time, the model predicts total execution time very well for large P. It also does a decent job with the strong scaling results, closely predicting the construction time and time per iteration, especially as P increases. Although not shown to be competitive up to 1,024 processing elements with the current state of the art, the parallelized ITMM exhibits promising scaling trends. Ultimately, compared to the KBA method, the parallelized ITMM may be found to be a very attractive option for transport calculations spatially decomposed over several tens of thousands of processes. Acceleration/preconditioning of the parallelized ITMM once developed will improve the convergence rate and improve its competitiveness. (Abstract shortened by UMI.)
PRATHAM: Parallel Thermal Hydraulics Simulations using Advanced Mesoscopic Methods
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joshi, Abhijit S; Jain, Prashant K; Mudrich, Jaime A
2012-01-01
At the Oak Ridge National Laboratory, efforts are under way to develop a 3D, parallel LBM code called PRATHAM (PaRAllel Thermal Hydraulic simulations using Advanced Mesoscopic Methods) to demonstrate the accuracy and scalability of LBM for turbulent flow simulations in nuclear applications. The code has been developed using FORTRAN-90, and parallelized using the message passing interface MPI library. Silo library is used to compact and write the data files, and VisIt visualization software is used to post-process the simulation data in parallel. Both the single relaxation time (SRT) and multi relaxation time (MRT) LBM schemes have been implemented in PRATHAM.more » To capture turbulence without prohibitively increasing the grid resolution requirements, an LES approach [5] is adopted allowing large scale eddies to be numerically resolved while modeling the smaller (subgrid) eddies. In this work, a Smagorinsky model has been used, which modifies the fluid viscosity by an additional eddy viscosity depending on the magnitude of the rate-of-strain tensor. In LBM, this is achieved by locally varying the relaxation time of the fluid.« less
Optimizing the Performance of Reactive Molecular Dynamics Simulations for Multi-core Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aktulga, Hasan Metin; Coffman, Paul; Shan, Tzu-Ray
2015-12-01
Hybrid parallelism allows high performance computing applications to better leverage the increasing on-node parallelism of modern supercomputers. In this paper, we present a hybrid parallel implementation of the widely used LAMMPS/ReaxC package, where the construction of bonded and nonbonded lists and evaluation of complex ReaxFF interactions are implemented efficiently using OpenMP parallelism. Additionally, the performance of the QEq charge equilibration scheme is examined and a dual-solver is implemented. We present the performance of the resulting ReaxC-OMP package on a state-of-the-art multi-core architecture Mira, an IBM BlueGene/Q supercomputer. For system sizes ranging from 32 thousand to 16.6 million particles, speedups inmore » the range of 1.5-4.5x are observed using the new ReaxC-OMP software. Sustained performance improvements have been observed for up to 262,144 cores (1,048,576 processes) of Mira with a weak scaling efficiency of 91.5% in larger simulations containing 16.6 million particles.« less
Parallel-Vector Algorithm For Rapid Structural Anlysis
NASA Technical Reports Server (NTRS)
Agarwal, Tarun R.; Nguyen, Duc T.; Storaasli, Olaf O.
1993-01-01
New algorithm developed to overcome deficiency of skyline storage scheme by use of variable-band storage scheme. Exploits both parallel and vector capabilities of modern high-performance computers. Gives engineers and designers opportunity to include more design variables and constraints during optimization of structures. Enables use of more refined finite-element meshes to obtain improved understanding of complex behaviors of aerospace structures leading to better, safer designs. Not only attractive for current supercomputers but also for next generation of shared-memory supercomputers.
Low Temperature Performance of High-Speed Neural Network Circuits
NASA Technical Reports Server (NTRS)
Duong, T.; Tran, M.; Daud, T.; Thakoor, A.
1995-01-01
Artificial neural networks, derived from their biological counterparts, offer a new and enabling computing paradigm specially suitable for such tasks as image and signal processing with feature classification/object recognition, global optimization, and adaptive control. When implemented in fully parallel electronic hardware, it offers orders of magnitude speed advantage. Basic building blocks of the new architecture are the processing elements called neurons implemented as nonlinear operational amplifiers with sigmoidal transfer function, interconnected through weighted connections called synapses implemented using circuitry for weight storage and multiply functions either in an analog, digital, or hybrid scheme.
Efficient parallelization of analytic bond-order potentials for large-scale atomistic simulations
NASA Astrophysics Data System (ADS)
Teijeiro, C.; Hammerschmidt, T.; Drautz, R.; Sutmann, G.
2016-07-01
Analytic bond-order potentials (BOPs) provide a way to compute atomistic properties with controllable accuracy. For large-scale computations of heterogeneous compounds at the atomistic level, both the computational efficiency and memory demand of BOP implementations have to be optimized. Since the evaluation of BOPs is a local operation within a finite environment, the parallelization concepts known from short-range interacting particle simulations can be applied to improve the performance of these simulations. In this work, several efficient parallelization methods for BOPs that use three-dimensional domain decomposition schemes are described. The schemes are implemented into the bond-order potential code BOPfox, and their performance is measured in a series of benchmarks. Systems of up to several millions of atoms are simulated on a high performance computing system, and parallel scaling is demonstrated for up to thousands of processors.
Discontinuous Galerkin Finite Element Method for Parabolic Problems
NASA Technical Reports Server (NTRS)
Kaneko, Hideaki; Bey, Kim S.; Hou, Gene J. W.
2004-01-01
In this paper, we develop a time and its corresponding spatial discretization scheme, based upon the assumption of a certain weak singularity of parallel ut(t) parallel Lz(omega) = parallel ut parallel2, for the discontinuous Galerkin finite element method for one-dimensional parabolic problems. Optimal convergence rates in both time and spatial variables are obtained. A discussion of automatic time-step control method is also included.
NASA Astrophysics Data System (ADS)
Bai, Guang-Fu; Hu, Lin; Jiang, Yang; Tian, Jing; Zi, Yue-Jiao; Wu, Ting-Wei; Huang, Feng-Qin
2017-08-01
In this paper, a photonic microwave waveform generator based on a dual-parallel Mach-Zehnder modulator is proposed and experimentally demonstrated. In this reported scheme, only one radio frequency signal is used to drive the dual-parallel Mach-Zehnder modulator. Meanwhile, dispersive elements or filters are not required in the proposed scheme, which make the scheme simpler and more stable. In this way, six variables can be adjusted. Through the different combinations of these variables, basic waveforms with full duty and small duty cycle can be generated. Tunability of the generator can be achieved by adjusting the frequency of the RF signal and the optical carrier. The corresponding theoretical analysis and simulation have been conducted. With guidance of theory and simulation, proof-of-concept experiments are carried out. The basic waveforms, including Gaussian, saw-up, and saw-down waveforms, with full duty and small duty cycle are generated at the repetition rate of 2 GHz. The theoretical and simulation results agree with the experimental results very well.
Real-time multi-mode neutron multiplicity counter
Rowland, Mark S; Alvarez, Raymond A
2013-02-26
Embodiments are directed to a digital data acquisition method that collects data regarding nuclear fission at high rates and performs real-time preprocessing of large volumes of data into directly useable forms for use in a system that performs non-destructive assaying of nuclear material and assemblies for mass and multiplication of special nuclear material (SNM). Pulses from a multi-detector array are fed in parallel to individual inputs that are tied to individual bits in a digital word. Data is collected by loading a word at the individual bit level in parallel, to reduce the latency associated with current shift-register systems. The word is read at regular intervals, all bits simultaneously, with no manipulation. The word is passed to a number of storage locations for subsequent processing, thereby removing the front-end problem of pulse pileup. The word is used simultaneously in several internal processing schemes that assemble the data in a number of more directly useable forms. The detector includes a multi-mode counter that executes a number of different count algorithms in parallel to determine different attributes of the count data.
Fuzzy Matching Based on Gray-scale Difference for Quantum Images
NASA Astrophysics Data System (ADS)
Luo, GaoFeng; Zhou, Ri-Gui; Liu, XingAo; Hu, WenWen; Luo, Jia
2018-05-01
Quantum image processing has recently emerged as an essential problem in practical tasks, e.g. real-time image matching. Previous studies have shown that the superposition and entanglement of quantum can greatly improve the efficiency of complex image processing. In this paper, a fuzzy quantum image matching scheme based on gray-scale difference is proposed to find out the target region in a reference image, which is very similar to the template image. Firstly, we employ the proposed enhanced quantum representation (NEQR) to store digital images. Then some certain quantum operations are used to evaluate the gray-scale difference between two quantum images by thresholding. If all of the obtained gray-scale differences are not greater than the threshold value, it indicates a successful fuzzy matching of quantum images. Theoretical analysis and experiments show that the proposed scheme performs fuzzy matching at a low cost and also enables exponentially significant speedup via quantum parallel computation.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhao, Luning; Neuscamman, Eric
We present a modification to variational Monte Carlo’s linear method optimization scheme that addresses a critical memory bottleneck while maintaining compatibility with both the traditional ground state variational principle and our recently-introduced variational principle for excited states. For wave function ansatzes with tens of thousands of variables, our modification reduces the required memory per parallel process from tens of gigabytes to hundreds of megabytes, making the methodology a much better fit for modern supercomputer architectures in which data communication and per-process memory consumption are primary concerns. We verify the efficacy of the new optimization scheme in small molecule tests involvingmore » both the Hilbert space Jastrow antisymmetric geminal power ansatz and real space multi-Slater Jastrow expansions. Satisfied with its performance, we have added the optimizer to the QMCPACK software package, with which we demonstrate on a hydrogen ring a prototype approach for making systematically convergent, non-perturbative predictions of Mott-insulators’ optical band gaps.« less
NASA Astrophysics Data System (ADS)
Ji, X.; Shen, C.
2017-12-01
Flood inundation presents substantial societal hazards and also changes biogeochemistry for systems like the Amazon. It is often expensive to simulate high-resolution flood inundation and propagation in a long-term watershed-scale model. Due to the Courant-Friedrichs-Lewy (CFL) restriction, high resolution and large local flow velocity both demand prohibitively small time steps even for parallel codes. Here we develop a parallel surface-subsurface process-based model enhanced by multi-resolution meshes that are adaptively switched on or off. The high-resolution overland flow meshes are enabled only when the flood wave invades to floodplains. This model applies semi-implicit, semi-Lagrangian (SISL) scheme in solving dynamic wave equations, and with the assistant of the multi-mesh method, it also adaptively chooses the dynamic wave equation only in the area of deep inundation. Therefore, the model achieves a balance between accuracy and computational cost.
A method for real-time implementation of HOG feature extraction
NASA Astrophysics Data System (ADS)
Luo, Hai-bo; Yu, Xin-rong; Liu, Hong-mei; Ding, Qing-hai
2011-08-01
Histogram of oriented gradient (HOG) is an efficient feature extraction scheme, and HOG descriptors are feature descriptors which is widely used in computer vision and image processing for the purpose of biometrics, target tracking, automatic target detection(ATD) and automatic target recognition(ATR) etc. However, computation of HOG feature extraction is unsuitable for hardware implementation since it includes complicated operations. In this paper, the optimal design method and theory frame for real-time HOG feature extraction based on FPGA were proposed. The main principle is as follows: firstly, the parallel gradient computing unit circuit based on parallel pipeline structure was designed. Secondly, the calculation of arctangent and square root operation was simplified. Finally, a histogram generator based on parallel pipeline structure was designed to calculate the histogram of each sub-region. Experimental results showed that the HOG extraction can be implemented in a pixel period by these computing units.
Parallelization of elliptic solver for solving 1D Boussinesq model
NASA Astrophysics Data System (ADS)
Tarwidi, D.; Adytia, D.
2018-03-01
In this paper, a parallel implementation of an elliptic solver in solving 1D Boussinesq model is presented. Numerical solution of Boussinesq model is obtained by implementing a staggered grid scheme to continuity, momentum, and elliptic equation of Boussinesq model. Tridiagonal system emerging from numerical scheme of elliptic equation is solved by cyclic reduction algorithm. The parallel implementation of cyclic reduction is executed on multicore processors with shared memory architectures using OpenMP. To measure the performance of parallel program, large number of grids is varied from 28 to 214. Two test cases of numerical experiment, i.e. propagation of solitary and standing wave, are proposed to evaluate the parallel program. The numerical results are verified with analytical solution of solitary and standing wave. The best speedup of solitary and standing wave test cases is about 2.07 with 214 of grids and 1.86 with 213 of grids, respectively, which are executed by using 8 threads. Moreover, the best efficiency of parallel program is 76.2% and 73.5% for solitary and standing wave test cases, respectively.
The implementation of an aeronautical CFD flow code onto distributed memory parallel systems
NASA Astrophysics Data System (ADS)
Ierotheou, C. S.; Forsey, C. R.; Leatham, M.
2000-04-01
The parallelization of an industrially important in-house computational fluid dynamics (CFD) code for calculating the airflow over complex aircraft configurations using the Euler or Navier-Stokes equations is presented. The code discussed is the flow solver module of the SAUNA CFD suite. This suite uses a novel grid system that may include block-structured hexahedral or pyramidal grids, unstructured tetrahedral grids or a hybrid combination of both. To assist in the rapid convergence to a solution, a number of convergence acceleration techniques are employed including implicit residual smoothing and a multigrid full approximation storage scheme (FAS). Key features of the parallelization approach are the use of domain decomposition and encapsulated message passing to enable the execution in parallel using a single programme multiple data (SPMD) paradigm. In the case where a hybrid grid is used, a unified grid partitioning scheme is employed to define the decomposition of the mesh. The parallel code has been tested using both structured and hybrid grids on a number of different distributed memory parallel systems and is now routinely used to perform industrial scale aeronautical simulations. Copyright
2014-02-01
idle waiting for the wavefront to reach it. To overcome this, Reeve et al. (2001) 3 developed a scheme in analogy to the red-black Gauss - Seidel iterative ...understandable procedure calls. Parallelization of the SIMPLE iterative scheme with SIP used a red-black scheme similar to the red-black Gauss - Seidel ...scheme, the SIMPLE method, for pressure-velocity coupling. The result is a slowing convergence of the outer iterations . The red-black scheme excites a 2
GRAVIDY, a GPU modular, parallel direct-summation N-body integrator: dynamics with softening
NASA Astrophysics Data System (ADS)
Maureira-Fredes, Cristián; Amaro-Seoane, Pau
2018-01-01
A wide variety of outstanding problems in astrophysics involve the motion of a large number of particles under the force of gravity. These include the global evolution of globular clusters, tidal disruptions of stars by a massive black hole, the formation of protoplanets and sources of gravitational radiation. The direct-summation of N gravitational forces is a complex problem with no analytical solution and can only be tackled with approximations and numerical methods. To this end, the Hermite scheme is a widely used integration method. With different numerical techniques and special-purpose hardware, it can be used to speed up the calculations. But these methods tend to be computationally slow and cumbersome to work with. We present a new graphics processing unit (GPU), direct-summation N-body integrator written from scratch and based on this scheme, which includes relativistic corrections for sources of gravitational radiation. GRAVIDY has high modularity, allowing users to readily introduce new physics, it exploits available computational resources and will be maintained by regular updates. GRAVIDY can be used in parallel on multiple CPUs and GPUs, with a considerable speed-up benefit. The single-GPU version is between one and two orders of magnitude faster than the single-CPU version. A test run using four GPUs in parallel shows a speed-up factor of about 3 as compared to the single-GPU version. The conception and design of this first release is aimed at users with access to traditional parallel CPU clusters or computational nodes with one or a few GPU cards.
NASA Astrophysics Data System (ADS)
Rybakin, B.; Bogatencov, P.; Secrieru, G.; Iliuha, N.
2013-10-01
The paper deals with a parallel algorithm for calculations on multiprocessor computers and GPU accelerators. The calculations of shock waves interaction with low-density bubble results and the problem of the gas flow with the forces of gravity are presented. This algorithm combines a possibility to capture a high resolution of shock waves, the second-order accuracy for TVD schemes, and a possibility to observe a low-level diffusion of the advection scheme. Many complex problems of continuum mechanics are numerically solved on structured or unstructured grids. To improve the accuracy of the calculations is necessary to choose a sufficiently small grid (with a small cell size). This leads to the drawback of a substantial increase of computation time. Therefore, for the calculations of complex problems it is reasonable to use the method of Adaptive Mesh Refinement. That is, the grid refinement is performed only in the areas of interest of the structure, where, e.g., the shock waves are generated, or a complex geometry or other such features exist. Thus, the computing time is greatly reduced. In addition, the execution of the application on the resulting sequence of nested, decreasing nets can be parallelized. Proposed algorithm is based on the AMR method. Utilization of AMR method can significantly improve the resolution of the difference grid in areas of high interest, and from other side to accelerate the processes of the multi-dimensional problems calculating. Parallel algorithms of the analyzed difference models realized for the purpose of calculations on graphic processors using the CUDA technology [1].
Parallel Dynamics Simulation Using a Krylov-Schwarz Linear Solution Scheme
Abhyankar, Shrirang; Constantinescu, Emil M.; Smith, Barry F.; ...
2016-11-07
Fast dynamics simulation of large-scale power systems is a computational challenge because of the need to solve a large set of stiff, nonlinear differential-algebraic equations at every time step. The main bottleneck in dynamic simulations is the solution of a linear system during each nonlinear iteration of Newton’s method. In this paper, we present a parallel Krylov- Schwarz linear solution scheme that uses the Krylov subspacebased iterative linear solver GMRES with an overlapping restricted additive Schwarz preconditioner. As a result, performance tests of the proposed Krylov-Schwarz scheme for several large test cases ranging from 2,000 to 20,000 buses, including amore » real utility network, show good scalability on different computing architectures.« less
Parallel Dynamics Simulation Using a Krylov-Schwarz Linear Solution Scheme
DOE Office of Scientific and Technical Information (OSTI.GOV)
Abhyankar, Shrirang; Constantinescu, Emil M.; Smith, Barry F.
Fast dynamics simulation of large-scale power systems is a computational challenge because of the need to solve a large set of stiff, nonlinear differential-algebraic equations at every time step. The main bottleneck in dynamic simulations is the solution of a linear system during each nonlinear iteration of Newton’s method. In this paper, we present a parallel Krylov- Schwarz linear solution scheme that uses the Krylov subspacebased iterative linear solver GMRES with an overlapping restricted additive Schwarz preconditioner. As a result, performance tests of the proposed Krylov-Schwarz scheme for several large test cases ranging from 2,000 to 20,000 buses, including amore » real utility network, show good scalability on different computing architectures.« less
NASA Technical Reports Server (NTRS)
Mccormick, S.; Quinlan, D.
1989-01-01
The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids (global and local) to provide adaptive resolution and fast solution of PDEs. Like all such methods, it offers parallelism by using possibly many disconnected patches per level, but is hindered by the need to handle these levels sequentially. The finest levels must therefore wait for processing to be essentially completed on all the coarser ones. A recently developed asynchronous version of FAC, called AFAC, completely eliminates this bottleneck to parallelism. This paper describes timing results for AFAC, coupled with a simple load balancing scheme, applied to the solution of elliptic PDEs on an Intel iPSC hypercube. These tests include performance of certain processes necessary in adaptive methods, including moving grids and changing refinement. A companion paper reports on numerical and analytical results for estimating convergence factors of AFAC applied to very large scale examples.
NASA Astrophysics Data System (ADS)
Pinsker, R. I.
2014-10-01
In hot magnetized plasmas, two types of linear collisionless absorption processes are used to heat and drive noninductive current: absorption at ion or electron cyclotron resonances and their harmonics, and absorption by Landau damping and the transit-time-magnetic-pumping (TTMP) interactions. This tutorial discusses the latter process, i.e., parallel interactions between rf waves and electrons in which cyclotron resonance is not involved. Electron damping by the parallel interactions can be important in the ICRF, particularly in the higher harmonic region where competing ion cyclotron damping is weak, as well as in the Lower Hybrid Range of Frequencies (LHRF), which is in the neighborhood of the geometric mean of the ion and electron cyclotron frequencies. On the other hand, absorption by parallel processes is not significant in conventional ECRF schemes. Parallel interactions are especially important for the realization of high current drive efficiency with rf waves, and an application of particular recent interest is current drive with the whistler or helicon wave at high to very high (i.e., the LHRF) ion cyclotron harmonics. The scaling of absorption by parallel interactions with wave frequency is examined and the advantages and disadvantages of fast (helicons/whistlers) and slow (lower hybrid) waves in the LHRF in the context of reactor-grade tokamak plasmas are compared. In this frequency range, both wave modes can propagate in a significant fraction of the discharge volume; the ways in which the two waves can interact with each other are considered. The use of parallel interactions to heat and drive current in practice will be illustrated with examples from past experiments; also looking forward, this tutorial will provide an overview of potential applications in tokamak reactors. Supported by the US Department of Energy under DE-FC02-04ER54698.
NASA Astrophysics Data System (ADS)
Watanabe, Shuji; Takano, Hiroshi; Fukuda, Hiroya; Hiraki, Eiji; Nakaoka, Mutsuo
This paper deals with a digital control scheme of multiple paralleled high frequency switching current amplifier with four-quadrant chopper for generating gradient magnetic fields in MRI (Magnetic Resonance Imaging) systems. In order to track high precise current pattern in Gradient Coils (GC), the proposal current amplifier cancels the switching current ripples in GC with each other and designed optimum switching gate pulse patterns without influences of the large filter current ripple amplitude. The optimal control implementation and the linear control theory in GC current amplifiers have affinity to each other with excellent characteristics. The digital control system can be realized easily through the digital control implementation, DSPs or microprocessors. Multiple-parallel operational microprocessors realize two or higher paralleled GC current pattern tracking amplifier with optimal control design and excellent results are given for improving the image quality of MRI systems.
Optical vector network analyzer based on double-sideband modulation.
Jun, Wen; Wang, Ling; Yang, Chengwu; Li, Ming; Zhu, Ning Hua; Guo, Jinjin; Xiong, Liangming; Li, Wei
2017-11-01
We report an optical vector network analyzer (OVNA) based on double-sideband (DSB) modulation using a dual-parallel Mach-Zehnder modulator. The device under test (DUT) is measured twice with different modulation schemes. By post-processing the measurement results, the response of the DUT can be obtained accurately. Since DSB modulation is used in our approach, the measurement range is doubled compared with conventional single-sideband (SSB) modulation-based OVNA. Moreover, the measurement accuracy is improved by eliminating the even-order sidebands. The key advantage of the proposed scheme is that the measurement of a DUT with bandpass response can also be simply realized, which is a big challenge for the SSB-based OVNA. The proposed method is theoretically and experimentally demonstrated.
GPU accelerated FDTD solver and its application in MRI.
Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S
2010-01-01
The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B(1) profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of electromagnetic field compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.
Single product lot-sizing on unrelated parallel machines with non-decreasing processing times
NASA Astrophysics Data System (ADS)
Eremeev, A.; Kovalyov, M.; Kuznetsov, P.
2018-01-01
We consider a problem in which at least a given quantity of a single product has to be partitioned into lots, and lots have to be assigned to unrelated parallel machines for processing. In one version of the problem, the maximum machine completion time should be minimized, in another version of the problem, the sum of machine completion times is to be minimized. Machine-dependent lower and upper bounds on the lot size are given. The product is either assumed to be continuously divisible or discrete. The processing time of each machine is defined by an increasing function of the lot volume, given as an oracle. Setup times and costs are assumed to be negligibly small, and therefore, they are not considered. We derive optimal polynomial time algorithms for several special cases of the problem. An NP-hard case is shown to admit a fully polynomial time approximation scheme. An application of the problem in energy efficient processors scheduling is considered.
NASA Astrophysics Data System (ADS)
Tanikawa, Ataru; Yoshikawa, Kohji; Okamoto, Takashi; Nitadori, Keigo
2012-02-01
We present a high-performance N-body code for self-gravitating collisional systems accelerated with the aid of a new SIMD instruction set extension of the x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). With one processor core of Intel Core i7-2600 processor (8 MB cache and 3.40 GHz) based on Sandy Bridge micro-architecture, we implemented a fourth-order Hermite scheme with individual timestep scheme ( Makino and Aarseth, 1992), and achieved the performance of ˜20 giga floating point number operations per second (GFLOPS) for double-precision accuracy, which is two times and five times higher than that of the previously developed code implemented with the SSE instructions ( Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor core, respectively. We have parallelized the code by using so-called NINJA scheme ( Nitadori et al., 2006a), and achieved ˜90 GFLOPS for a system containing more than N = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating collisional system with N ˜ 10 5 on massively parallel systems with at most 800 cores with Sandy Bridge micro-architecture. This performance will be comparable to that of Graphic Processing Unit (GPU) cluster systems, such as the one with about 200 Tesla C1070 GPUs ( Spurzem et al., 2010). This paper offers an alternative to collisional N-body simulations with GRAPEs and GPUs.
Parallel computation of fluid-structural interactions using high resolution upwind schemes
NASA Astrophysics Data System (ADS)
Hu, Zongjun
An efficient and accurate solver is developed to simulate the non-linear fluid-structural interactions in turbomachinery flutter flows. A new low diffusion E-CUSP scheme, Zha CUSP scheme, is developed to improve the efficiency and accuracy of the inviscid flux computation. The 3D unsteady Navier-Stokes equations with the Baldwin-Lomax turbulence model are solved using the finite volume method with the dual-time stepping scheme. The linearized equations are solved with Gauss-Seidel line iterations. The parallel computation is implemented using MPI protocol. The solver is validated with 2D cases for its turbulence modeling, parallel computation and unsteady calculation. The Zha CUSP scheme is validated with 2D cases, including a supersonic flat plate boundary layer, a transonic converging-diverging nozzle and a transonic inlet diffuser. The Zha CUSP2 scheme is tested with 3D cases, including a circular-to-rectangular nozzle, a subsonic compressor cascade and a transonic channel. The Zha CUSP schemes are proved to be accurate, robust and efficient in these tests. The steady and unsteady separation flows in a 3D stationary cascade under high incidence and three inlet Mach numbers are calculated to study the steady state separation flow patterns and their unsteady oscillation characteristics. The leading edge vortex shedding is the mechanism behind the unsteady characteristics of the high incidence separated flows. The separation flow characteristics is affected by the inlet Mach number. The blade aeroelasticity of a linear cascade with forced oscillating blades is studied using parallel computation. A simplified two-passage cascade with periodic boundary condition is first calculated under a medium frequency and a low incidence. The full scale cascade with 9 blades and two end walls is then studied more extensively under three oscillation frequencies and two incidence angles. The end wall influence and the blade stability are studied and compared under different frequencies and incidence angles. The Zha CUSP schemes are the first time to be applied in moving grid systems and 2D and 3D calculations. The implicit Gauss-Seidel iteration with dual time stepping is the first time to be used for moving grid systems. The NASA flutter cascade is the first time to be calculated in full scale.
Parallel adaptive discontinuous Galerkin approximation for thin layer avalanche modeling
NASA Astrophysics Data System (ADS)
Patra, A. K.; Nichita, C. C.; Bauer, A. C.; Pitman, E. B.; Bursik, M.; Sheridan, M. F.
2006-08-01
This paper describes the development of highly accurate adaptive discontinuous Galerkin schemes for the solution of the equations arising from a thin layer type model of debris flows. Such flows have wide applicability in the analysis of avalanches induced by many natural calamities, e.g. volcanoes, earthquakes, etc. These schemes are coupled with special parallel solution methodologies to produce a simulation tool capable of very high-order numerical accuracy. The methodology successfully replicates cold rock avalanches at Mount Rainier, Washington and hot volcanic particulate flows at Colima Volcano, Mexico.
Adaptive mesh fluid simulations on GPU
NASA Astrophysics Data System (ADS)
Wang, Peng; Abel, Tom; Kaehler, Ralf
2010-10-01
We describe an implementation of compressible inviscid fluid solvers with block-structured adaptive mesh refinement on Graphics Processing Units using NVIDIA's CUDA. We show that a class of high resolution shock capturing schemes can be mapped naturally on this architecture. Using the method of lines approach with the second order total variation diminishing Runge-Kutta time integration scheme, piecewise linear reconstruction, and a Harten-Lax-van Leer Riemann solver, we achieve an overall speedup of approximately 10 times faster execution on one graphics card as compared to a single core on the host computer. We attain this speedup in uniform grid runs as well as in problems with deep AMR hierarchies. Our framework can readily be applied to more general systems of conservation laws and extended to higher order shock capturing schemes. This is shown directly by an implementation of a magneto-hydrodynamic solver and comparing its performance to the pure hydrodynamic case. Finally, we also combined our CUDA parallel scheme with MPI to make the code run on GPU clusters. Close to ideal speedup is observed on up to four GPUs.
Amako, Jun; Shinozaki, Yu
2016-07-11
We report on a dual-wavelength diffractive beam splitter designed for use in parallel laser processing. This novel optical element generates two beam arrays of different wavelengths and allows their overlap at the process points on a workpiece. To design the deep surface-relief profile of a splitter using a simulated annealing algorithm, we introduce a heuristic but practical scheme to determine the maximum depth and the number of quantization levels. The designed corrugations were fabricated in a photoresist by maskless grayscale exposure using a high-resolution spatial light modulator. We characterized the photoresist splitter, thereby validating the proposed beam-splitting concept.
Lazy checkpoint coordination for bounding rollback propagation
NASA Technical Reports Server (NTRS)
Wang, Yi-Min; Fuchs, W. Kent
1992-01-01
Independent checkpointing allows maximum process autonomy but suffers from potential domino effects. Coordinated checkpointing eliminates the domino effect by sacrificing a certain degree of process autonomy. In this paper, we propose the technique of lazy checkpoint coordination which preserves process autonomy while employing communication-induced checkpoint coordination for bounding rollback propagation. The introduction of the notion of laziness allows a flexible trade-off between the cost for checkpoint coordination and the average rollback distance. Worst-case overhead analysis provides a means for estimating the extra checkpoint overhead. Communication trace-driven simulation for several parallel programs is used to evaluate the benefits of the proposed scheme for real applications.
New Bandwidth Efficient Parallel Concatenated Coding Schemes
NASA Technical Reports Server (NTRS)
Denedetto, S.; Divsalar, D.; Montorsi, G.; Pollara, F.
1996-01-01
We propose a new solution to parallel concatenation of trellis codes with multilevel amplitude/phase modulations and a suitable iterative decoding structure. Examples are given for throughputs 2 bits/sec/Hz with 8PSK and 16QAM signal constellations.
NASA Astrophysics Data System (ADS)
Hasan, Mehedi; Maldonado-Basilio, Ramón; Hall, Trevor J.
2015-04-01
Yin et al. have described an innovative filter-less optical millimeter-wave generation scheme for octotupling of a 10 GHz RF oscillator, or sedecimtupling of a 5 GHz RF oscillator using two parallel dual-parallel Mach-Zehnder modulators (DP-MZMs). The great merit of their design is the suppression of all harmonics except those of order ? (octotupling) or all harmonics except those of order ? (sedecimtupling), where ? is an integer. A demerit of their scheme is the requirement to set a precise RF signal modulation index in order to suppress the zeroth order optical carrier. The purpose of this comment is to show that, in the case of the octotupling function, all harmonics may be suppressed except those of order ?, where ? is an odd integer, by the simple addition of an optical ? phase shift between the two DP-MZMs and an adjustment of the RF drive phases. Since the carrier is suppressed in the modified architecture, the octotupling circuit is thereby released of the strict requirement to set the drive level to a precise value without any significant increase in circuit complexity.
Characterization of robotics parallel algorithms and mapping onto a reconfigurable SIMD machine
NASA Technical Reports Server (NTRS)
Lee, C. S. G.; Lin, C. T.
1989-01-01
The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and analyzed. Their characteristics are identified and a scheme on the mapping of these algorithms to a reconfigurable parallel architecture is presented. Based on the characteristics including type of parallelism, degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirement, it is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. Moreover, they are well-suited to be implemented on a single-instruction-stream multiple-data-stream (SIMD) computer with reconfigurable interconnection network. The model of a reconfigurable dual network SIMD machine with internal direct feedback is introduced. A systematic procedure internal direct feedback is introduced. A systematic procedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are illustrated and discussed.
Park, Hyeong-Gyu; Shin, Yeong-Gil; Lee, Ho
2015-12-01
A ray-driven backprojector is based on ray-tracing, which computes the length of the intersection between the ray paths and each voxel to be reconstructed. To reduce the computational burden caused by these exhaustive intersection tests, we propose a fully graphics processing unit (GPU)-based ray-driven backprojector in conjunction with a ray-culling scheme that enables straightforward parallelization without compromising the high computing performance of a GPU. The purpose of the ray-culling scheme is to reduce the number of ray-voxel intersection tests by excluding rays irrelevant to a specific voxel computation. This rejection step is based on an axis-aligned bounding box (AABB) enclosing a region of voxel projection, where eight vertices of each voxel are projected onto the detector plane. The range of the rectangular-shaped AABB is determined by min/max operations on the coordinates in the region. Using the indices of pixels inside the AABB, the rays passing through the voxel can be identified and the voxel is weighted as the length of intersection between the voxel and the ray. This procedure makes it possible to reflect voxel-level parallelization, allowing an independent calculation at each voxel, which is feasible for a GPU implementation. To eliminate redundant calculations during ray-culling, a shared-memory optimization is applied to exploit the GPU memory hierarchy. In experimental results using real measurement data with phantoms, the proposed GPU-based ray-culling scheme reconstructed a volume of resolution 28032803176 in 77 seconds from 680 projections of resolution 10243768 , which is 26 times and 7.5 times faster than standard CPU-based and GPU-based ray-driven backprojectors, respectively. Qualitative and quantitative analyses showed that the ray-driven backprojector provides high-quality reconstruction images when compared with those generated by the Feldkamp-Davis-Kress algorithm using a pixel-driven backprojector, with an average of 2.5 times higher contrast-to-noise ratio, 1.04 times higher universal quality index, and 1.39 times higher normalized mutual information. © The Author(s) 2014.
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Design of on-board parallel computer on nano-satellite
NASA Astrophysics Data System (ADS)
You, Zheng; Tian, Hexiang; Yu, Shijie; Meng, Li
2007-11-01
This paper provides one scheme of the on-board parallel computer system designed for the Nano-satellite. Based on the development request that the Nano-satellite should have a small volume, low weight, low power cost, and intelligence, this scheme gets rid of the traditional one-computer system and dual-computer system with endeavor to improve the dependability, capability and intelligence simultaneously. According to the method of integration design, it employs the parallel computer system with shared memory as the main structure, connects the telemetric system, attitude control system, and the payload system by the intelligent bus, designs the management which can deal with the static tasks and dynamic task-scheduling, protect and recover the on-site status and so forth in light of the parallel algorithms, and establishes the fault diagnosis, restoration and system restructure mechanism. It accomplishes an on-board parallel computer system with high dependability, capability and intelligence, a flexible management on hardware resources, an excellent software system, and a high ability in extension, which satisfies with the conception and the tendency of the integration electronic design sufficiently.
Use of general purpose graphics processing units with MODFLOW
Hughes, Joseph D.; White, Jeremy T.
2013-01-01
To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete, and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the central processing unit CPU and GPCPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.
NASA Astrophysics Data System (ADS)
Hill, Peter; Shanahan, Brendan; Dudson, Ben
2017-04-01
We present a technique for handling Dirichlet boundary conditions with the Flux Coordinate Independent (FCI) parallel derivative operator with arbitrary-shaped material geometry in general 3D magnetic fields. The FCI method constructs a finite difference scheme for ∇∥ by following field lines between poloidal planes and interpolating within planes. Doing so removes the need for field-aligned coordinate systems that suffer from singularities in the metric tensor at null points in the magnetic field (or equivalently, when q → ∞). One cost of this method is that as the field lines are not on the mesh, they may leave the domain at any point between neighbouring planes, complicating the application of boundary conditions. The Leg Value Fill (LVF) boundary condition scheme presented here involves an extrapolation/interpolation of the boundary value onto the field line end point. The usual finite difference scheme can then be used unmodified. We implement the LVF scheme in BOUT++ and use the Method of Manufactured Solutions to verify the implementation in a rectangular domain, and show that it does not modify the error scaling of the finite difference scheme. The use of LVF for arbitrary wall geometry is outlined. We also demonstrate the feasibility of using the FCI approach in no n-axisymmetric configurations for a simple diffusion model in a "straight stellarator" magnetic field. A Gaussian blob diffuses along the field lines, tracing out flux surfaces. Dirichlet boundary conditions impose a last closed flux surface (LCFS) that confines the density. Including a poloidal limiter moves the LCFS to a smaller radius. The expected scaling of the numerical perpendicular diffusion, which is a consequence of the FCI method, in stellarator-like geometry is recovered. A novel technique for increasing the parallel resolution during post-processing, in order to reduce artefacts in visualisations, is described.
Implementation of density-based solver for all speeds in the framework of OpenFOAM
NASA Astrophysics Data System (ADS)
Shen, Chun; Sun, Fengxian; Xia, Xinlin
2014-10-01
In the framework of open source CFD code OpenFOAM, a density-based solver for all speeds flow field is developed. In this solver the preconditioned all speeds AUSM+(P) scheme is adopted and the dual time scheme is implemented to complete the unsteady process. Parallel computation could be implemented to accelerate the solving process. Different interface reconstruction algorithms are implemented, and their accuracy with respect to convection is compared. Three benchmark tests of lid-driven cavity flow, flow crossing over a bump, and flow over a forward-facing step are presented to show the accuracy of the AUSM+(P) solver for low-speed incompressible flow, transonic flow, and supersonic/hypersonic flow. Firstly, for the lid driven cavity flow, the computational results obtained by different interface reconstruction algorithms are compared. It is indicated that the one dimensional reconstruction scheme adopted in this solver possesses high accuracy and the solver developed in this paper can effectively catch the features of low incompressible flow. Then via the test cases regarding the flow crossing over bump and over forward step, the ability to capture characteristics of the transonic and supersonic/hypersonic flows are confirmed. The forward-facing step proves to be the most challenging for the preconditioned solvers with and without the dual time scheme. Nonetheless, the solvers described in this paper reproduce the main features of this flow, including the evolution of the initial transient.
Ensuring correct rollback recovery in distributed shared memory systems
NASA Technical Reports Server (NTRS)
Janssens, Bob; Fuchs, W. Kent
1995-01-01
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing method for DSM that takes advantage of DSM's specific properties to reduce error-free and rollback overhead. The scheme reduces the dependencies that need to be considered for correct rollback to those resulting from transfers of pages. Furthermore, in-transit messages can be recovered without the use of logging. We extend the scheme to a DSM implementation using lazy release consistency, where the frequency of dependencies is further reduced.
A New Numerical Scheme for Cosmic-Ray Transport
NASA Astrophysics Data System (ADS)
Jiang, Yan-Fei; Oh, S. Peng
2018-02-01
Numerical solutions of the cosmic-ray (CR) magnetohydrodynamic equations are dogged by a powerful numerical instability, which arises from the constraint that CRs can only stream down their gradient. The standard cure is to regularize by adding artificial diffusion. Besides introducing ad hoc smoothing, this has a significant negative impact on either computational cost or complexity and parallel scalings. We describe a new numerical algorithm for CR transport, with close parallels to two-moment methods for radiative transfer under the reduced speed of light approximation. It stably and robustly handles CR streaming without any artificial diffusion. It allows for both isotropic and field-aligned CR streaming and diffusion, with arbitrary streaming and diffusion coefficients. CR transport is handled explicitly, while source terms are handled implicitly. The overall time step scales linearly with resolution (even when computing CR diffusion) and has a perfect parallel scaling. It is given by the standard Courant condition with respect to a constant maximum velocity over the entire simulation domain. The computational cost is comparable to that of solving the ideal MHD equation. We demonstrate the accuracy and stability of this new scheme with a wide variety of tests, including anisotropic streaming and diffusion tests, CR-modified shocks, CR-driven blast waves, and CR transport in multiphase media. The new algorithm opens doors to much more ambitious and hitherto intractable calculations of CR physics in galaxies and galaxy clusters. It can also be applied to other physical processes with similar mathematical structure, such as saturated, anisotropic heat conduction.
Parallel processing using an optical delay-based reservoir computer
NASA Astrophysics Data System (ADS)
Van der Sande, Guy; Nguimdo, Romain Modeste; Verschaffelt, Guy
2016-04-01
Delay systems subject to delayed optical feedback have recently shown great potential in solving computationally hard tasks. By implementing a neuro-inspired computational scheme relying on the transient response to optical data injection, high processing speeds have been demonstrated. However, reservoir computing systems based on delay dynamics discussed in the literature are designed by coupling many different stand-alone components which lead to bulky, lack of long-term stability, non-monolithic systems. Here we numerically investigate the possibility of implementing reservoir computing schemes based on semiconductor ring lasers. Semiconductor ring lasers are semiconductor lasers where the laser cavity consists of a ring-shaped waveguide. SRLs are highly integrable and scalable, making them ideal candidates for key components in photonic integrated circuits. SRLs can generate light in two counterpropagating directions between which bistability has been demonstrated. We demonstrate that two independent machine learning tasks , even with different nature of inputs with different input data signals can be simultaneously computed using a single photonic nonlinear node relying on the parallelism offered by photonics. We illustrate the performance on simultaneous chaotic time series prediction and a classification of the Nonlinear Channel Equalization. We take advantage of different directional modes to process individual tasks. Each directional mode processes one individual task to mitigate possible crosstalk between the tasks. Our results indicate that prediction/classification with errors comparable to the state-of-the-art performance can be obtained even with noise despite the two tasks being computed simultaneously. We also find that a good performance is obtained for both tasks for a broad range of the parameters. The results are discussed in detail in [Nguimdo et al., IEEE Trans. Neural Netw. Learn. Syst. 26, pp. 3301-3307, 2015
Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Whiteside, R.A.; Binkley, J.S.; Colvin, M.E.
1987-02-15
For many years it has been recognized that fundamental physical constraints such as the speed of light will limit the ultimate speed of single processor computers to less than about three billion floating point operations per second (3 GFLOPS). This limitation is becoming increasingly restrictive as commercially available machines are now within an order of magnitude of this asymptotic limit. A natural way to avoid this limit is to harness together many processors to work on a single computational problem. In principle, these parallel processing computers have speeds limited only by the number of processors one chooses to acquire. Themore » usefulness of potentially unlimited processing speed to a computationally intensive field such as quantum chemistry is obvious. If these methods are to be applied to significantly larger chemical systems, parallel schemes will have to be employed. For this reason we have developed distributed-memory algorithms for a number of standard quantum chemical methods. We are currently implementing these on a 32 processor Intel hypercube. In this paper we present our algorithm and benchmark results for one of the bottleneck steps in quantum chemical calculations: the four index integral transformation.« less
A mixed parallel strategy for the solution of coupled multi-scale problems at finite strains
NASA Astrophysics Data System (ADS)
Lopes, I. A. Rodrigues; Pires, F. M. Andrade; Reis, F. J. P.
2018-02-01
A mixed parallel strategy for the solution of homogenization-based multi-scale constitutive problems undergoing finite strains is proposed. The approach aims to reduce the computational time and memory requirements of non-linear coupled simulations that use finite element discretization at both scales (FE^2). In the first level of the algorithm, a non-conforming domain decomposition technique, based on the FETI method combined with a mortar discretization at the interface of macroscopic subdomains, is employed. A master-slave scheme, which distributes tasks by macroscopic element and adopts dynamic scheduling, is then used for each macroscopic subdomain composing the second level of the algorithm. This strategy allows the parallelization of FE^2 simulations in computers with either shared memory or distributed memory architectures. The proposed strategy preserves the quadratic rates of asymptotic convergence that characterize the Newton-Raphson scheme. Several examples are presented to demonstrate the robustness and efficiency of the proposed parallel strategy.
A more secure parallel keyed hash function based on chaotic neural network
NASA Astrophysics Data System (ADS)
Huang, Zhongquan
2011-08-01
Although various hash functions based on chaos or chaotic neural network were proposed, most of them can not work efficiently in parallel computing environment. Recently, an algorithm for parallel keyed hash function construction based on chaotic neural network was proposed [13]. However, there is a strict limitation in this scheme that its secret keys must be nonce numbers. In other words, if the keys are used more than once in this scheme, there will be some potential security flaw. In this paper, we analyze the cause of vulnerability of the original one in detail, and then propose the corresponding enhancement measures, which can remove the limitation on the secret keys. Theoretical analysis and computer simulation indicate that the modified hash function is more secure and practical than the original one. At the same time, it can keep the parallel merit and satisfy the other performance requirements of hash function, such as good statistical properties, high message and key sensitivity, and strong collision resistance, etc.
NASA Astrophysics Data System (ADS)
Chan, Chia-Hsin; Tu, Chun-Chuan; Tsai, Wen-Jiin
2017-01-01
High efficiency video coding (HEVC) not only improves the coding efficiency drastically compared to the well-known H.264/AVC but also introduces coding tools for parallel processing, one of which is tiles. Tile partitioning is allowed to be arbitrary in HEVC, but how to decide tile boundaries remains an open issue. An adaptive tile boundary (ATB) method is proposed to select a better tile partitioning to improve load balancing (ATB-LoadB) and coding efficiency (ATB-Gain) with a unified scheme. Experimental results show that, compared to ordinary uniform-space partitioning, the proposed ATB can save up to 17.65% of encoding times in parallel encoding scenarios and can reduce up to 0.8% of total bit rates for coding efficiency.
Hidri, Lotfi; Gharbi, Anis; Louly, Mohamed Aly
2014-01-01
We focus on the two-center hybrid flow shop scheduling problem with identical parallel machines and removal times. The job removal time is the required duration to remove it from a machine after its processing. The objective is to minimize the maximum completion time (makespan). A heuristic and a lower bound are proposed for this NP-Hard problem. These procedures are based on the optimal solution of the parallel machine scheduling problem with release dates and delivery times. The heuristic is composed of two phases. The first one is a constructive phase in which an initial feasible solution is provided, while the second phase is an improvement one. Intensive computational experiments have been conducted to confirm the good performance of the proposed procedures.
Efficient Bounding Schemes for the Two-Center Hybrid Flow Shop Scheduling Problem with Removal Times
2014-01-01
We focus on the two-center hybrid flow shop scheduling problem with identical parallel machines and removal times. The job removal time is the required duration to remove it from a machine after its processing. The objective is to minimize the maximum completion time (makespan). A heuristic and a lower bound are proposed for this NP-Hard problem. These procedures are based on the optimal solution of the parallel machine scheduling problem with release dates and delivery times. The heuristic is composed of two phases. The first one is a constructive phase in which an initial feasible solution is provided, while the second phase is an improvement one. Intensive computational experiments have been conducted to confirm the good performance of the proposed procedures. PMID:25610911
A parallel algorithm for step- and chain-growth polymerization in molecular dynamics.
de Buyl, Pierre; Nies, Erik
2015-04-07
Classical Molecular Dynamics (MD) simulations provide insight into the properties of many soft-matter systems. In some situations, it is interesting to model the creation of chemical bonds, a process that is not part of the MD framework. In this context, we propose a parallel algorithm for step- and chain-growth polymerization that is based on a generic reaction scheme, works at a given intrinsic rate and produces continuous trajectories. We present an implementation in the ESPResSo++ simulation software and compare it with the corresponding feature in LAMMPS. For chain growth, our results are compared to the existing simulation literature. For step growth, a rate equation is proposed for the evolution of the crosslinker population that compares well to the simulations for low crosslinker functionality or for short times.
A parallel algorithm for step- and chain-growth polymerization in molecular dynamics
NASA Astrophysics Data System (ADS)
de Buyl, Pierre; Nies, Erik
2015-04-01
Classical Molecular Dynamics (MD) simulations provide insight into the properties of many soft-matter systems. In some situations, it is interesting to model the creation of chemical bonds, a process that is not part of the MD framework. In this context, we propose a parallel algorithm for step- and chain-growth polymerization that is based on a generic reaction scheme, works at a given intrinsic rate and produces continuous trajectories. We present an implementation in the ESPResSo++ simulation software and compare it with the corresponding feature in LAMMPS. For chain growth, our results are compared to the existing simulation literature. For step growth, a rate equation is proposed for the evolution of the crosslinker population that compares well to the simulations for low crosslinker functionality or for short times.
Error Control Coding Techniques for Space and Satellite Communications
NASA Technical Reports Server (NTRS)
Costello, Daniel J., Jr.; Takeshita, Oscar Y.; Cabral, Hermano A.
1998-01-01
It is well known that the BER performance of a parallel concatenated turbo-code improves roughly as 1/N, where N is the information block length. However, it has been observed by Benedetto and Montorsi that for most parallel concatenated turbo-codes, the FER performance does not improve monotonically with N. In this report, we study the FER of turbo-codes, and the effects of their concatenation with an outer code. Two methods of concatenation are investigated: across several frames and within each frame. Some asymmetric codes are shown to have excellent FER performance with an information block length of 16384. We also show that the proposed outer coding schemes can improve the BER performance as well by eliminating pathological frames generated by the iterative MAP decoding process.
Deducing trapdoor primitives in public key encryption schemes
NASA Astrophysics Data System (ADS)
Pandey, Chandra
2005-03-01
Semantic security of public key encryption schemes is often interchangeable with the art of building trapdoors. In the frame of reference of Random Oracle methodology, the "Key Privacy" and "Anonymity" has often been discussed. However to a certain degree the security of most public key encryption schemes is required to be analyzed with formal proofs using one-way functions. This paper evaluates the design of El Gamal and RSA based schemes and attempts to parallelize the trapdoor primitives used in the computation of the cipher text, thereby magnifying the decryption error δp in the above schemes.
Analysis of composite ablators using massively parallel computation
NASA Technical Reports Server (NTRS)
Shia, David
1995-01-01
In this work, the feasibility of using massively parallel computation to study the response of ablative materials is investigated. Explicit and implicit finite difference methods are used on a massively parallel computer, the Thinking Machines CM-5. The governing equations are a set of nonlinear partial differential equations. The governing equations are developed for three sample problems: (1) transpiration cooling, (2) ablative composite plate, and (3) restrained thermal growth testing. The transpiration cooling problem is solved using a solution scheme based solely on the explicit finite difference method. The results are compared with available analytical steady-state through-thickness temperature and pressure distributions and good agreement between the numerical and analytical solutions is found. It is also found that a solution scheme based on the explicit finite difference method has the following advantages: incorporates complex physics easily, results in a simple algorithm, and is easily parallelizable. However, a solution scheme of this kind needs very small time steps to maintain stability. A solution scheme based on the implicit finite difference method has the advantage that it does not require very small times steps to maintain stability. However, this kind of solution scheme has the disadvantages that complex physics cannot be easily incorporated into the algorithm and that the solution scheme is difficult to parallelize. A hybrid solution scheme is then developed to combine the strengths of the explicit and implicit finite difference methods and minimize their weaknesses. This is achieved by identifying the critical time scale associated with the governing equations and applying the appropriate finite difference method according to this critical time scale. The hybrid solution scheme is then applied to the ablative composite plate and restrained thermal growth problems. The gas storage term is included in the explicit pressure calculation of both problems. Results from ablative composite plate problems are compared with previous numerical results which did not include the gas storage term. It is found that the through-thickness temperature distribution is not affected much by the gas storage term. However, the through-thickness pressure and stress distributions, and the extent of chemical reactions are different from the previous numerical results. Two types of chemical reaction models are used in the restrained thermal growth testing problem: (1) pressure-independent Arrhenius type rate equations and (2) pressure-dependent Arrhenius type rate equations. The numerical results are compared to experimental results and the pressure-dependent model is able to capture the trend better than the pressure-independent one. Finally, a performance study is done on the hybrid algorithm using the ablative composite plate problem. It is found that there is a good speedup of performance on the CM-5. For 32 CPU's, the speedup of performance is 20. The efficiency of the algorithm is found to be a function of the size and execution time of a given problem and the effective parallelization of the algorithm. It also seems that there is an optimum number of CPU's to use for a given problem.
Particle simulation of plasmas on the massively parallel processor
NASA Technical Reports Server (NTRS)
Gledhill, I. M. A.; Storey, L. R. O.
1987-01-01
Particle simulations, in which collective phenomena in plasmas are studied by following the self consistent motions of many discrete particles, involve several highly repetitive sets of calculations that are readily adaptable to SIMD parallel processing. A fully electromagnetic, relativistic plasma simulation for the massively parallel processor is described. The particle motions are followed in 2 1/2 dimensions on a 128 x 128 grid, with periodic boundary conditions. The two dimensional simulation space is mapped directly onto the processor network; a Fast Fourier Transform is used to solve the field equations. Particle data are stored according to an Eulerian scheme, i.e., the information associated with each particle is moved from one local memory to another as the particle moves across the spatial grid. The method is applied to the study of the nonlinear development of the whistler instability in a magnetospheric plasma model, with an anisotropic electron temperature. The wave distribution function is included as a new diagnostic to allow simulation results to be compared with satellite observations.
Apparatus for and method of simulating turbulence
Dimas, Athanassios; Lottati, Isaac; Bernard, Peter; Collins, James; Geiger, James C.
2003-01-01
In accordance with a preferred embodiment of the invention, a novel apparatus for and method of simulating physical processes such as fluid flow is provided. Fluid flow near a boundary or wall of an object is represented by a collection of vortex sheet layers. The layers are composed of a grid or mesh of one or more geometrically shaped space filling elements. In the preferred embodiment, the space filling elements take on a triangular shape. An Eulerian approach is employed for the vortex sheets, where a finite-volume scheme is used on the prismatic grid formed by the vortex sheet layers. A Lagrangian approach is employed for the vortical elements (e.g., vortex tubes or filaments) found in the remainder of the flow domain. To reduce the computational time, a hairpin removal scheme is employed to reduce the number of vortex filaments, and a Fast Multipole Method (FMM), preferably implemented using parallel processing techniques, reduces the computation of the velocity field.
A Blocked Linear Method for Optimizing Large Parameter Sets in Variational Monte Carlo
Zhao, Luning; Neuscamman, Eric
2017-05-17
We present a modification to variational Monte Carlo’s linear method optimization scheme that addresses a critical memory bottleneck while maintaining compatibility with both the traditional ground state variational principle and our recently-introduced variational principle for excited states. For wave function ansatzes with tens of thousands of variables, our modification reduces the required memory per parallel process from tens of gigabytes to hundreds of megabytes, making the methodology a much better fit for modern supercomputer architectures in which data communication and per-process memory consumption are primary concerns. We verify the efficacy of the new optimization scheme in small molecule tests involvingmore » both the Hilbert space Jastrow antisymmetric geminal power ansatz and real space multi-Slater Jastrow expansions. Satisfied with its performance, we have added the optimizer to the QMCPACK software package, with which we demonstrate on a hydrogen ring a prototype approach for making systematically convergent, non-perturbative predictions of Mott-insulators’ optical band gaps.« less
Lee, Tian-Fu; Liu, Chuan-Ming
2013-06-01
A smart-card based authentication scheme for telecare medicine information systems enables patients, doctors, nurses, health visitors and the medicine information systems to establish a secure communication platform through public networks. Zhu recently presented an improved authentication scheme in order to solve the weakness of the authentication scheme of Wei et al., where the off-line password guessing attacks cannot be resisted. This investigation indicates that the improved scheme of Zhu has some faults such that the authentication scheme cannot execute correctly and is vulnerable to the attack of parallel sessions. Additionally, an enhanced authentication scheme based on the scheme of Zhu is proposed. The enhanced scheme not only avoids the weakness in the original scheme, but also provides users' anonymity and authenticated key agreements for secure data communications.
Parallel block schemes for large scale least squares computations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Golub, G.H.; Plemmons, R.J.; Sameh, A.
1986-04-01
Large scale least squares computations arise in a variety of scientific and engineering problems, including geodetic adjustments and surveys, medical image analysis, molecular structures, partial differential equations and substructuring methods in structural engineering. In each of these problems, matrices often arise which possess a block structure which reflects the local connection nature of the underlying physical problem. For example, such super-large nonlinear least squares computations arise in geodesy. Here the coordinates of positions are calculated by iteratively solving overdetermined systems of nonlinear equations by the Gauss-Newton method. The US National Geodetic Survey will complete this year (1986) the readjustment ofmore » the North American Datum, a problem which involves over 540 thousand unknowns and over 6.5 million observations (equations). The observation matrix for these least squares computations has a block angular form with 161 diagnonal blocks, each containing 3 to 4 thousand unknowns. In this paper parallel schemes are suggested for the orthogonal factorization of matrices in block angular form and for the associated backsubstitution phase of the least squares computations. In addition, a parallel scheme for the calculation of certain elements of the covariance matrix for such problems is described. It is shown that these algorithms are ideally suited for multiprocessors with three levels of parallelism such as the Cedar system at the University of Illinois. 20 refs., 7 figs.« less
NASA Astrophysics Data System (ADS)
Ku, Seung-Hoe; Hager, R.; Chang, C. S.; Chacon, L.; Chen, G.; EPSI Team
2016-10-01
The cancelation problem has been a long-standing issue for long wavelengths modes in electromagnetic gyrokinetic PIC simulations in toroidal geometry. As an attempt of resolving this issue, we implemented a fully implicit time integration scheme in the full-f, gyrokinetic PIC code XGC1. The new scheme - based on the implicit Vlasov-Darwin PIC algorithm by G. Chen and L. Chacon - can potentially resolve cancelation problem. The time advance for the field and the particle equations is space-time-centered, with particle sub-cycling. The resulting system of equations is solved by a Picard iteration solver with fixed-point accelerator. The algorithm is implemented in the parallel velocity formalism instead of the canonical parallel momentum formalism. XGC1 specializes in simulating the tokamak edge plasma with magnetic separatrix geometry. A fully implicit scheme could be a way to accurate and efficient gyrokinetic simulations. We will test if this numerical scheme overcomes the cancelation problem, and reproduces the dispersion relation of Alfven waves and tearing modes in cylindrical geometry. Funded by US DOE FES and ASCR, and computing resources provided by OLCF through ALCC.
Highly Parallel Alternating Directions Algorithm for Time Dependent Problems
NASA Astrophysics Data System (ADS)
Ganzha, M.; Georgiev, K.; Lirkov, I.; Margenov, S.; Paprzycki, M.
2011-11-01
In our work, we consider the time dependent Stokes equation on a finite time interval and on a uniform rectangular mesh, written in terms of velocity and pressure. For this problem, a parallel algorithm based on a novel direction splitting approach is developed. Here, the pressure equation is derived from a perturbed form of the continuity equation, in which the incompressibility constraint is penalized in a negative norm induced by the direction splitting. The scheme used in the algorithm is composed of two parts: (i) velocity prediction, and (ii) pressure correction. This is a Crank-Nicolson-type two-stage time integration scheme for two and three dimensional parabolic problems in which the second-order derivative, with respect to each space variable, is treated implicitly while the other variable is made explicit at each time sub-step. In order to achieve a good parallel performance the solution of the Poison problem for the pressure correction is replaced by solving a sequence of one-dimensional second order elliptic boundary value problems in each spatial direction. The parallel code is implemented using the standard MPI functions and tested on two modern parallel computer systems. The performed numerical tests demonstrate good level of parallel efficiency and scalability of the studied direction-splitting-based algorithm.
Effect of asynchrony on numerical simulations of fluid flow phenomena
NASA Astrophysics Data System (ADS)
Konduri, Aditya; Mahoney, Bryan; Donzis, Diego
2015-11-01
Designing scalable CFD codes on massively parallel computers is a challenge. This is mainly due to the large number of communications between processing elements (PEs) and their synchronization, leading to idling of PEs. Indeed, communication will likely be the bottleneck in the scalability of codes on Exascale machines. Our recent work on asynchronous computing for PDEs based on finite-differences has shown that it is possible to relax synchronization between PEs at a mathematical level. Computations then proceed regardless of the status of communication, reducing the idle time of PEs and improving the scalability. However, accuracy of the schemes is greatly affected. We have proposed asynchrony-tolerant (AT) schemes to address this issue. In this work, we study the effect of asynchrony on the solution of fluid flow problems using standard and AT schemes. We show that asynchrony creates additional scales with low energy content. The specific wavenumbers affected can be shown to be due to two distinct effects: the randomness in the arrival of messages and the corresponding switching between schemes. Understanding these errors allow us to effectively control them, rendering the method's feasibility in solving turbulent flows at realistic conditions on future computing systems.
Adaptive Intuitionistic Fuzzy Enhancement of Brain Tumor MR Images
NASA Astrophysics Data System (ADS)
Deng, He; Deng, Wankai; Sun, Xianping; Ye, Chaohui; Zhou, Xin
2016-10-01
Image enhancement techniques are able to improve the contrast and visual quality of magnetic resonance (MR) images. However, conventional methods cannot make up some deficiencies encountered by respective brain tumor MR imaging modes. In this paper, we propose an adaptive intuitionistic fuzzy sets-based scheme, called as AIFE, which takes information provided from different MR acquisitions and tries to enhance the normal and abnormal structural regions of the brain while displaying the enhanced results as a single image. The AIFE scheme firstly separates an input image into several sub images, then divides each sub image into object and background areas. After that, different novel fuzzification, hyperbolization and defuzzification operations are implemented on each object/background area, and finally an enhanced result is achieved via nonlinear fusion operators. The fuzzy implementations can be processed in parallel. Real data experiments demonstrate that the AIFE scheme is not only effectively useful to have information from images acquired with different MR sequences fused in a single image, but also has better enhancement performance when compared to conventional baseline algorithms. This indicates that the proposed AIFE scheme has potential for improving the detection and diagnosis of brain tumors.
NASA Astrophysics Data System (ADS)
Kwon, Deuk-Chul; Shin, Sung-Sik; Yu, Dong-Hun
2017-10-01
In order to reduce the computing time in simulation of radio frequency (rf) plasma sources, various numerical schemes were developed. It is well known that the upwind, exponential, and power-law schemes can efficiently overcome the limitation on the grid size for fluid transport simulations of high density plasma discharges. Also, the semi-implicit method is a well-known numerical scheme to overcome on the simulation time step. However, despite remarkable advances in numerical techniques and computing power over the last few decades, efficient multi-dimensional modeling of low temperature plasma discharges has remained a considerable challenge. In particular, there was a difficulty on parallelization in time for the time periodic steady state problems such as capacitively coupled plasma discharges and rf sheath dynamics because values of plasma parameters in previous time step are used to calculate new values each time step. Therefore, we present a parallelization method for the time periodic steady state problems by using period-slices. In order to evaluate the efficiency of the developed method, one-dimensional fluid simulations are conducted for describing rf sheath dynamics. The result shows that speedup can be achieved by using a multithreading method.
OpenMP parallelization of a gridded SWAT (SWATG)
NASA Astrophysics Data System (ADS)
Zhang, Ying; Hou, Jinliang; Cao, Yongpan; Gu, Juan; Huang, Chunlin
2017-12-01
Large-scale, long-term and high spatial resolution simulation is a common issue in environmental modeling. A Gridded Hydrologic Response Unit (HRU)-based Soil and Water Assessment Tool (SWATG) that integrates grid modeling scheme with different spatial representations also presents such problems. The time-consuming problem affects applications of very high resolution large-scale watershed modeling. The OpenMP (Open Multi-Processing) parallel application interface is integrated with SWATG (called SWATGP) to accelerate grid modeling based on the HRU level. Such parallel implementation takes better advantage of the computational power of a shared memory computer system. We conducted two experiments at multiple temporal and spatial scales of hydrological modeling using SWATG and SWATGP on a high-end server. At 500-m resolution, SWATGP was found to be up to nine times faster than SWATG in modeling over a roughly 2000 km2 watershed with 1 CPU and a 15 thread configuration. The study results demonstrate that parallel models save considerable time relative to traditional sequential simulation runs. Parallel computations of environmental models are beneficial for model applications, especially at large spatial and temporal scales and at high resolutions. The proposed SWATGP model is thus a promising tool for large-scale and high-resolution water resources research and management in addition to offering data fusion and model coupling ability.
Microfluidic Platform for Parallel Single Cell Analysis for Diagnostic Applications.
Le Gac, Séverine
2017-01-01
Cell populations are heterogeneous: they can comprise different cell types or even cells at different stages of the cell cycle and/or of biological processes. Furthermore, molecular processes taking place in cells are stochastic in nature. Therefore, cellular analysis must be brought down to the single cell level to get useful insight into biological processes, and to access essential molecular information that would be lost when using a cell population analysis approach. Furthermore, to fully characterize a cell population, ideally, information both at the single cell level and on the whole cell population is required, which calls for analyzing each individual cell in a population in a parallel manner. This single cell level analysis approach is particularly important for diagnostic applications to unravel molecular perturbations at the onset of a disease, to identify biomarkers, and for personalized medicine, not only because of the heterogeneity of the cell sample, but also due to the availability of a reduced amount of cells, or even unique cells. This chapter presents a versatile platform meant for the parallel analysis of individual cells, with a particular focus on diagnostic applications and the analysis of cancer cells. We first describe one essential step of this parallel single cell analysis protocol, which is the trapping of individual cells in dedicated structures. Following this, we report different steps of a whole analytical process, including on-chip cell staining and imaging, cell membrane permeabilization and/or lysis using either chemical or physical means, and retrieval of the cell molecular content in dedicated channels for further analysis. This series of experiments illustrates the versatility of the herein-presented platform and its suitability for various analysis schemes and different analytical purposes.
NASA Astrophysics Data System (ADS)
Lasher, Mark E.; Henderson, Thomas B.; Drake, Barry L.; Bocker, Richard P.
1986-09-01
The modified signed-digit (MSD) number representation offers full parallel, carry-free addition. A MSD adder has been described by the authors. This paper describes how the adder can be used in a tree structure to implement an optical multiply algorithm. Three different optical schemes, involving position, polarization, and intensity encoding, are proposed for realizing the trinary logic system. When configured in the generic multiplier architecture, these schemes yield the combinatorial logic necessary to carry out the multiplication algorithm. The optical systems are essentially three dimensional arrangements composed of modular units. Of course, this modularity is important for design considerations, while the parallelism and noninterfering communication channels of optical systems are important from the standpoint of reduced complexity. The authors have also designed electronic hardware to demonstrate and model the combinatorial logic required to carry out the algorithm. The electronic and proposed optical systems will be compared in terms of complexity and speed.
NASA Astrophysics Data System (ADS)
Hegde, Ganapathi; Vaya, Pukhraj
2013-10-01
This article presents a parallel architecture for 3-D discrete wavelet transform (3-DDWT). The proposed design is based on the 1-D pipelined lifting scheme. The architecture is fully scalable beyond the present coherent Daubechies filter bank (9, 7). This 3-DDWT architecture has advantages such as no group of pictures restriction and reduced memory referencing. It offers low power consumption, low latency and high throughput. The computing technique is based on the concept that lifting scheme minimises the storage requirement. The application specific integrated circuit implementation of the proposed architecture is done by synthesising it using 65 nm Taiwan Semiconductor Manufacturing Company standard cell library. It offers a speed of 486 MHz with a power consumption of 2.56 mW. This architecture is suitable for real-time video compression even with large frame dimensions.
AZTEC: A parallel iterative package for the solving linear systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hutchinson, S.A.; Shadid, J.N.; Tuminaro, R.S.
1996-12-31
We describe a parallel linear system package, AZTEC. The package incorporates a number of parallel iterative methods (e.g. GMRES, biCGSTAB, CGS, TFQMR) and preconditioners (e.g. Jacobi, Gauss-Seidel, polynomial, domain decomposition with LU or ILU within subdomains). Additionally, AZTEC allows for the reuse of previous preconditioning factorizations within Newton schemes for nonlinear methods. Currently, a number of different users are using this package to solve a variety of PDE applications.
Mozaffari, Brian
2014-01-01
Based on the notion that the brain is equipped with a hierarchical organization, which embodies environmental contingencies across many time scales, this paper suggests that the medial temporal lobe (MTL)-located deep in the hierarchy-serves as a bridge connecting supra- to infra-MTL levels. Bridging the upper and lower regions of the hierarchy provides a parallel architecture that optimizes information flow between upper and lower regions to aid attention, encoding, and processing of quick complex visual phenomenon. Bypassing intermediate hierarchy levels, information conveyed through the MTL "bridge" allows upper levels to make educated predictions about the prevailing context and accordingly select lower representations to increase the efficiency of predictive coding throughout the hierarchy. This selection or activation/deactivation is associated with endogenous attention. In the event that these "bridge" predictions are inaccurate, this architecture enables the rapid encoding of novel contingencies. A review of hierarchical models in relation to memory is provided along with a new theory, Medial-temporal-lobe Conduit for Parallel Connectivity (MCPC). In this scheme, consolidation is considered as a secondary process, occurring after a MTL-bridged connection, which eventually allows upper and lower levels to access each other directly. With repeated reactivations, as contingencies become consolidated, less MTL activity is predicted. Finally, MTL bridging may aid processing transient but structured perceptual events, by allowing communication between upper and lower levels without calling on intermediate levels of representation.
A Dual Super-Element Domain Decomposition Approach for Parallel Nonlinear Finite Element Analysis
NASA Astrophysics Data System (ADS)
Jokhio, G. A.; Izzuddin, B. A.
2015-05-01
This article presents a new domain decomposition method for nonlinear finite element analysis introducing the concept of dual partition super-elements. The method extends ideas from the displacement frame method and is ideally suited for parallel nonlinear static/dynamic analysis of structural systems. In the new method, domain decomposition is realized by replacing one or more subdomains in a "parent system," each with a placeholder super-element, where the subdomains are processed separately as "child partitions," each wrapped by a dual super-element along the partition boundary. The analysis of the overall system, including the satisfaction of equilibrium and compatibility at all partition boundaries, is realized through direct communication between all pairs of placeholder and dual super-elements. The proposed method has particular advantages for matrix solution methods based on the frontal scheme, and can be readily implemented for existing finite element analysis programs to achieve parallelization on distributed memory systems with minimal intervention, thus overcoming memory bottlenecks typically faced in the analysis of large-scale problems. Several examples are presented in this article which demonstrate the computational benefits of the proposed parallel domain decomposition approach and its applicability to the nonlinear structural analysis of realistic structural systems.
NASA Astrophysics Data System (ADS)
Tabekina, N. A.; Chepchurov, M. S.; Evtushenko, E. I.; Dmitrievsky, B. S.
2018-05-01
The work solves the problem of automation of machining process namely turning to produce parts having the planes parallel to an axis of rotation of part without using special tools. According to the results, the availability of the equipment of a high speed electromechanical drive to control the operative movements of lathe machine will enable one to get the planes parallel to the part axis. The method of getting planes parallel to the part axis is based on the mathematical model, which is presented as functional dependency between the conveying velocity of the driven element and the time. It describes the operative movements of lathe machine all over the tool path. Using the model of movement of the tool, it has been found that the conveying velocity varies from the maximum to zero value. It will allow one to carry out the reverse of the drive. The scheme of tool placement regarding the workpiece has been proposed for unidirectional movement of the driven element at high conveying velocity. The control method of CNC machines can be used for getting geometrically complex parts on the lathe without using special milling tools.
Doll, J.; Dupuis, P.; Nyquist, P.
2017-02-08
Parallel tempering, or replica exchange, is a popular method for simulating complex systems. The idea is to run parallel simulations at different temperatures, and at a given swap rate exchange configurations between the parallel simulations. From the perspective of large deviations it is optimal to let the swap rate tend to infinity and it is possible to construct a corresponding simulation scheme, known as infinite swapping. In this paper we propose a novel use of large deviations for empirical measures for a more detailed analysis of the infinite swapping limit in the setting of continuous time jump Markov processes. Usingmore » the large deviations rate function and associated stochastic control problems we consider a diagnostic based on temperature assignments, which can be easily computed during a simulation. We show that the convergence of this diagnostic to its a priori known limit is a necessary condition for the convergence of infinite swapping. The rate function is also used to investigate the impact of asymmetries in the underlying potential landscape, and where in the state space poor sampling is most likely to occur.« less
3D simulations of early blood vessel formation
NASA Astrophysics Data System (ADS)
Cavalli, F.; Gamba, A.; Naldi, G.; Semplice, M.; Valdembri, D.; Serini, G.
2007-08-01
Blood vessel networks form by spontaneous aggregation of individual cells migrating toward vascularization sites (vasculogenesis). A successful theoretical model of two-dimensional experimental vasculogenesis has been recently proposed, showing the relevance of percolation concepts and of cell cross-talk (chemotactic autocrine loop) to the understanding of this self-aggregation process. Here we study the natural 3D extension of the computational model proposed earlier, which is relevant for the investigation of the genuinely three-dimensional process of vasculogenesis in vertebrate embryos. The computational model is based on a multidimensional Burgers equation coupled with a reaction diffusion equation for a chemotactic factor and a mass conservation law. The numerical approximation of the computational model is obtained by high order relaxed schemes. Space and time discretization are performed by using TVD schemes and, respectively, IMEX schemes. Due to the computational costs of realistic simulations, we have implemented the numerical algorithm on a cluster for parallel computation. Starting from initial conditions mimicking the experimentally observed ones, numerical simulations produce network-like structures qualitatively similar to those observed in the early stages of in vivo vasculogenesis. We develop the computation of critical percolative indices as a robust measure of the network geometry as a first step towards the comparison of computational and experimental data.
Electron acceleration by surface plasma waves in double metal surface structure
NASA Astrophysics Data System (ADS)
Liu, C. S.; Kumar, Gagan; Singh, D. B.; Tripathi, V. K.
2007-12-01
Two parallel metal sheets, separated by a vacuum region, support a surface plasma wave whose amplitude is maximum on the two parallel interfaces and minimum in the middle. This mode can be excited by a laser using a glass prism. An electron beam launched into the middle region experiences a longitudinal ponderomotive force due to the surface plasma wave and gets accelerated to velocities of the order of phase velocity of the surface wave. The scheme is viable to achieve beams of tens of keV energy. In the case of a surface plasma wave excited on a single metal-vacuum interface, the field gradient normal to the interface pushes the electrons away from the high field region, limiting the acceleration process. The acceleration energy thus achieved is in agreement with the experimental observations.
Hine, N D M; Haynes, P D; Mostofi, A A; Payne, M C
2010-09-21
We present calculations of formation energies of defects in an ionic solid (Al(2)O(3)) extrapolated to the dilute limit, corresponding to a simulation cell of infinite size. The large-scale calculations required for this extrapolation are enabled by developments in the approach to parallel sparse matrix algebra operations, which are central to linear-scaling density-functional theory calculations. The computational cost of manipulating sparse matrices, whose sizes are determined by the large number of basis functions present, is greatly improved with this new approach. We present details of the sparse algebra scheme implemented in the ONETEP code using hierarchical sparsity patterns, and demonstrate its use in calculations on a wide range of systems, involving thousands of atoms on hundreds to thousands of parallel processes.
Hierarchial parallel computer architecture defined by computational multidisciplinary mechanics
NASA Technical Reports Server (NTRS)
Padovan, Joe; Gute, Doug; Johnson, Keith
1989-01-01
The goal is to develop an architecture for parallel processors enabling optimal handling of multi-disciplinary computation of fluid-solid simulations employing finite element and difference schemes. The goals, philosphical and modeling directions, static and dynamic poly trees, example problems, interpolative reduction, the impact on solvers are shown in viewgraph form.
NASA Astrophysics Data System (ADS)
Clay, M. P.; Yeung, P. K.; Buaria, D.; Gotoh, T.
2017-11-01
Turbulent mixing at high Schmidt number is a multiscale problem which places demanding requirements on direct numerical simulations to resolve fluctuations down the to Batchelor scale. We use a dual-grid, dual-scheme and dual-communicator approach where velocity and scalar fields are computed by separate groups of parallel processes, the latter using a combined compact finite difference (CCD) scheme on finer grid with a static 3-D domain decomposition free of the communication overhead of memory transposes. A high degree of scalability is achieved for a 81923 scalar field at Schmidt number 512 in turbulence with a modest inertial range, by overlapping communication with computation whenever possible. On the Cray XE6 partition of Blue Waters, use of a dedicated thread for communication combined with OpenMP locks and nested parallelism reduces CCD timings by 34% compared to an MPI baseline. The code has been further optimized for the 27-petaflops Cray XK7 machine Titan using GPUs as accelerators with the latest OpenMP 4.5 directives, giving 2.7X speedup compared to CPU-only execution at the largest problem size. Supported by NSF Grant ACI-1036170, the NCSA Blue Waters Project with subaward via UIUC, and a DOE INCITE allocation at ORNL.
Burst-mode optical label processor with ultralow power consumption.
Ibrahim, Salah; Nakahara, Tatsushi; Ishikawa, Hiroshi; Takahashi, Ryo
2016-04-04
A novel label processor subsystem for 100-Gbps (25-Gbps × 4λs) burst-mode optical packets is developed, in which a highly energy-efficient method is pursued for extracting and interfacing the ultrafast packet-label to a CMOS-based processor where label recognition takes place. The method involves performing serial-to-parallel conversion for the label bits on a bit-by-bit basis by using an optoelectronic converter that is operated with a set of optical triggers generated in a burst-mode manner upon packet arrival. Here we present three key achievements that enabled a significant reduction in the total power consumption and latency of the whole subsystem; 1) based on a novel operation mechanism for providing amplification with bit-level selectivity, an optical trigger pulse generator, that consumes power for a very short duration upon packet arrival, is proposed and experimentally demonstrated, 2) the energy of optical triggers needed by the optoelectronic serial-to-parallel converter is reduced by utilizing a negative-polarity signal while employing an enhanced conversion scheme entitled the discharge-or-hold scheme, 3) the necessary optical trigger energy is further cut down by half by coupling the triggers through the chip's backside, whereas a novel lens-free packaging method is developed to enable a low-cost alignment process that works with simple visual observation.
UNFOLD-SENSE: a parallel MRI method with self-calibration and artifact suppression.
Madore, Bruno
2004-08-01
This work aims at improving the performance of parallel imaging by using it with our "unaliasing by Fourier-encoding the overlaps in the temporal dimension" (UNFOLD) temporal strategy. A self-calibration method called "self, hybrid referencing with UNFOLD and GRAPPA" (SHRUG) is presented. SHRUG combines the UNFOLD-based sensitivity mapping strategy introduced in the TSENSE method by Kellman et al. (5), with the strategy introduced in the GRAPPA method by Griswold et al. (10). SHRUG merges the two approaches to alleviate their respective limitations, and provides fast self-calibration at any given acceleration factor. UNFOLD-SENSE further includes an UNFOLD artifact suppression scheme to significantly suppress artifacts and amplified noise produced by parallel imaging. This suppression scheme, which was published previously (4), is related to another method that was presented independently as part of TSENSE. While the two are equivalent at accelerations < or = 2.0, the present approach is shown here to be significantly superior at accelerations > 2.0, with up to double the artifact suppression at high accelerations. Furthermore, a slight modification of Cartesian SENSE is introduced, which allows departures from purely Cartesian sampling grids. This technique, termed variable-density SENSE (vdSENSE), allows the variable-density data required by SHRUG to be reconstructed with the simplicity and fast processing of Cartesian SENSE. UNFOLD-SENSE is given by the combination of SHRUG for sensitivity mapping, vdSENSE for reconstruction, and UNFOLD for artifact/amplified noise suppression. The method was implemented, with online reconstruction, on both an SSFP and a myocardium-perfusion sequence. The results from six patients scanned with UNFOLD-SENSE are presented.
Optimal mapping of neural-network learning on message-passing multicomputers
NASA Technical Reports Server (NTRS)
Chu, Lon-Chan; Wah, Benjamin W.
1992-01-01
A minimization of learning-algorithm completion time is sought in the present optimal-mapping study of the learning process in multilayer feed-forward artificial neural networks (ANNs) for message-passing multicomputers. A novel approximation algorithm for mappings of this kind is derived from observations of the dominance of a parallel ANN algorithm over its communication time. Attention is given to both static and dynamic mapping schemes for systems with static and dynamic background workloads, as well as to experimental results obtained for simulated mappings on multicomputers with dynamic background workloads.
Multitasking the INS3D-LU code on the Cray Y-MP
NASA Technical Reports Server (NTRS)
Fatoohi, Rod; Yoon, Seokkwan
1991-01-01
This paper presents the results of multitasking the INS3D-LU code on eight processors. The code is a full Navier-Stokes solver for incompressible fluid in three dimensional generalized coordinates using a lower-upper symmetric-Gauss-Seidel implicit scheme. This code has been fully vectorized on oblique planes of sweep and parallelized using autotasking with some directives and minor modifications. The timing results for five grid sizes are presented and analyzed. The code has achieved a processing rate of over one Gflops.
Large-Constraint-Length, Fast Viterbi Decoder
NASA Technical Reports Server (NTRS)
Collins, O.; Dolinar, S.; Hsu, In-Shek; Pollara, F.; Olson, E.; Statman, J.; Zimmerman, G.
1990-01-01
Scheme for efficient interconnection makes VLSI design feasible. Concept for fast Viterbi decoder provides for processing of convolutional codes of constraint length K up to 15 and rates of 1/2 to 1/6. Fully parallel (but bit-serial) architecture developed for decoder of K = 7 implemented in single dedicated VLSI circuit chip. Contains six major functional blocks. VLSI circuits perform branch metric computations, add-compare-select operations, and then store decisions in traceback memory. Traceback processor reads appropriate memory locations and puts out decoded bits. Used as building block for decoders of larger K.
High Rate Digital Demodulator ASIC
NASA Technical Reports Server (NTRS)
Ghuman, Parminder; Sheikh, Salman; Koubek, Steve; Hoy, Scott; Gray, Andrew
1998-01-01
The architecture of High Rate (600 Mega-bits per second) Digital Demodulator (HRDD) ASIC capable of demodulating BPSK and QPSK modulated data is presented in this paper. The advantages of all-digital processing include increased flexibility and reliability with reduced reproduction costs. Conventional serial digital processing would require high processing rates necessitating a hardware implementation in other than CMOS technology such as Gallium Arsenide (GaAs) which has high cost and power requirements. It is more desirable to use CMOS technology with its lower power requirements and higher gate density. However, digital demodulation of high data rates in CMOS requires parallel algorithms to process the sampled data at a rate lower than the data rate. The parallel processing algorithms described here were developed jointly by NASA's Goddard Space Flight Center (GSFC) and the Jet Propulsion Laboratory (JPL). The resulting all-digital receiver has the capability to demodulate BPSK, QPSK, OQPSK, and DQPSK at data rates in excess of 300 Mega-bits per second (Mbps) per channel. This paper will provide an overview of the parallel architecture and features of the HRDR ASIC. In addition, this paper will provide an over-view of the implementation of the hardware architectures used to create flexibility over conventional high rate analog or hybrid receivers. This flexibility includes a wide range of data rates, modulation schemes, and operating environments. In conclusion it will be shown how this high rate digital demodulator can be used with an off-the-shelf A/D and a flexible analog front end, both of which are numerically computer controlled, to produce a very flexible, low cost high rate digital receiver.
Workflow of the Grover algorithm simulation incorporating CUDA and GPGPU
NASA Astrophysics Data System (ADS)
Lu, Xiangwen; Yuan, Jiabin; Zhang, Weiwei
2013-09-01
The Grover quantum search algorithm, one of only a few representative quantum algorithms, can speed up many classical algorithms that use search heuristics. No true quantum computer has yet been developed. For the present, simulation is one effective means of verifying the search algorithm. In this work, we focus on the simulation workflow using a compute unified device architecture (CUDA). Two simulation workflow schemes are proposed. These schemes combine the characteristics of the Grover algorithm and the parallelism of general-purpose computing on graphics processing units (GPGPU). We also analyzed the optimization of memory space and memory access from this perspective. We implemented four programs on CUDA to evaluate the performance of schemes and optimization. Through experimentation, we analyzed the organization of threads suited to Grover algorithm simulations, compared the storage costs of the four programs, and validated the effectiveness of optimization. Experimental results also showed that the distinguished program on CUDA outperformed the serial program of libquantum on a CPU with a speedup of up to 23 times (12 times on average), depending on the scale of the simulation.
Evaluation of concurrent priority queue algorithms. Technical report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huang, Q.
1991-02-01
The priority queue is a fundamental data structure that is used in a large variety of parallel algorithms, such as multiprocessor scheduling and parallel best-first search of state-space graphs. This thesis addresses the design and experimental evaluation of two novel concurrent priority queues: a parallel Fibonacci heap and a concurrent priority pool, and compares them with the concurrent binary heap. The parallel Fibonacci heap is based on the sequential Fibonacci heap, which is theoretically the most efficient data structure for sequential priority queues. This scheme not only preserves the efficient operation time bounds of its sequential counterpart, but also hasmore » very low contention by distributing locks over the entire data structure. The experimental results show its linearly scalable throughput and speedup up to as many processors as tested (currently 18). A concurrent access scheme for a doubly linked list is described as part of the implementation of the parallel Fibonacci heap. The concurrent priority pool is based on the concurrent B-tree and the concurrent pool. The concurrent priority pool has the highest throughput among the priority queues studied. Like the parallel Fibonacci heap, the concurrent priority pool scales linearly up to as many processors as tested. The priority queues are evaluated in terms of throughput and speedup. Some applications of concurrent priority queues such as the vertex cover problem and the single source shortest path problem are tested.« less
Intel Xeon Phi accelerated Weather Research and Forecasting (WRF) Goddard microphysics scheme
NASA Astrophysics Data System (ADS)
Mielikainen, J.; Huang, B.; Huang, A. H.-L.
2014-12-01
The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. The WRF development is a done in collaboration around the globe. Furthermore, the WRF is used by academic atmospheric scientists, weather forecasters at the operational centers and so on. The WRF contains several physics components. The most time consuming one is the microphysics. One microphysics scheme is the Goddard cloud microphysics scheme. It is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The Goddard microphysics scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the Goddard scheme code. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs rather than just kernels as the GPU does. The MIC coprocessor supports all important Intel development tools. Thus, the development environment is one familiar to a vast number of CPU developers. Although, getting a maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discussed in this paper. The results show that the optimizations improved performance of Goddard microphysics scheme on Xeon Phi 7120P by a factor of 4.7×. In addition, the optimizations reduced the Goddard microphysics scheme's share of the total WRF processing time from 20.0 to 7.5%. Furthermore, the same optimizations improved performance on Intel Xeon E5-2670 by a factor of 2.8× compared to the original code.
Multitasking for flows about multiple body configurations using the chimera grid scheme
NASA Technical Reports Server (NTRS)
Dougherty, F. C.; Morgan, R. L.
1987-01-01
The multitasking of a finite-difference scheme using multiple overset meshes is described. In this chimera, or multiple overset mesh approach, a multiple body configuration is mapped using a major grid about the main component of the configuration, with minor overset meshes used to map each additional component. This type of code is well suited to multitasking. Both steady and unsteady two dimensional computations are run on parallel processors on a CRAY-X/MP 48, usually with one mesh per processor. Flow field results are compared with single processor results to demonstrate the feasibility of running multiple mesh codes on parallel processors and to show the increase in efficiency.
Parallel solution of high-order numerical schemes for solving incompressible flows
NASA Technical Reports Server (NTRS)
Milner, Edward J.; Lin, Avi; Liou, May-Fun; Blech, Richard A.
1993-01-01
A new parallel numerical scheme for solving incompressible steady-state flows is presented. The algorithm uses a finite-difference approach to solving the Navier-Stokes equations. The algorithms are scalable and expandable. They may be used with only two processors or with as many processors as are available. The code is general and expandable. Any size grid may be used. Four processors of the NASA LeRC Hypercluster were used to solve for steady-state flow in a driven square cavity. The Hypercluster was configured in a distributed-memory, hypercube-like architecture. By using a 50-by-50 finite-difference solution grid, an efficiency of 74 percent (a speedup of 2.96) was obtained.
Parallel Logic Programming Architecture
1990-04-01
Section 3.1. 3.1. A STATIC ALLOCATION SCHEME (SAS) Methods that have been used for decomposing distributed problems in artificial intelligence...multiple agents, knowledge organization and allocation, and cooperative parallel execution. These difficulties are common to distributed artificial ...for the following reasons. First, intellegent backtracking requires much more bookkeeping and is therefore more costly during consult-time and during
Large-eddy simulation/Reynolds-averaged Navier-Stokes hybrid schemes for high speed flows
NASA Astrophysics Data System (ADS)
Xiao, Xudong
Three LES/RANS hybrid schemes have been proposed for the prediction of high speed separated flows. Each method couples the k-zeta (Enstrophy) BANS model with an LES subgrid scale one-equation model by using a blending function that is coordinate system independent. Two of these functions are based on turbulence dissipation length scale and grid size, while the third one has no explicit dependence on the grid. To implement the LES/RANS hybrid schemes, a new rescaling-reintroducing method is used to generate time-dependent turbulent inflow conditions. The hybrid schemes have been tested on a Mach 2.88 flow over 25 degree compression-expansion ramp and a Mach 2.79 flow over 20 degree compression ramp. A special computation procedure has been designed to prevent the separation zone from expanding upstream to the recycle-plane. The code is parallelized using Message Passing Interface (MPI) and is optimized for running on IBM-SP3 parallel machine. The scheme was validated first for a flat plate. It was shown that the blending function has to be monotonic to prevent the RANS region from appearing in the LES region. In the 25 deg ramp case, the hybrid schemes provided better agreement with experiment in the recovery region. Grid refinement studies demonstrated the importance of using a grid independent blend function and further improvement with experiment in the recovery region. In the 20 deg ramp case, with a relatively finer grid, the hybrid scheme characterized by grid independent blending function well predicted the flow field in both the separation region and the recovery region. Therefore, with "appropriately" fine grid, current hybrid schemes are promising for the simulation of shock wave/boundary layer interaction problems.
Gpu Implementation of a Viscous Flow Solver on Unstructured Grids
NASA Astrophysics Data System (ADS)
Xu, Tianhao; Chen, Long
2016-06-01
Graphics processing units have gained popularities in scientific computing over past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculations, therefore a latest GPU card is preferable of which the peak computing performance and memory bandwidth are much better than a contemporary high-end CPU. We herein focus on the detailed implementation of our GPU targeting Reynolds-averaged Navier-Stokes equations solver based on finite-volume method. The solver employs a vertex-centered scheme on unstructured grids for the sake of being capable of handling complex topologies. Multiple optimizations are carried out to improve the memory accessing performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using explicit Runge-Kutta scheme. The solver with GPU acceleration in this paper is demonstrated to have competitive advantages over the CPU targeting one.
NASA Astrophysics Data System (ADS)
Degtyarev, Alexander; Khramushin, Vasily
2016-02-01
The paper deals with the computer implementation of direct computational experiments in fluid mechanics, constructed on the basis of the approach developed by the authors. The proposed approach allows the use of explicit numerical scheme, which is an important condition for increasing the effciency of the algorithms developed by numerical procedures with natural parallelism. The paper examines the main objects and operations that let you manage computational experiments and monitor the status of the computation process. Special attention is given to a) realization of tensor representations of numerical schemes for direct simulation; b) realization of representation of large particles of a continuous medium motion in two coordinate systems (global and mobile); c) computing operations in the projections of coordinate systems, direct and inverse transformation in these systems. Particular attention is paid to the use of hardware and software of modern computer systems.
2nd Generation QUATARA Flight Computer Project
NASA Technical Reports Server (NTRS)
Falker, Jay; Keys, Andrew; Fraticelli, Jose Molina; Capo-Iugo, Pedro; Peeples, Steven
2015-01-01
Single core flight computer boards have been designed, developed, and tested (DD&T) to be flown in small satellites for the last few years. In this project, a prototype flight computer will be designed as a distributed multi-core system containing four microprocessors running code in parallel. This flight computer will be capable of performing multiple computationally intensive tasks such as processing digital and/or analog data, controlling actuator systems, managing cameras, operating robotic manipulators and transmitting/receiving from/to a ground station. In addition, this flight computer will be designed to be fault tolerant by creating both a robust physical hardware connection and by using a software voting scheme to determine the processor's performance. This voting scheme will leverage on the work done for the Space Launch System (SLS) flight software. The prototype flight computer will be constructed with Commercial Off-The-Shelf (COTS) components which are estimated to survive for two years in a low-Earth orbit.
Cartesian Off-Body Grid Adaption for Viscous Time- Accurate Flow Simulation
NASA Technical Reports Server (NTRS)
Buning, Pieter G.; Pulliam, Thomas H.
2011-01-01
An improved solution adaption capability has been implemented in the OVERFLOW overset grid CFD code. Building on the Cartesian off-body approach inherent in OVERFLOW and the original adaptive refinement method developed by Meakin, the new scheme provides for automated creation of multiple levels of finer Cartesian grids. Refinement can be based on the undivided second-difference of the flow solution variables, or on a specific flow quantity such as vorticity. Coupled with load-balancing and an inmemory solution interpolation procedure, the adaption process provides very good performance for time-accurate simulations on parallel compute platforms. A method of using refined, thin body-fitted grids combined with adaption in the off-body grids is presented, which maximizes the part of the domain subject to adaption. Two- and three-dimensional examples are used to illustrate the effectiveness and performance of the adaption scheme.
n-body simulations using message passing parallel computers.
NASA Astrophysics Data System (ADS)
Grama, A. Y.; Kumar, V.; Sameh, A.
The authors present new parallel formulations of the Barnes-Hut method for n-body simulations on message passing computers. These parallel formulations partition the domain efficiently incurring minimal communication overhead. This is in contrast to existing schemes that are based on sorting a large number of keys or on the use of global data structures. The new formulations are augmented by alternate communication strategies which serve to minimize communication overhead. The impact of these communication strategies is experimentally studied. The authors report on experimental results obtained from an astrophysical simulation on an nCUBE2 parallel computer.
NASA Astrophysics Data System (ADS)
Ying, Jia-ju; Chen, Yu-dan; Liu, Jie; Wu, Dong-sheng; Lu, Jun
2016-10-01
The maladjustment of photoelectric instrument binocular optical axis parallelism will affect the observe effect directly. A binocular optical axis parallelism digital calibration system is designed. On the basis of the principle of optical axis binocular photoelectric instrument calibration, the scheme of system is designed, and the binocular optical axis parallelism digital calibration system is realized, which include four modules: multiband parallel light tube, optical axis translation, image acquisition system and software system. According to the different characteristics of thermal infrared imager and low-light-level night viewer, different algorithms is used to localize the center of the cross reticle. And the binocular optical axis parallelism calibration is realized for calibrating low-light-level night viewer and thermal infrared imager.
NASA Pioneer: Venus reverse playback telemetry program TR 78-2
NASA Technical Reports Server (NTRS)
Modestino, J. W.; Daut, D. G.; Vickers, A. L.; Matis, K. R.
1978-01-01
During the entry of the Pioneer Venus Atmospheric Probes into the Venus atmosphere, there were several events (RF blackout and data rate changes) which caused the ground receiving equipment to lose lock on the signal. This caused periods of data loss immediately following each one of these disturbing events which lasted until all the ground receiving units (receiver, subcarrier demodulator, symbol synchronizer, and sequential decoder) acquired lock once more. A scheme to recover these data by off-line data processing was implemented. This scheme consisted of receiving the S band signals from the probes with an open loop reciever (requiring no lock up on the signal) in parallel with the closed loop receivers of the real time receiving equipment, down converting the signals to baseband, and recording them on an analog recorder. The off-line processing consisted of playing the analog recording in the reverse direction (starting with the end of the tape) up, converting the signal to S-band, feeding the signal into the "real time" receiving system and recording on digital tape, the soft decisions from the symbol synchronizer.
NASA Astrophysics Data System (ADS)
Huang, Melin; Huang, Bormin; Huang, Allen H.
2014-10-01
For weather forecasting and research, the Weather Research and Forecasting (WRF) model has been developed, consisting of several components such as dynamic solvers and physical simulation modules. WRF includes several Land- Surface Models (LSMs). The LSMs use atmospheric information, the radiative and precipitation forcing from the surface layer scheme, the radiation scheme, and the microphysics/convective scheme all together with the land's state variables and land-surface properties, to provide heat and moisture fluxes over land and sea-ice points. The WRF 5-layer thermal diffusion simulation is an LSM based on the MM5 5-layer soil temperature model with an energy budget that includes radiation, sensible, and latent heat flux. The WRF LSMs are very suitable for massively parallel computation as there are no interactions among horizontal grid points. The features, efficient parallelization and vectorization essentials, of Intel Many Integrated Core (MIC) architecture allow us to optimize this WRF 5-layer thermal diffusion scheme. In this work, we present the results of the computing performance on this scheme with Intel MIC architecture. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.1x. Accordingly, the same CPU-based optimizations improved the performance on Intel Xeon E5- 2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
Magnetohydrodynamics with GAMER
NASA Astrophysics Data System (ADS)
Zhang, Ui-Han; Schive, Hsi-Yu; Chiueh, Tzihong
2018-06-01
GAMER, a parallel Graphic-processing-unit-accelerated Adaptive-MEsh-Refinement (AMR) hydrodynamic code, has been extended to support magnetohydrodynamics (MHD) with both the corner-transport-upwind and MUSCL-Hancock schemes and the constraint transport technique. The divergent preserving operator for AMR has been applied to reinforce the divergence-free constraint on the magnetic field. GAMER-MHD has fully exploited the concurrent executions between the graphic process unit (GPU) MHD solver and other central processing unit computation pertinent to AMR. We perform various standard tests to demonstrate that GAMER-MHD is both second-order accurate and robust, producing results as accurate as those given by high-resolution uniform-grid runs. We also explore a new 3D MHD test, where the magnetic field assumes the Arnold–Beltrami–Childress configuration, temporarily becomes turbulent with current sheets, and finally settles to a lowest-energy equilibrium state. This 3D problem is adopted for the performance test of GAMER-MHD. The single-GPU performance reaches 1.2 × 108 and 5.5 × 107 cell updates per second for the single- and double-precision calculations, respectively, on Tesla P100. We also demonstrate a parallel efficiency of ∼70% for both weak and strong scaling using 1024 XK nodes on the Blue Waters supercomputers.
Sub-GeV dark matter detection with electron recoils in carbon nanotubes
NASA Astrophysics Data System (ADS)
Cavoto, G.; Luchetta, F.; Polosa, A. D.
2018-01-01
Directional detection of Dark Matter particles (DM) in the MeV mass range could be accomplished by studying electron recoils in large arrays of parallel carbon nanotubes. In a scattering process with a lattice electron, a DM particle might transfer sufficient energy to eject it from the nanotube surface. An external electric field is added to drive the electron from the open ends of the array to the detection region. The anisotropic response of this detection scheme, as a function of the orientation of the target with respect to the DM wind, is calculated, and it is concluded that no direct measurement of the electron ejection angle is needed to explore significant regions of the light DM exclusion plot. A compact sensor, in which the cathode element is substituted with a dense array of parallel carbon nanotubes, could serve as the basic detection unit.
A hybrid dynamic harmony search algorithm for identical parallel machines scheduling
NASA Astrophysics Data System (ADS)
Chen, Jing; Pan, Quan-Ke; Wang, Ling; Li, Jun-Qing
2012-02-01
In this article, a dynamic harmony search (DHS) algorithm is proposed for the identical parallel machines scheduling problem with the objective to minimize makespan. First, an encoding scheme based on a list scheduling rule is developed to convert the continuous harmony vectors to discrete job assignments. Second, the whole harmony memory (HM) is divided into multiple small-sized sub-HMs, and each sub-HM performs evolution independently and exchanges information with others periodically by using a regrouping schedule. Third, a novel improvisation process is applied to generate a new harmony by making use of the information of harmony vectors in each sub-HM. Moreover, a local search strategy is presented and incorporated into the DHS algorithm to find promising solutions. Simulation results show that the hybrid DHS (DHS_LS) is very competitive in comparison to its competitors in terms of mean performance and average computational time.
Parallel computation using boundary elements in solid mechanics
NASA Technical Reports Server (NTRS)
Chien, L. S.; Sun, C. T.
1990-01-01
The inherent parallelism of the boundary element method is shown. The boundary element is formulated by assuming the linear variation of displacements and tractions within a line element. Moreover, MACSYMA symbolic program is employed to obtain the analytical results for influence coefficients. Three computational components are parallelized in this method to show the speedup and efficiency in computation. The global coefficient matrix is first formed concurrently. Then, the parallel Gaussian elimination solution scheme is applied to solve the resulting system of equations. Finally, and more importantly, the domain solutions of a given boundary value problem are calculated simultaneously. The linear speedups and high efficiencies are shown for solving a demonstrated problem on Sequent Symmetry S81 parallel computing system.
NASA Astrophysics Data System (ADS)
Ajiatmo, Dwi; Robandi, Imam
2017-03-01
This paper proposes a control scheme photovoltaic, battery and super capacitor connected in parallel for use in a solar vehicle. Based on the features of battery charging, the control scheme consists of three modes, namely, mode dynamic irradian, constant load mode and constant voltage charging mode. The shift of the three modes can be realized by controlling the duty cycle of the mosffet Boost converter system. Meanwhile, the high voltage which is more suitable for the application can be obtained. Compared with normal charging method with parallel connected current limiting detention and charging method with dynamic irradian mode, constant load mode and constant voltage charging mode, the control scheme is proposed to shorten the charging time and increase the use of power generated from the PV array. From the simulation results and analysis conducted to determine the performance of the system in state transient and steady-state by using simulation software Matlab / Simulink. Response simulation results demonstrate the suitability of the proposed concept.
GPU acceleration of Runge Kutta-Fehlberg and its comparison with Dormand-Prince method
NASA Astrophysics Data System (ADS)
Seen, Wo Mei; Gobithaasan, R. U.; Miura, Kenjiro T.
2014-07-01
There is a significant reduction of processing time and speedup of performance in computer graphics with the emergence of Graphic Processing Units (GPUs). GPUs have been developed to surpass Central Processing Unit (CPU) in terms of performance and processing speed. This evolution has opened up a new area in computing and researches where highly parallel GPU has been used for non-graphical algorithms. Physical or phenomenal simulations and modelling can be accelerated through General Purpose Graphic Processing Units (GPGPU) and Compute Unified Device Architecture (CUDA) implementations. These phenomena can be represented with mathematical models in the form of Ordinary Differential Equations (ODEs) which encompasses the gist of change rate between independent and dependent variables. ODEs are numerically integrated over time in order to simulate these behaviours. The classical Runge-Kutta (RK) scheme is the common method used to numerically solve ODEs. The Runge Kutta Fehlberg (RKF) scheme has been specially developed to provide an estimate of the principal local truncation error at each step, known as embedding estimate technique. This paper delves into the implementation of RKF scheme for GPU devices and compares its result with Dorman Prince method. A pseudo code is developed to show the implementation in detail. Hence, practitioners will be able to understand the data allocation in GPU, formation of RKF kernels and the flow of data to/from GPU-CPU upon RKF kernel evaluation. The pseudo code is then written in C Language and two ODE models are executed to show the achievable speedup as compared to CPU implementation. The accuracy and efficiency of the proposed implementation method is discussed in the final section of this paper.
Log-less metadata management on metadata server for parallel file systems.
Liao, Jianwei; Xiao, Guoqiang; Peng, Xiaoning
2014-01-01
This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage for achieving highly available metadata service, as well as better performance improvement in metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than that brought by the metadata server, while it adopts logging or journaling to yield highly available metadata service. The experimental results show that this newly proposed mechanism can significantly improve the speed of metadata processing and render a better I/O data throughput, in contrast to conventional metadata management schemes, that is, logging or journaling on MDS. Besides, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients, when the metadata server has crashed or gone into nonoperational state exceptionally.
Log-Less Metadata Management on Metadata Server for Parallel File Systems
Xiao, Guoqiang; Peng, Xiaoning
2014-01-01
This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage for achieving highly available metadata service, as well as better performance improvement in metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than that brought by the metadata server, while it adopts logging or journaling to yield highly available metadata service. The experimental results show that this newly proposed mechanism can significantly improve the speed of metadata processing and render a better I/O data throughput, in contrast to conventional metadata management schemes, that is, logging or journaling on MDS. Besides, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients, when the metadata server has crashed or gone into nonoperational state exceptionally. PMID:24892093
Fuzzy Logic Based Autonomous Parallel Parking System with Kalman Filtering
NASA Astrophysics Data System (ADS)
Panomruttanarug, Benjamas; Higuchi, Kohji
This paper presents an emulation of fuzzy logic control schemes for an autonomous parallel parking system in a backward maneuver. There are four infrared sensors sending the distance data to a microcontroller for generating an obstacle-free parking path. Two of them mounted on the front and rear wheels on the parking side are used as the inputs to the fuzzy rules to calculate a proper steering angle while backing. The other two attached to the front and rear ends serve for avoiding collision with other cars along the parking space. At the end of parking processes, the vehicle will be in line with other parked cars and positioned in the middle of the free space. Fuzzy rules are designed based upon a wall following process. Performance of the infrared sensors is improved using Kalman filtering. The design method needs extra information from ultrasonic sensors. Starting from modeling the ultrasonic sensor in 1-D state space forms, one makes use of the infrared sensor as a measurement to update the predicted values. Experimental results demonstrate the effectiveness of sensor improvement.
Atoche, Alejandro Castillo; Castillo, Javier Vázquez
2012-01-01
A high-speed dual super-systolic core for reconstructive signal processing (SP) operations consists of a double parallel systolic array (SA) machine in which each processing element of the array is also conceptualized as another SA in a bit-level fashion. In this study, we addressed the design of a high-speed dual super-systolic array (SSA) core for the enhancement/reconstruction of remote sensing (RS) imaging of radar/synthetic aperture radar (SAR) sensor systems. The selected reconstructive SP algorithms are efficiently transformed in their parallel representation and then, they are mapped into an efficient high performance embedded computing (HPEC) architecture in reconfigurable Xilinx field programmable gate array (FPGA) platforms. As an implementation test case, the proposed approach was aggregated in a HW/SW co-design scheme in order to solve the nonlinear ill-posed inverse problem of nonparametric estimation of the power spatial spectrum pattern (SSP) from a remotely sensed scene. We show how such dual SSA core, drastically reduces the computational load of complex RS regularization techniques achieving the required real-time operational mode. PMID:22736964
Mozaffari, Brian
2014-01-01
Based on the notion that the brain is equipped with a hierarchical organization, which embodies environmental contingencies across many time scales, this paper suggests that the medial temporal lobe (MTL)—located deep in the hierarchy—serves as a bridge connecting supra- to infra—MTL levels. Bridging the upper and lower regions of the hierarchy provides a parallel architecture that optimizes information flow between upper and lower regions to aid attention, encoding, and processing of quick complex visual phenomenon. Bypassing intermediate hierarchy levels, information conveyed through the MTL “bridge” allows upper levels to make educated predictions about the prevailing context and accordingly select lower representations to increase the efficiency of predictive coding throughout the hierarchy. This selection or activation/deactivation is associated with endogenous attention. In the event that these “bridge” predictions are inaccurate, this architecture enables the rapid encoding of novel contingencies. A review of hierarchical models in relation to memory is provided along with a new theory, Medial-temporal-lobe Conduit for Parallel Connectivity (MCPC). In this scheme, consolidation is considered as a secondary process, occurring after a MTL-bridged connection, which eventually allows upper and lower levels to access each other directly. With repeated reactivations, as contingencies become consolidated, less MTL activity is predicted. Finally, MTL bridging may aid processing transient but structured perceptual events, by allowing communication between upper and lower levels without calling on intermediate levels of representation. PMID:25426036
A phase-based stereo vision system-on-a-chip.
Díaz, Javier; Ros, Eduardo; Sabatini, Silvio P; Solari, Fabio; Mota, Sonia
2007-02-01
A simple and fast technique for depth estimation based on phase measurement has been adopted for the implementation of a real-time stereo system with sub-pixel resolution on an FPGA device. The technique avoids the attendant problem of phase warping. The designed system takes full advantage of the inherent processing parallelism and segmentation capabilities of FPGA devices to achieve a computation speed of 65megapixels/s, which can be arranged with a customized frame-grabber module to process 211frames/s at a size of 640x480 pixels. The processing speed achieved is higher than conventional camera frame rates, thus allowing the system to extract multiple estimations and be used as a platform to evaluate integration schemes of a population of neurons without increasing hardware resource demands.
Gyrokinetic Magnetohydrodynamics and the Associated Equilibrium
NASA Astrophysics Data System (ADS)
Lee, W. W.; Hudson, S. R.; Ma, C. H.
2017-10-01
A proposed scheme for the calculations of gyrokinetic MHD and its associated equilibrium is discussed related a recent paper on the subject. The scheme is based on the time-dependent gyrokinetic vorticity equation and parallel Ohm's law, as well as the associated gyrokinetic Ampere's law. This set of equations, in terms of the electrostatic potential, ϕ, and the vector potential, ϕ , supports both spatially varying perpendicular and parallel pressure gradients and their associated currents. The MHD equilibrium can be reached when ϕ -> 0 and A becomes constant in time, which, in turn, gives ∇ . (J|| +J⊥) = 0 and the associated magnetic islands. Examples in simple cylindrical geometry will be given. The present work is partially supported by US DoE Grant DE-AC02-09CH11466.
A Strassen-Newton algorithm for high-speed parallelizable matrix inversion
NASA Technical Reports Server (NTRS)
Bailey, David H.; Ferguson, Helaman R. P.
1988-01-01
Techniques are described for computing matrix inverses by algorithms that are highly suited to massively parallel computation. The techniques are based on an algorithm suggested by Strassen (1969). Variations of this scheme use matrix Newton iterations and other methods to improve the numerical stability while at the same time preserving a very high level of parallelism. One-processor Cray-2 implementations of these schemes range from one that is up to 55 percent faster than a conventional library routine to one that is slower than a library routine but achieves excellent numerical stability. The problem of computing the solution to a single set of linear equations is discussed, and it is shown that this problem can also be solved efficiently using these techniques.
An Asynchronous Many-Task Implementation of In-Situ Statistical Analysis using Legion.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pebay, Philippe Pierre; Bennett, Janine Camille
2015-11-01
In this report, we propose a framework for the design and implementation of in-situ analy- ses using an asynchronous many-task (AMT) model, using the Legion programming model together with the MiniAero mini-application as a surrogate for full-scale parallel scientific computing applications. The bulk of this work consists of converting the Learn/Derive/Assess model which we had initially developed for parallel statistical analysis using MPI [PTBM11], from a SPMD to an AMT model. In this goal, we propose an original use of the concept of Legion logical regions as a replacement for the parallel communication schemes used for the only operation ofmore » the statistics engines that require explicit communication. We then evaluate this proposed scheme in a shared memory environment, using the Legion port of MiniAero as a proxy for a full-scale scientific application, as a means to provide input data sets of variable size for the in-situ statistical analyses in an AMT context. We demonstrate in particular that the approach has merit, and warrants further investigation, in collaboration with ongoing efforts to improve the overall parallel performance of the Legion system.« less
Son of IXION: A Steady State Centrifugally Confined Plasma for Fusion*
NASA Astrophysics Data System (ADS)
Hassam, Adil
1996-11-01
A magnetic confinement scheme in which the inertial, u.grad(u), forces effect parallel confinement is proposed. The basic geometry is mirror-like as far as the poloidal field goes or, more simply, multipole (FM-1) type. The rotation is toroidal in this geometry. A supersonic rotation can effect complete parallel confinement, with the usual magnetic mirror force rendered irrelevant. The rotation shear, in addition, aids in the suppression of the flute mode. This suppression is not complete which indicates the addition of a toroidal field, at maximum of the order of the poloidal field. We show that at rotation in excess of Mach 3, the parallel particle and heat losses can be minimized to below the Lawson breakeven point. The crossfield transport can be expected to be better than tokamaks on account of the large velocity shear. Other advantages of the scheme are that it is steady state and disruption free. An exploratory experiment that tests equilibrium, parallel detachment, and MHD stability is proposed. The concept resembles earlier (Geneva, 1958) experiments on "homopolar generators" and a mirror configuration called IXION. Ixion, Greek mythological king, was forever strapped to a rotating, flaming wheel. *Work supported by DOE
Phased array ghost elimination.
Kellman, Peter; McVeigh, Elliot R
2006-05-01
Parallel imaging may be applied to cancel ghosts caused by a variety of distortion mechanisms, including distortions such as off-resonance or local flow, which are space variant. Phased array combining coefficients may be calculated that null ghost artifacts at known locations based on a constrained optimization, which optimizes SNR subject to the nulling constraint. The resultant phased array ghost elimination (PAGE) technique is similar to the method known as sensitivity encoding (SENSE) used for accelerated imaging; however, in this formulation is applied to full field-of-view (FOV) images. The phased array method for ghost elimination may result in greater flexibility in designing acquisition strategies. For example, in multi-shot EPI applications ghosts are typically mitigated by the use of an interleaved phase encode acquisition order. An alternative strategy is to use a sequential, non-interleaved phase encode order and cancel the resultant ghosts using PAGE parallel imaging. Cancellation of ghosts by means of phased array processing makes sequential, non-interleaved phase encode acquisition order practical, and permits a reduction in repetition time, TR, by eliminating the need for echo-shifting. Sequential, non-interleaved phase encode order has benefits of reduced distortion due to off-resonance, in-plane flow and EPI delay misalignment. Furthermore, the use of EPI with PAGE has inherent fat-water separation and has been used to provide off-resonance correction using a technique referred to as lipid elimination with an echo-shifting N/2-ghost acquisition (LEENA), and may further generalized using the multi-point Dixon method. Other applications of PAGE include cancelling ghosts which arise due to amplitude or phase variation during the approach to steady state. Parallel imaging requires estimates of the complex coil sensitivities. In vivo estimates may be derived by temporally varying the phase encode ordering to obtain a full k-space dataset in a scheme similar to the autocalibrating TSENSE method. This scheme is a generalization of the UNFOLD method used for removing aliasing in undersampled acquisitions. The more general scheme may be used to modulate each EPI ghost image to a separate temporal frequency as described in this paper. Copyright (c) 2006 John Wiley & Sons, Ltd.
Phased array ghost elimination
Kellman, Peter; McVeigh, Elliot R.
2007-01-01
Parallel imaging may be applied to cancel ghosts caused by a variety of distortion mechanisms, including distortions such as off-resonance or local flow, which are space variant. Phased array combining coefficients may be calculated that null ghost artifacts at known locations based on a constrained optimization, which optimizes SNR subject to the nulling constraint. The resultant phased array ghost elimination (PAGE) technique is similar to the method known as sensitivity encoding (SENSE) used for accelerated imaging; however, in this formulation is applied to full field-of-view (FOV) images. The phased array method for ghost elimination may result in greater flexibility in designing acquisition strategies. For example, in multi-shot EPI applications ghosts are typically mitigated by the use of an interleaved phase encode acquisition order. An alternative strategy is to use a sequential, non-interleaved phase encode order and cancel the resultant ghosts using PAGE parallel imaging. Cancellation of ghosts by means of phased array processing makes sequential, non-interleaved phase encode acquisition order practical, and permits a reduction in repetition time, TR, by eliminating the need for echo-shifting. Sequential, non-interleaved phase encode order has benefits of reduced distortion due to off-resonance, in-plane flow and EPI delay misalignment. Furthermore, the use of EPI with PAGE has inherent fat-water separation and has been used to provide off-resonance correction using a technique referred to as lipid elimination with an echo-shifting N/2-ghost acquisition (LEENA), and may further generalized using the multi-point Dixon method. Other applications of PAGE include cancelling ghosts which arise due to amplitude or phase variation during the approach to steady state. Parallel imaging requires estimates of the complex coil sensitivities. In vivo estimates may be derived by temporally varying the phase encode ordering to obtain a full k-space dataset in a scheme similar to the autocalibrating TSENSE method. This scheme is a generalization of the UNFOLD method used for removing aliasing in undersampled acquisitions. The more general scheme may be used to modulate each EPI ghost image to a separate temporal frequency as described in this paper. PMID:16705636
LEAP: An Innovative Direction Dependent Ionospheric Calibration Scheme for Low Frequency Arrays
NASA Astrophysics Data System (ADS)
Rioja, María J.; Dodson, Richard; Franzen, Thomas M. O.
2018-05-01
The ambitious scientific goals of the SKA require a matching capability for calibration of atmospheric propagation errors, which contaminate the observed signals. We demonstrate a scheme for correcting the direction-dependent ionospheric and instrumental phase effects at the low frequencies and with the wide fields of view planned for SKA-Low. It leverages bandwidth smearing, to filter-out signals from off-axis directions, allowing the measurement of the direction-dependent antenna-based gains in the visibility domain; by doing this towards multiple directions it is possible to calibrate across wide fields of view. This strategy removes the need for a global sky model, therefore all directions are independent. We use MWA results at 88 and 154 MHz under various weather conditions to characterise the performance and applicability of the technique. We conclude that this method is suitable to measure and correct for temporal fluctuations and direction-dependent spatial ionospheric phase distortions on a wide range of scales: both larger and smaller than the array size. The latter are the most intractable and pose a major challenge for future instruments. Moreover this scheme is an embarrassingly parallel process, as multiple directions can be processed independently and simultaneously. This is an important consideration for the SKA, where the current planned architecture is one of compute-islands with limited interconnects. Current implementation of the algorithm and on-going developments are discussed.
A parallel computing engine for a class of time critical processes.
Nabhan, T M; Zomaya, A Y
1997-01-01
This paper focuses on the efficient parallel implementation of systems of numerically intensive nature over loosely coupled multiprocessor architectures. These analytical models are of significant importance to many real-time systems that have to meet severe time constants. A parallel computing engine (PCE) has been developed in this work for the efficient simplification and the near optimal scheduling of numerical models over the different cooperating processors of the parallel computer. First, the analytical system is efficiently coded in its general form. The model is then simplified by using any available information (e.g., constant parameters). A task graph representing the interconnections among the different components (or equations) is generated. The graph can then be compressed to control the computation/communication requirements. The task scheduler employs a graph-based iterative scheme, based on the simulated annealing algorithm, to map the vertices of the task graph onto a Multiple-Instruction-stream Multiple-Data-stream (MIMD) type of architecture. The algorithm uses a nonanalytical cost function that properly considers the computation capability of the processors, the network topology, the communication time, and congestion possibilities. Moreover, the proposed technique is simple, flexible, and computationally viable. The efficiency of the algorithm is demonstrated by two case studies with good results.
Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K
2010-01-01
An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Messagemore » Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering as much as over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.« less
NASA Astrophysics Data System (ADS)
Raphael, David T.; McIntee, Diane; Tsuruda, Jay S.; Colletti, Patrick; Tatevossian, Raymond; Frazier, James
2006-03-01
We explored multiple image processing approaches by which to display the segmented adult brachial plexus in a three-dimensional manner. Magnetic resonance neurography (MRN) 1.5-Tesla scans with STIR sequences, which preferentially highlight nerves, were performed in adult volunteers to generate high-resolution raw images. Using multiple software programs, the raw MRN images were then manipulated so as to achieve segmentation of plexus neurovascular structures, which were incorporated into three different visualization schemes: rotating upper thoracic girdle skeletal frames, dynamic fly-throughs parallel to the clavicle, and thin slab volume-rendered composite projections.
NASA Astrophysics Data System (ADS)
Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo
2012-02-01
We have revised a general purpose parallel molecular dynamics simulation program mm_par using the object-oriented programming. We parallelized the revised version using a hierarchical scheme in order to utilize more processors for a given system size. The benchmark result will be presented here. New version program summaryProgram title: mm_par2.0 Catalogue identifier: ADXP_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADXP_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 2 390 858 No. of bytes in distributed program, including test data, etc.: 25 068 310 Distribution format: tar.gz Programming language: C++ Computer: Any system operated by Linux or Unix Operating system: Linux Classification: 7.7 External routines: We provide wrappers for FFTW [1], Intel MKL library [2] FFT routine, and Numerical recipes [3] FFT, random number generator, and eigenvalue solver routines, SPRNG [4] random number generator, Mersenne Twister [5] random number generator, space filling curve routine. Catalogue identifier of previous version: ADXP_v1_0 Journal reference of previous version: Comput. Phys. Comm. 174 (2006) 560 Does the new version supersede the previous version?: Yes Nature of problem: Structural, thermodynamic, and dynamical properties of fluids and solids from microscopic scales to mesoscopic scales. Solution method: Molecular dynamics simulation in NVE, NVT, and NPT ensemble, Langevin dynamics simulation, dissipative particle dynamics simulation. Reasons for new version: First, object-oriented programming has been used, which is known to be open for extension and closed for modification. It is also known to be better for maintenance. Second, version 1.0 was based on atom decomposition and domain decomposition scheme [6] for parallelization. However, atom decomposition is not popular due to its poor scalability. On the other hand, domain decomposition scheme is better for scalability. It still has a limitation in utilizing a large number of cores on recent petascale computers due to the requirement that the domain size is larger than the potential cutoff distance. To go beyond such a limitation, a hierarchical parallelization scheme has been adopted in this new version and implemented using MPI [7] and OPENMP [8]. Summary of revisions: (1) Object-oriented programming has been used. (2) A hierarchical parallelization scheme has been adopted. (3) SPME routine has been fully parallelized with parallel 3D FFT using volumetric decomposition scheme [9]. K.J.O. thanks Mr. Seung Min Lee for useful discussion on programming and debugging. Running time: Running time depends on system size and methods used. For test system containing a protein (PDB id: 5DHFR) with CHARMM22 force field [10] and 7023 TIP3P [11] waters in simulation box having dimension 62.23 Å×62.23 Å×62.23 Å, the benchmark results are given in Fig. 1. Here the potential cutoff distance was set to 12 Å and the switching function was applied from 10 Å for the force calculation in real space. For the SPME [12] calculation, K, K, and K were set to 64 and the interpolation order was set to 4. To do the fast Fourier transform, we used Intel MKL library. All bonds including hydrogen atoms were constrained using SHAKE/RATTLE algorithms [13,14]. The code was compiled using Intel compiler version 11.1 and mvapich2 version 1.5. Fig. 2 shows performance gains from using CUDA-enabled version [15] of mm_par for 5DHFR simulation in water on Intel Core2Quad 2.83 GHz and GeForce GTX 580. Even though mm_par2.0 is not ported yet for GPU, its performance data would be useful to expect mm_par2.0 performance on GPU. Timing results for 1000 MD steps. 1, 2, 4, and 8 in the figure mean the number of OPENMP threads. Timing results for 1000 MD steps from double precision simulation on CPU, single precision simulation on GPU, and double precision simulation on GPU.
Approximate optimal guidance for the advanced launch system
NASA Technical Reports Server (NTRS)
Feeley, T. S.; Speyer, J. L.
1993-01-01
A real-time guidance scheme for the problem of maximizing the payload into orbit subject to the equations of motion for a rocket over a spherical, non-rotating earth is presented. An approximate optimal launch guidance law is developed based upon an asymptotic expansion of the Hamilton - Jacobi - Bellman or dynamic programming equation. The expansion is performed in terms of a small parameter, which is used to separate the dynamics of the problem into primary and perturbation dynamics. For the zeroth-order problem the small parameter is set to zero and a closed-form solution to the zeroth-order expansion term of Hamilton - Jacobi - Bellman equation is obtained. Higher-order terms of the expansion include the effects of the neglected perturbation dynamics. These higher-order terms are determined from the solution of first-order linear partial differential equations requiring only the evaluation of quadratures. This technique is preferred as a real-time, on-line guidance scheme to alternative numerical iterative optimization schemes because of the unreliable convergence properties of these iterative guidance schemes and because the quadratures needed for the approximate optimal guidance law can be performed rapidly and by parallel processing. Even if the approximate solution is not nearly optimal, when using this technique the zeroth-order solution always provides a path which satisfies the terminal constraints. Results for two-degree-of-freedom simulations are presented for the simplified problem of flight in the equatorial plane and compared to the guidance scheme generated by the shooting method which is an iterative second-order technique.
Saucedo-Espinosa, Mario A.; Lapizco-Encinas, Blanca H.
2016-01-01
Current monitoring is a well-established technique for the characterization of electroosmotic (EO) flow in microfluidic devices. This method relies on monitoring the time response of the electric current when a test buffer solution is displaced by an auxiliary solution using EO flow. In this scheme, each solution has a different ionic concentration (and electric conductivity). The difference in the ionic concentration of the two solutions defines the dynamic time response of the electric current and, hence, the current signal to be measured: larger concentration differences result in larger measurable signals. A small concentration difference is needed, however, to avoid dispersion at the interface between the two solutions, which can result in undesired pressure-driven flow that conflicts with the EO flow. Additional challenges arise as the conductivity of the test solution decreases, leading to a reduced electric current signal that may be masked by noise during the measuring process, making for a difficult estimation of an accurate EO mobility. This contribution presents a new scheme for current monitoring that employs multiple channels arranged in parallel, producing an increase in the signal-to-noise ratio of the electric current to be measured and increasing the estimation accuracy. The use of this parallel approach is particularly useful in the estimation of the EO mobility in systems where low conductivity mediums are required, such as insulator based dielectrophoresis devices. PMID:27375813
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kodavasal, Janardhan; Harms, Kevin; Srivastava, Priyesh
A closed-cycle gasoline compression ignition engine simulation near top dead center (TDC) was used to profile the performance of a parallel commercial engine computational fluid dynamics code, as it was scaled on up to 4096 cores of an IBM Blue Gene/Q supercomputer. The test case has 9 million cells near TDC, with a fixed mesh size of 0.15 mm, and was run on configurations ranging from 128 to 4096 cores. Profiling was done for a small duration of 0.11 crank angle degrees near TDC during ignition. Optimization of input/output performance resulted in a significant speedup in reading restart files, andmore » in an over 100-times speedup in writing restart files and files for post-processing. Improvements to communication resulted in a 1400-times speedup in the mesh load balancing operation during initialization, on 4096 cores. An improved, “stiffness-based” algorithm for load balancing chemical kinetics calculations was developed, which results in an over 3-times faster run-time near ignition on 4096 cores relative to the original load balancing scheme. With this improvement to load balancing, the code achieves over 78% scaling efficiency on 2048 cores, and over 65% scaling efficiency on 4096 cores, relative to 256 cores.« less
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications
NASA Technical Reports Server (NTRS)
Sun, Xian-He
1997-01-01
Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
Efficient Thread Labeling for Monitoring Programs with Nested Parallelism
NASA Astrophysics Data System (ADS)
Ha, Ok-Kyoon; Kim, Sun-Sook; Jun, Yong-Kee
It is difficult and cumbersome to detect data races occurred in an execution of parallel programs. Any on-the-fly race detection techniques using Lamport's happened-before relation needs a thread labeling scheme for generating unique identifiers which maintain logical concurrency information for the parallel threads. NR labeling is an efficient thread labeling scheme for the fork-join program model with nested parallelism, because its efficiency depends only on the nesting depth for every fork and join operation. This paper presents an improved NR labeling, called e-NR labeling, in which every thread generates its label by inheriting the pointer to its ancestor list from the parent threads or by updating the pointer in a constant amount of time and space. This labeling is more efficient than the NR labeling, because its efficiency does not depend on the nesting depth for every fork and join operation. Some experiments were performed with OpenMP programs having nesting depths of three or four and maximum parallelisms varying from 10,000 to 1,000,000. The results show that e-NR is 5 times faster than NR labeling and 4.3 times faster than OS labeling in the average time for creating and maintaining the thread labels. In average space required for labeling, it is 3.5 times smaller than NR labeling and 3 times smaller than OS labeling.
Dynamic analysis and control of lightweight manipulators with flexible parallel link mechanisms
NASA Technical Reports Server (NTRS)
Lee, Jeh Won
1991-01-01
The flexible parallel link mechanism is designed for increased rigidity to sustain the buckling when it carries a heavy payload. Compared to a one link flexible manipulator, a two link flexible manipulator, especially the flexible parallel mechanism, has more complicated characteristics in dynamics and control. The objective of this research is the theoretical analysis and the experimental verification of dynamics and control of a two link flexible manipulator with a flexible parallel link mechanism. Nonlinear equations of motion of the lightweight manipulator are derived by the Lagrangian method in symbolic form to better understand the structure of the dynamic model. A manipulator with a flexible parallel link mechanism is a constrained dynamic system whose equations are sensitive to numerical integration error. This constrained system is solved using singular value decomposition of the constraint Jacobian matrix. The discrepancies between the analytical model and the experiment are explained using a simplified and a detailed finite element model. The step response of the analytical model and the TREETOPS model match each other well. The nonlinear dynamics is studied using a sinusoidal excitation. The actuator dynamic effect on a flexible robot was investigated. The effects are explained by the root loci and the Bode plot theoretically and experimentally. For the base performance for the advanced control scheme, a simple decoupled feedback scheme is applied.
A general purpose subroutine for fast fourier transform on a distributed memory parallel machine
NASA Technical Reports Server (NTRS)
Dubey, A.; Zubair, M.; Grosch, C. E.
1992-01-01
One issue which is central in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation on a distributed memory parallel machine Intel iPSC/860 is evaluated.
Limpanuparb, Taweetham; Milthorpe, Josh; Rendell, Alistair P
2014-10-30
Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner including use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of integral calculation using X10's work stealing runtime, and report performance results for long-range HF energy calculation of large molecule/high quality basis running on up to 1024 cores of a high performance cluster machine. Copyright © 2014 Wiley Periodicals, Inc.
Customizing FP-growth algorithm to parallel mining with Charm++ library
NASA Astrophysics Data System (ADS)
Puścian, Marek
2017-08-01
This paper presents a frequent item mining algorithm that was customized to handle growing data repositories. The proposed solution applies Master Slave scheme to frequent pattern growth technique. Efficient utilization of available computation units is achieved by dynamic reallocation of tasks. Conditional frequent trees are assigned to parallel workers basing on their workload. Proposed enhancements have been successfully implemented using Charm++ library. This paper discusses results of the performance of parallelized FP-growth algorithm against different datasets. The approach has been illustrated with many experiments and measurements performed using multiprocessor and multithreaded computer.
A new conformal absorbing boundary condition for finite element meshes and parallelization of FEMATS
NASA Technical Reports Server (NTRS)
Chatterjee, A.; Volakis, J. L.; Nguyen, J.; Nurnberger, M.; Ross, D.
1993-01-01
Some of the progress toward the development and parallelization of an improved version of the finite element code FEMATS is described. This is a finite element code for computing the scattering by arbitrarily shaped three dimensional surfaces composite scatterers. The following tasks were worked on during the report period: (1) new absorbing boundary conditions (ABC's) for truncating the finite element mesh; (2) mixed mesh termination schemes; (3) hierarchical elements and multigridding; (4) parallelization; and (5) various modeling enhancements (antenna feeds, anisotropy, and higher order GIBC).
Approximation algorithms for scheduling unrelated parallel machines with release dates
NASA Astrophysics Data System (ADS)
Avdeenko, T. V.; Mesentsev, Y. A.; Estraykh, I. V.
2017-01-01
In this paper we propose approaches to optimal scheduling of unrelated parallel machines with release dates. One approach is based on the scheme of dynamic programming modified with adaptive narrowing of search domain ensuring its computational effectiveness. We discussed complexity of the exact schedules synthesis and compared it with approximate, close to optimal, solutions. Also we explain how the algorithm works for the example of two unrelated parallel machines and five jobs with release dates. Performance results that show the efficiency of the proposed approach have been given.
NASA Astrophysics Data System (ADS)
Haron, Adib; Mahdzair, Fazren; Luqman, Anas; Osman, Nazmie; Junid, Syed Abdul Mutalib Al
2018-03-01
One of the most significant constraints of Von Neumann architecture is the limited bandwidth between memory and processor. The cost to move data back and forth between memory and processor is considerably higher than the computation in the processor itself. This architecture significantly impacts the Big Data and data-intensive application such as DNA analysis comparison which spend most of the processing time to move data. Recently, the in-memory processing concept was proposed, which is based on the capability to perform the logic operation on the physical memory structure using a crossbar topology and non-volatile resistive-switching memristor technology. This paper proposes a scheme to map digital equality comparator circuit on memristive memory crossbar array. The 2-bit, 4-bit, 8-bit, 16-bit, 32-bit, and 64-bit of equality comparator circuit are mapped on memristive memory crossbar array by using material implication logic in a sequential and parallel method. The simulation results show that, for the 64-bit word size, the parallel mapping exhibits 2.8× better performance in total execution time than sequential mapping but has a trade-off in terms of energy consumption and area utilization. Meanwhile, the total crossbar area can be reduced by 1.2× for sequential mapping and 1.5× for parallel mapping both by using the overlapping technique.
Effective switching frequency multiplier inverter
Su, Gui-Jia [Oak Ridge, TN; Peng, Fang Z [Okemos, MI
2007-08-07
A switching frequency multiplier inverter for low inductance machines that uses parallel connection of switches and each switch is independently controlled according to a pulse width modulation scheme. The effective switching frequency is multiplied by the number of switches connected in parallel while each individual switch operates within its limit of switching frequency. This technique can also be used for other power converters such as DC/DC, AC/DC converters.
Parallelization Issues and Particle-In Codes.
NASA Astrophysics Data System (ADS)
Elster, Anne Cathrine
1994-01-01
"Everything should be made as simple as possible, but not simpler." Albert Einstein. The field of parallel scientific computing has concentrated on parallelization of individual modules such as matrix solvers and factorizers. However, many applications involve several interacting modules. Our analyses of a particle-in-cell code modeling charged particles in an electric field, show that these accompanying dependencies affect data partitioning and lead to new parallelization strategies concerning processor, memory and cache utilization. Our test-bed, a KSR1, is a distributed memory machine with a globally shared addressing space. However, most of the new methods presented hold generally for hierarchical and/or distributed memory systems. We introduce a novel approach that uses dual pointers on the local particle arrays to keep the particle locations automatically partially sorted. Complexity and performance analyses with accompanying KSR benchmarks, have been included for both this scheme and for the traditional replicated grids approach. The latter approach maintains load-balance with respect to particles. However, our results demonstrate it fails to scale properly for problems with large grids (say, greater than 128-by-128) running on as few as 15 KSR nodes, since the extra storage and computation time associated with adding the grid copies, becomes significant. Our grid partitioning scheme, although harder to implement, does not need to replicate the whole grid. Consequently, it scales well for large problems on highly parallel systems. It may, however, require load balancing schemes for non-uniform particle distributions. Our dual pointer approach may facilitate this through dynamically partitioned grids. We also introduce hierarchical data structures that store neighboring grid-points within the same cache -line by reordering the grid indexing. This alignment produces a 25% savings in cache-hits for a 4-by-4 cache. A consideration of the input data's effect on the simulation may lead to further improvements. For example, in the case of mean particle drift, it is often advantageous to partition the grid primarily along the direction of the drift. The particle-in-cell codes for this study were tested using physical parameters, which lead to predictable phenomena including plasma oscillations and two-stream instabilities. An overview of the most central references related to parallel particle codes is also given.
Fairness of QoS supporting in optical burst switching
NASA Astrophysics Data System (ADS)
Xuan, Xuelei; Liu, Hua; Chen, Chunfeng; Zhang, Zhizhong
2004-04-01
In this paper we investigate the fairness problem of offset-time-based quality of service (QoS) scheme proposed by Qiao and Dixit in optical burst switching (OBS) networks. In the proposed schemes, QoS relies on the fact that the requests for reservation further into the future, but for practical, benchmark offset-time of data bursts at the intermediate nodes is not equal to each other. Here, a new offset-time-based QoS scheme is introduced, where data bursts are classified according to their offset-time and isolated in the wavelength domain or time domain to achieve the parallel reservation. Through simulation, it is found that this scheme achieves fairness among data bursts with different priority.
Quantum Iterative Deepening with an Application to the Halting Problem
Tarrataca, Luís; Wichert, Andreas
2013-01-01
Classical models of computation traditionally resort to halting schemes in order to enquire about the state of a computation. In such schemes, a computational process is responsible for signaling an end of a calculation by setting a halt bit, which needs to be systematically checked by an observer. The capacity of quantum computational models to operate on a superposition of states requires an alternative approach. From a quantum perspective, any measurement of an equivalent halt qubit would have the potential to inherently interfere with the computation by provoking a random collapse amongst the states. This issue is exacerbated by undecidable problems such as the Entscheidungsproblem which require universal computational models, e.g. the classical Turing machine, to be able to proceed indefinitely. In this work we present an alternative view of quantum computation based on production system theory in conjunction with Grover's amplitude amplification scheme that allows for (1) a detection of halt states without interfering with the final result of a computation; (2) the possibility of non-terminating computation and (3) an inherent speedup to occur during computations susceptible of parallelization. We discuss how such a strategy can be employed in order to simulate classical Turing machines. PMID:23520465
A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors
Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S.; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
2011-01-01
Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities. PMID:22164116
A real-time capable software-defined receiver using GPU for adaptive anti-jam GPS sensors.
Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
2011-01-01
Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities.
Constraint treatment techniques and parallel algorithms for multibody dynamic analysis. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Chiou, Jin-Chern
1990-01-01
Computational procedures for kinematic and dynamic analysis of three-dimensional multibody dynamic (MBD) systems are developed from the differential-algebraic equations (DAE's) viewpoint. Constraint violations during the time integration process are minimized and penalty constraint stabilization techniques and partitioning schemes are developed. The governing equations of motion, a two-stage staggered explicit-implicit numerical algorithm, are treated which takes advantage of a partitioned solution procedure. A robust and parallelizable integration algorithm is developed. This algorithm uses a two-stage staggered central difference algorithm to integrate the translational coordinates and the angular velocities. The angular orientations of bodies in MBD systems are then obtained by using an implicit algorithm via the kinematic relationship between Euler parameters and angular velocities. It is shown that the combination of the present solution procedures yields a computationally more accurate solution. To speed up the computational procedures, parallel implementation of the present constraint treatment techniques, the two-stage staggered explicit-implicit numerical algorithm was efficiently carried out. The DAE's and the constraint treatment techniques were transformed into arrowhead matrices to which Schur complement form was derived. By fully exploiting the sparse matrix structural analysis techniques, a parallel preconditioned conjugate gradient numerical algorithm is used to solve the systems equations written in Schur complement form. A software testbed was designed and implemented in both sequential and parallel computers. This testbed was used to demonstrate the robustness and efficiency of the constraint treatment techniques, the accuracy of the two-stage staggered explicit-implicit numerical algorithm, and the speed up of the Schur-complement-based parallel preconditioned conjugate gradient algorithm on a parallel computer.
A Dynamic Finite Element Method for Simulating the Physics of Faults Systems
NASA Astrophysics Data System (ADS)
Saez, E.; Mora, P.; Gross, L.; Weatherley, D.
2004-12-01
We introduce a dynamic Finite Element method using a novel high level scripting language to describe the physical equations, boundary conditions and time integration scheme. The library we use is the parallel Finley library: a finite element kernel library, designed for solving large-scale problems. It is incorporated as a differential equation solver into a more general library called escript, based on the scripting language Python. This library has been developed to facilitate the rapid development of 3D parallel codes, and is optimised for the Australian Computational Earth Systems Simulator Major National Research Facility (ACcESS MNRF) supercomputer, a 208 processor SGI Altix with a peak performance of 1.1 TFlops. Using the scripting approach we obtain a parallel FE code able to take advantage of the computational efficiency of the Altix 3700. We consider faults as material discontinuities (the displacement, velocity, and acceleration fields are discontinuous at the fault), with elastic behavior. The stress continuity at the fault is achieved naturally through the expression of the fault interactions in the weak formulation. The elasticity problem is solved explicitly in time, using the Saint Verlat scheme. Finally, we specify a suitable frictional constitutive relation and numerical scheme to simulate fault behaviour. Our model is based on previous work on modelling fault friction and multi-fault systems using lattice solid-like models. We adapt the 2D model for simulating the dynamics of parallel fault systems described to the Finite-Element method. The approach uses a frictional relation along faults that is slip and slip-rate dependent, and the numerical integration approach introduced by Mora and Place in the lattice solid model. In order to illustrate the new Finite Element model, single and multi-fault simulation examples are presented.
A performance study of sparse Cholesky factorization on INTEL iPSC/860
NASA Technical Reports Server (NTRS)
Zubair, M.; Ghose, M.
1992-01-01
The problem of Cholesky factorization of a sparse matrix has been very well investigated on sequential machines. A number of efficient codes exist for factorizing large unstructured sparse matrices. However, there is a lack of such efficient codes on parallel machines in general, and distributed machines in particular. Some of the issues that are critical to the implementation of sparse Cholesky factorization on a distributed memory parallel machine are ordering, partitioning and mapping, load balancing, and ordering of various tasks within a processor. Here, we focus on the effect of various partitioning schemes on the performance of sparse Cholesky factorization on the Intel iPSC/860. Also, a new partitioning heuristic for structured as well as unstructured sparse matrices is proposed, and its performance is compared with other schemes.
Aerodynamic optimization by simultaneously updating flow variables and design parameters
NASA Technical Reports Server (NTRS)
Rizk, M. H.
1990-01-01
The application of conventional optimization schemes to aerodynamic design problems leads to inner-outer iterative procedures that are very costly. An alternative approach is presented based on the idea of updating the flow variable iterative solutions and the design parameter iterative solutions simultaneously. Two schemes based on this idea are applied to problems of correcting wind tunnel wall interference and optimizing advanced propeller designs. The first of these schemes is applicable to a limited class of two-design-parameter problems with an equality constraint. It requires the computation of a single flow solution. The second scheme is suitable for application to general aerodynamic problems. It requires the computation of several flow solutions in parallel. In both schemes, the design parameters are updated as the iterative flow solutions evolve. Computations are performed to test the schemes' efficiency, accuracy, and sensitivity to variations in the computational parameters.
Implementation of ADI: Schemes on MIMD parallel computers
NASA Technical Reports Server (NTRS)
Vanderwijngaart, Rob F.
1993-01-01
In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPU's) by having many off-the-shelf CPU's work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI-algorithm for the three dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPU's in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.
Ng, C M
2013-10-01
The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.
NASA Astrophysics Data System (ADS)
Herrick, Gregory Paul
The quest to accurately capture flow phenomena with length-scales both short and long and to accurately represent complex flow phenomena within disparately sized geometry inspires a need for an efficient, high-fidelity, multi-block structured computational fluid dynamics (CFD) parallel computational scheme. This research presents and demonstrates a more efficient computational method by which to perform multi-block structured CFD parallel computational simulations, thus facilitating higher-fidelity solutions of complicated geometries (due to the inclusion of grids for "small'' flow areas which are often merely modeled) and their associated flows. This computational framework offers greater flexibility and user-control in allocating the resource balance between process count and wall-clock computation time. The principal modifications implemented in this revision consist of a "multiple grid block per processing core'' software infrastructure and an analytic computation of viscous flux Jacobians. The development of this scheme is largely motivated by the desire to simulate axial compressor stall inception with more complete gridding of the flow passages (including rotor tip clearance regions) than has been previously done while maintaining high computational efficiency (i.e., minimal consumption of computational resources), and thus this paradigm shall be demonstrated with an examination of instability in a transonic axial compressor. However, the paradigm presented herein facilitates CFD simulation of myriad previously impractical geometries and flows and is not limited to detailed analyses of axial compressor flows. While the simulations presented herein were technically possible under the previous structure of the subject software, they were much less computationally efficient and thus not pragmatically feasible; the previous research using this software to perform three-dimensional, full-annulus, time-accurate, unsteady, full-stage (with sliding-interface) simulations of rotating stall inception in axial compressors utilized tip clearance periodic models, while the scheme here is demonstrated by a simulation of axial compressor stall inception utilizing gridded rotor tip clearance regions. As will be discussed, much previous research---experimental, theoretical, and computational---has suggested that understanding clearance flow behavior is critical to understanding stall inception, and previous computational research efforts which have used tip clearance models have begged the question, "What about the clearance flows?''. This research begins to address that question.
Forward Period Analysis Method of the Periodic Hamiltonian System.
Wang, Pengfei
2016-01-01
Using the forward period analysis (FPA), we obtain the period of a Morse oscillator and mathematical pendulum system, with the accuracy of 100 significant digits. From these results, the long-term [0, 1060] (time unit) solutions, ranging from the Planck time to the age of the universe, are computed reliably and quickly with a parallel multiple-precision Taylor series (PMT) scheme. The application of FPA to periodic systems can greatly reduce the computation time of long-term reliable simulations. This scheme provides an efficient way to generate reference solutions, against which long-term simulations using other schemes can be tested.
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes
NASA Astrophysics Data System (ADS)
Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt; Stuehn, Torsten
2017-11-01
Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach, the theoretical modeling and scaling laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. These two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.
Towards a large-scale scalable adaptive heart model using shallow tree meshes
NASA Astrophysics Data System (ADS)
Krause, Dorian; Dickopf, Thomas; Potse, Mark; Krause, Rolf
2015-10-01
Electrophysiological heart models are sophisticated computational tools that place high demands on the computing hardware due to the high spatial resolution required to capture the steep depolarization front. To address this challenge, we present a novel adaptive scheme for resolving the deporalization front accurately using adaptivity in space. Our adaptive scheme is based on locally structured meshes. These tensor meshes in space are organized in a parallel forest of trees, which allows us to resolve complicated geometries and to realize high variations in the local mesh sizes with a minimal memory footprint in the adaptive scheme. We discuss both a non-conforming mortar element approximation and a conforming finite element space and present an efficient technique for the assembly of the respective stiffness matrices using matrix representations of the inclusion operators into the product space on the so-called shallow tree meshes. We analyzed the parallel performance and scalability for a two-dimensional ventricle slice as well as for a full large-scale heart model. Our results demonstrate that the method has good performance and high accuracy.
Chai, Zhenhua; Zhao, T S
2014-07-01
In this paper, we propose a local nonequilibrium scheme for computing the flux of the convection-diffusion equation with a source term in the framework of the multiple-relaxation-time (MRT) lattice Boltzmann method (LBM). Both the Chapman-Enskog analysis and the numerical results show that, at the diffusive scaling, the present nonequilibrium scheme has a second-order convergence rate in space. A comparison between the nonequilibrium scheme and the conventional second-order central-difference scheme indicates that, although both schemes have a second-order convergence rate in space, the present nonequilibrium scheme is more accurate than the central-difference scheme. In addition, the flux computation rendered by the present scheme also preserves the parallel computation feature of the LBM, making the scheme more efficient than conventional finite-difference schemes in the study of large-scale problems. Finally, a comparison between the single-relaxation-time model and the MRT model is also conducted, and the results show that the MRT model is more accurate than the single-relaxation-time model, both in solving the convection-diffusion equation and in computing the flux.
Transient Finite Element Computations on a Variable Transputer System
NASA Technical Reports Server (NTRS)
Smolinski, Patrick J.; Lapczyk, Ireneusz
1993-01-01
A parallel program to analyze transient finite element problems was written and implemented on a system of transputer processors. The program uses the explicit time integration algorithm which eliminates the need for equation solving, making it more suitable for parallel computations. An interprocessor communication scheme was developed for arbitrary two dimensional grid processor configurations. Several 3-D problems were analyzed on a system with a small number of processors.
Solving Navier-Stokes equations on a massively parallel processor; The 1 GFLOP performance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Saati, A.; Biringen, S.; Farhat, C.
This paper reports on experience in solving large-scale fluid dynamics problems on the Connection Machine model CM-2. The authors have implemented a parallel version of the MacCormack scheme for the solution of the Navier-Stokes equations. By using triad floating point operations and reducing the number of interprocessor communications, they have achieved a sustained performance rate of 1.42 GFLOPS.
Fine-grained parallel RNAalifold algorithm for RNA secondary structure prediction on FPGA
Xia, Fei; Dou, Yong; Zhou, Xingming; Yang, Xuejun; Xu, Jiaqing; Zhang, Yang
2009-01-01
Background In the field of RNA secondary structure prediction, the RNAalifold algorithm is one of the most popular methods using free energy minimization. However, general-purpose computers including parallel computers or multi-core computers exhibit parallel efficiency of no more than 50%. Field Programmable Gate-Array (FPGA) chips provide a new approach to accelerate RNAalifold by exploiting fine-grained custom design. Results RNAalifold shows complicated data dependences, in which the dependence distance is variable, and the dependence direction is also across two dimensions. We propose a systolic array structure including one master Processing Element (PE) and multiple slave PEs for fine grain hardware implementation on FPGA. We exploit data reuse schemes to reduce the need to load energy matrices from external memory. We also propose several methods to reduce energy table parameter size by 80%. Conclusion To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete RNAalifold algorithm. The experimental results show a factor of 12.2 speedup over the RNAalifold (ViennaPackage – 1.6.5) software for a group of aligned RNA sequences with 2981-residue running on a Personal Computer (PC) platform with Pentium 4 2.6 GHz CPU. PMID:19208138
Massively parallel simulator of optical coherence tomography of inhomogeneous turbid media.
Malektaji, Siavash; Lima, Ivan T; Escobar I, Mauricio R; Sherif, Sherif S
2017-10-01
An accurate and practical simulator for Optical Coherence Tomography (OCT) could be an important tool to study the underlying physical phenomena in OCT such as multiple light scattering. Recently, many researchers have investigated simulation of OCT of turbid media, e.g., tissue, using Monte Carlo methods. The main drawback of these earlier simulators is the long computational time required to produce accurate results. We developed a massively parallel simulator of OCT of inhomogeneous turbid media that obtains both Class I diffusive reflectivity, due to ballistic and quasi-ballistic scattered photons, and Class II diffusive reflectivity due to multiply scattered photons. This Monte Carlo-based simulator is implemented on graphic processing units (GPUs), using the Compute Unified Device Architecture (CUDA) platform and programming model, to exploit the parallel nature of propagation of photons in tissue. It models an arbitrary shaped sample medium as a tetrahedron-based mesh and uses an advanced importance sampling scheme. This new simulator speeds up simulations of OCT of inhomogeneous turbid media by about two orders of magnitude. To demonstrate this result, we have compared the computation times of our new parallel simulator and its serial counterpart using two samples of inhomogeneous turbid media. We have shown that our parallel implementation reduced simulation time of OCT of the first sample medium from 407 min to 92 min by using a single GPU card, to 12 min by using 8 GPU cards and to 7 min by using 16 GPU cards. For the second sample medium, the OCT simulation time was reduced from 209 h to 35.6 h by using a single GPU card, and to 4.65 h by using 8 GPU cards, and to only 2 h by using 16 GPU cards. Therefore our new parallel simulator is considerably more practical to use than its central processing unit (CPU)-based counterpart. Our new parallel OCT simulator could be a practical tool to study the different physical phenomena underlying OCT, or to design OCT systems with improved performance. Copyright © 2017 Elsevier B.V. All rights reserved.
García-Calvo, Raúl; Guisado, J L; Diaz-Del-Rio, Fernando; Córdoba, Antonio; Jiménez-Morales, Francisco
2018-01-01
Understanding the regulation of gene expression is one of the key problems in current biology. A promising method for that purpose is the determination of the temporal dynamics between known initial and ending network states, by using simple acting rules. The huge amount of rule combinations and the nonlinear inherent nature of the problem make genetic algorithms an excellent candidate for finding optimal solutions. As this is a computationally intensive problem that needs long runtimes in conventional architectures for realistic network sizes, it is fundamental to accelerate this task. In this article, we study how to develop efficient parallel implementations of this method for the fine-grained parallel architecture of graphics processing units (GPUs) using the compute unified device architecture (CUDA) platform. An exhaustive and methodical study of various parallel genetic algorithm schemes-master-slave, island, cellular, and hybrid models, and various individual selection methods (roulette, elitist)-is carried out for this problem. Several procedures that optimize the use of the GPU's resources are presented. We conclude that the implementation that produces better results (both from the performance and the genetic algorithm fitness perspectives) is simulating a few thousands of individuals grouped in a few islands using elitist selection. This model comprises 2 mighty factors for discovering the best solutions: finding good individuals in a short number of generations, and introducing genetic diversity via a relatively frequent and numerous migration. As a result, we have even found the optimal solution for the analyzed gene regulatory network (GRN). In addition, a comparative study of the performance obtained by the different parallel implementations on GPU versus a sequential application on CPU is carried out. In our tests, a multifold speedup was obtained for our optimized parallel implementation of the method on medium class GPU over an equivalent sequential single-core implementation running on a recent Intel i7 CPU. This work can provide useful guidance to researchers in biology, medicine, or bioinformatics in how to take advantage of the parallelization on massively parallel devices and GPUs to apply novel metaheuristic algorithms powered by nature for real-world applications (like the method to solve the temporal dynamics of GRNs).
RICH: OPEN-SOURCE HYDRODYNAMIC SIMULATION ON A MOVING VORONOI MESH
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yalinewich, Almog; Steinberg, Elad; Sari, Re’em
2015-02-01
We present here RICH, a state-of-the-art two-dimensional hydrodynamic code based on Godunov’s method, on an unstructured moving mesh (the acronym stands for Racah Institute Computational Hydrodynamics). This code is largely based on the code AREPO. It differs from AREPO in the interpolation and time-advancement schemeS as well as a novel parallelization scheme based on Voronoi tessellation. Using our code, we study the pros and cons of a moving mesh (in comparison to a static mesh). We also compare its accuracy to other codes. Specifically, we show that our implementation of external sources and time-advancement scheme is more accurate and robustmore » than is AREPO when the mesh is allowed to move. We performed a parameter study of the cell rounding mechanism (Lloyd iterations) and its effects. We find that in most cases a moving mesh gives better results than a static mesh, but it is not universally true. In the case where matter moves in one way and a sound wave is traveling in the other way (such that relative to the grid the wave is not moving) a static mesh gives better results than a moving mesh. We perform an analytic analysis for finite difference schemes that reveals that a Lagrangian simulation is better than a Eulerian simulation in the case of a highly supersonic flow. Moreover, we show that Voronoi-based moving mesh schemes suffer from an error, which is resolution independent, due to inconsistencies between the flux calculation and the change in the area of a cell. Our code is publicly available as open source and designed in an object-oriented, user-friendly way that facilitates incorporation of new algorithms and physical processes.« less
Implementing and analyzing the multi-threaded LP-inference
NASA Astrophysics Data System (ADS)
Bolotova, S. Yu; Trofimenko, E. V.; Leschinskaya, M. V.
2018-03-01
The logical production equations provide new possibilities for the backward inference optimization in intelligent production-type systems. The strategy of a relevant backward inference is aimed at minimization of a number of queries to external information source (either to a database or an interactive user). The idea of the method is based on the computing of initial preimages set and searching for the true preimage. The execution of each stage can be organized independently and in parallel and the actual work at a given stage can also be distributed between parallel computers. This paper is devoted to the parallel algorithms of the relevant inference based on the advanced scheme of the parallel computations “pipeline” which allows to increase the degree of parallelism. The author also provides some details of the LP-structures implementation.
Provably secure identity-based identification and signature schemes from code assumptions
Zhao, Yiming
2017-01-01
Code-based cryptography is one of few alternatives supposed to be secure in a post-quantum world. Meanwhile, identity-based identification and signature (IBI/IBS) schemes are two of the most fundamental cryptographic primitives, so several code-based IBI/IBS schemes have been proposed. However, with increasingly profound researches on coding theory, the security reduction and efficiency of such schemes have been invalidated and challenged. In this paper, we construct provably secure IBI/IBS schemes from code assumptions against impersonation under active and concurrent attacks through a provably secure code-based signature technique proposed by Preetha, Vasant and Rangan (PVR signature), and a security enhancement Or-proof technique. We also present the parallel-PVR technique to decrease parameter values while maintaining the standard security level. Compared to other code-based IBI/IBS schemes, our schemes achieve not only preferable public parameter size, private key size, communication cost and signature length due to better parameter choices, but also provably secure. PMID:28809940
Provably secure identity-based identification and signature schemes from code assumptions.
Song, Bo; Zhao, Yiming
2017-01-01
Code-based cryptography is one of few alternatives supposed to be secure in a post-quantum world. Meanwhile, identity-based identification and signature (IBI/IBS) schemes are two of the most fundamental cryptographic primitives, so several code-based IBI/IBS schemes have been proposed. However, with increasingly profound researches on coding theory, the security reduction and efficiency of such schemes have been invalidated and challenged. In this paper, we construct provably secure IBI/IBS schemes from code assumptions against impersonation under active and concurrent attacks through a provably secure code-based signature technique proposed by Preetha, Vasant and Rangan (PVR signature), and a security enhancement Or-proof technique. We also present the parallel-PVR technique to decrease parameter values while maintaining the standard security level. Compared to other code-based IBI/IBS schemes, our schemes achieve not only preferable public parameter size, private key size, communication cost and signature length due to better parameter choices, but also provably secure.
Energy-efficient STDP-based learning circuits with memristor synapses
NASA Astrophysics Data System (ADS)
Wu, Xinyu; Saxena, Vishal; Campbell, Kristy A.
2014-05-01
It is now accepted that the traditional von Neumann architecture, with processor and memory separation, is ill suited to process parallel data streams which a mammalian brain can efficiently handle. Moreover, researchers now envision computing architectures which enable cognitive processing of massive amounts of data by identifying spatio-temporal relationships in real-time and solving complex pattern recognition problems. Memristor cross-point arrays, integrated with standard CMOS technology, are expected to result in massively parallel and low-power Neuromorphic computing architectures. Recently, significant progress has been made in spiking neural networks (SNN) which emulate data processing in the cortical brain. These architectures comprise of a dense network of neurons and the synapses formed between the axons and dendrites. Further, unsupervised or supervised competitive learning schemes are being investigated for global training of the network. In contrast to a software implementation, hardware realization of these networks requires massive circuit overhead for addressing and individually updating network weights. Instead, we employ bio-inspired learning rules such as the spike-timing-dependent plasticity (STDP) to efficiently update the network weights locally. To realize SNNs on a chip, we propose to use densely integrating mixed-signal integrate-andfire neurons (IFNs) and cross-point arrays of memristors in back-end-of-the-line (BEOL) of CMOS chips. Novel IFN circuits have been designed to drive memristive synapses in parallel while maintaining overall power efficiency (<1 pJ/spike/synapse), even at spike rate greater than 10 MHz. We present circuit design details and simulation results of the IFN with memristor synapses, its response to incoming spike trains and STDP learning characterization.
Research on the Application of Fast-steering Mirror in Stellar Interferometer
NASA Astrophysics Data System (ADS)
Mei, R.; Hu, Z. W.; Xu, T.; Sun, C. S.
2017-07-01
For a stellar interferometer, the fast-steering mirror (FSM) is widely utilized to correct wavefront tilt caused by atmospheric turbulence and internal instrumental vibration due to its high resolution and fast response frequency. In this study, the non-coplanar error between the FSM and actuator deflection axis introduced by manufacture, assembly, and adjustment is analyzed. Via a numerical method, the additional optical path difference (OPD) caused by above factors is studied, and its effects on tracking accuracy of stellar interferometer are also discussed. On the other hand, the starlight parallelism between the beams of two arms is one of the main factors of the loss of fringe visibility. By analyzing the influence of wavefront tilt caused by the atmospheric turbulence on fringe visibility, a simple and efficient real-time correction scheme of starlight parallelism is proposed based on a single array detector. The feasibility of this scheme is demonstrated by laboratory experiment. The results show that starlight parallelism meets the requirement of stellar interferometer in wavefront tilt preliminarily after the correction of fast-steering mirror.
Analysis of impact of general-purpose graphics processor units in supersonic flow modeling
NASA Astrophysics Data System (ADS)
Emelyanov, V. N.; Karpenko, A. G.; Kozelkov, A. S.; Teterina, I. V.; Volkov, K. N.; Yalozo, A. V.
2017-06-01
Computational methods are widely used in prediction of complex flowfields associated with off-normal situations in aerospace engineering. Modern graphics processing units (GPU) provide architectures and new programming models that enable to harness their large processing power and to design computational fluid dynamics (CFD) simulations at both high performance and low cost. Possibilities of the use of GPUs for the simulation of external and internal flows on unstructured meshes are discussed. The finite volume method is applied to solve three-dimensional unsteady compressible Euler and Navier-Stokes equations on unstructured meshes with high resolution numerical schemes. CUDA technology is used for programming implementation of parallel computational algorithms. Solutions of some benchmark test cases on GPUs are reported, and the results computed are compared with experimental and computational data. Approaches to optimization of the CFD code related to the use of different types of memory are considered. Speedup of solution on GPUs with respect to the solution on central processor unit (CPU) is compared. Performance measurements show that numerical schemes developed achieve 20-50 speedup on GPU hardware compared to CPU reference implementation. The results obtained provide promising perspective for designing a GPU-based software framework for applications in CFD.
New adaptive color quantization method based on self-organizing maps.
Chang, Chip-Hong; Xu, Pengfei; Xiao, Rui; Srikanthan, Thambipillai
2005-01-01
Color quantization (CQ) is an image processing task popularly used to convert true color images to palletized images for limited color display devices. To minimize the contouring artifacts introduced by the reduction of colors, a new competitive learning (CL) based scheme called the frequency sensitive self-organizing maps (FS-SOMs) is proposed to optimize the color palette design for CQ. FS-SOM harmonically blends the neighborhood adaptation of the well-known self-organizing maps (SOMs) with the neuron dependent frequency sensitive learning model, the global butterfly permutation sequence for input randomization, and the reinitialization of dead neurons to harness effective utilization of neurons. The net effect is an improvement in adaptation, a well-ordered color palette, and the alleviation of underutilization problem, which is the main cause of visually perceivable artifacts of CQ. Extensive simulations have been performed to analyze and compare the learning behavior and performance of FS-SOM against other vector quantization (VQ) algorithms. The results show that the proposed FS-SOM outperforms classical CL, Linde, Buzo, and Gray (LBG), and SOM algorithms. More importantly, FS-SOM achieves its superiority in reconstruction quality and topological ordering with a much greater robustness against variations in network parameters than the current art SOM algorithm for CQ. A most significant bit (MSB) biased encoding scheme is also introduced to reduce the number of parallel processing units. By mapping the pixel values as sign-magnitude numbers and biasing the magnitudes according to their sign bits, eight lattice points in the color space are condensed into one common point density function. Consequently, the same processing element can be used to map several color clusters and the entire FS-SOM network can be substantially scaled down without severely scarifying the quality of the displayed image. The drawback of this encoding scheme is the additional storage overhead, which can be cut down by leveraging on existing encoder in an overall lossy compression scheme.
Decentralized Fuzzy MPC on Spatial Power Control of a Large PHWR
NASA Astrophysics Data System (ADS)
Liu, Xiangjie; Jiang, Di; Lee, Kwang Y.
2016-08-01
Reliable power control for stabilizing the spatial oscillations is quite important for ensuring the safe operation of a modern pressurized heavy water reactor (PHWR), since these spatial oscillations can cause “flux tilting” in the reactor core. In this paper, a decentralized fuzzy model predictive control (DFMPC) is proposed for spatial control of PHWR. Due to the load dependent dynamics of the nuclear power plant, fuzzy modeling is used to approximate the nonlinear process. A fuzzy Lyapunov function and “quasi-min-max” strategy is utilized in designing the DFMPC, to reduce the conservatism. The plant-wide stability is achieved by the asymptotically positive realness constraint (APRC) for this decentralized MPC. The solving optimization problem is based on a receding horizon scheme involving the linear matrix inequalities (LMIs) technique. Through dynamic simulations, it is demonstrated that the designed DFMPC can effectively suppress spatial oscillations developed in PHWR, and further, shows the advantages over the typical parallel distributed compensation (PDC) control scheme.
Vibratory tactile display for textures
NASA Technical Reports Server (NTRS)
Ikei, Yasushi; Ikeno, Akihisa; Fukuda, Shuichi
1994-01-01
We have developed a tactile display that produces vibratory stimulus to a fingertip in contact with a vibrating tactor matrix. The display depicts tactile surface textures while the user is exploring a virtual object surface. A piezoelectric actuator drives the individual tactor in accordance with both the finger movement and the surface texture being traced. Spatiotemporal display control schemes were examined for presenting the fundamental surface texture elements. The temporal duration of vibratory stimulus was experimentally optimized to simulate the adaptation process of cutaneous sensation. The selected duration time for presenting a single line edge agreed with the time threshold of tactile sensation. Then spatial stimulus disposition schemes were discussed for representation of other edge shapes. As an alternative means not relying on amplitude control, a method of augmented duration at the edge was investigated. Spatial resolution of the display was measured for the lines presented both in perpendicular and parallel to a finger axis. Discrimination of texture density was also measured on random dot textures.
NASA Astrophysics Data System (ADS)
Hu, Guiqiang; Xiao, Di; Wang, Yong; Xiang, Tao; Zhou, Qing
2017-11-01
Recently, a new kind of image encryption approach using compressive sensing (CS) and double random phase encoding has received much attention due to the advantages such as compressibility and robustness. However, this approach is found to be vulnerable to chosen plaintext attack (CPA) if the CS measurement matrix is re-used. Therefore, designing an efficient measurement matrix updating mechanism that ensures resistance to CPA is of practical significance. In this paper, we provide a novel solution to update the CS measurement matrix by altering the secret sparse basis with the help of counter mode operation. Particularly, the secret sparse basis is implemented by a reality-preserving fractional cosine transform matrix. Compared with the conventional CS-based cryptosystem that totally generates all the random entries of measurement matrix, our scheme owns efficiency superiority while guaranteeing resistance to CPA. Experimental and analysis results show that the proposed scheme has a good security performance and has robustness against noise and occlusion.
Force sharing in high-power parallel servo-actuators
NASA Technical Reports Server (NTRS)
Neal, T. P.
1974-01-01
The various existing force sharing schemes were examined by conducting a literature survey. A list of potentially applicable concepts was compiled from this survey, and a brief analysis was then made of each concept, which resulted in two competing schemes being selected for in-depth evaluation. A functional design of the equalization logic for the two schemes was undertaken and specific space shuttle application was chosen for experimental evaluation. The application was scaled down so that existing hardware could be utilized. Next, an analog computer study was conducted to evaluate the more important characteristics of the two competing force sharing schemes. On the basis of the computers study, a final configuration was selected. A load simulator was then designed to evaluate this configuration on actual hardware.
Massively parallel GPU-accelerated minimization of classical density functional theory
NASA Astrophysics Data System (ADS)
Stopper, Daniel; Roth, Roland
2017-08-01
In this paper, we discuss the ability to numerically minimize the grand potential of hard disks in two-dimensional and of hard spheres in three-dimensional space within the framework of classical density functional and fundamental measure theory on modern graphics cards. Our main finding is that a massively parallel minimization leads to an enormous performance gain in comparison to standard sequential minimization schemes. Furthermore, the results indicate that in complex multi-dimensional situations, a heavy parallel minimization of the grand potential seems to be mandatory in order to reach a reasonable balance between accuracy and computational cost.
Non-contact angle measurement based on parallel multiplex laser feedback interferometry
NASA Astrophysics Data System (ADS)
Zhang, Song; Tan, Yi-Dong; Zhang, Shu-Lian
2014-11-01
We present a novel precise angle measurement scheme based on parallel multiplex laser feedback interferometry (PLFI), which outputs two parallel laser beams and thus their displacement difference reflects the angle variation of the target. Due to its ultrahigh sensitivity to the feedback light, PLFI realizes the direct non-contact measurement of non-cooperative targets. Experimental results show that PLFI has an accuracy of 8″ within a range of 1400″. The yaw of a guide is also measured and the experimental results agree with those of the dual-frequency laser interferometer Agilent 5529A.
NASA Astrophysics Data System (ADS)
Wan, Yuhong; Man, Tianlong; Wu, Fan; Kim, Myung K.; Wang, Dayong
2016-11-01
We present a new self-interference digital holographic approach that allows single-shot capturing three-dimensional intensity distribution of the spatially incoherent objects. The Fresnel incoherent correlation holographic microscopy is combined with parallel phase-shifting technique to instantaneously obtain spatially multiplexed phase-shifting holograms. The compressive-sensing-based reconstruction algorithm is implemented to reconstruct the original object from the under sampled demultiplexed holograms. The scheme is verified with simulations. The validity of the proposed method is experimentally demonstrated in an indirectly way by simulating the use of specific parallel phase-shifting recording device.
NASA Astrophysics Data System (ADS)
Nielsen, M.; Elezzabi, A. Y.
2013-03-01
To become a competitor to replace CMOS-electronics for next-generation data processing, signal routing, and computing, nanoplasmonic circuits will require an analogue to electrical vias in order to enable vertical connections between device layers. Vertically stacked nanoplasmonic nanoring resonators formed of Ag/Si/Ag gap plasmon waveguides were studied as a novel 3-D coupling scheme that could be monolithically integrated on a silicon platform. The vertically coupled ring resonators were evanescently coupled to 100 nm x 100 nm Ag/Si/Ag input and output waveguides and the whole device was submerged in silicon dioxide. 3-D finite difference time domain simulations were used to examine the transmission spectra of the coupling device with varying device sizes and orientations. By having the signal coupling occur over multiple trips around the resonator, coupling efficiencies as high as 39% at telecommunication wavelengths between adjacent layers were present with planar device areas of only 1.00 μm2. As the vertical signal transfer was based on coupled ring resonators, the signal transfer was inherently wavelength dependent. Changing the device size by varying the radii of the nanorings allowed for tailoring the coupled frequency spectra. The plasmonic resonator based coupling scheme was found to have quality (Q) factors of upwards of 30 at telecommunication wavelengths. By allowing different device layers to operate on different wavelengths, this coupling scheme could to lead to parallel processing in stacked independent device layers.
Formation of Electrostatic Potential Drops in the Auroral Zone
NASA Technical Reports Server (NTRS)
Schriver, D.; Ashour-Abdalla, M.; Richard, R. L.
2001-01-01
In order to examine the self-consistent formation of large-scale quasi-static parallel electric fields in the auroral zone on a micro/meso scale, a particle in cell simulation has been developed. The code resolves electron Debye length scales so that electron micro-processes are included and a variable grid scheme is used such that the overall length scale of the simulation is of the order of an Earth radii along the magnetic field. The simulation is electrostatic and includes the magnetic mirror force, as well as two types of plasmas, a cold dense ionospheric plasma and a warm tenuous magnetospheric plasma. In order to study the formation of parallel electric fields in the auroral zone, different magnetospheric ion and electron inflow boundary conditions are used to drive the system. It has been found that for conditions in the primary (upward) current region an upward directed quasi-static electric field can form across the system due to magnetic mirroring of the magnetospheric ions and electrons at different altitudes. For conditions in the return (downward) current region it is shown that a quasi-static parallel electric field in the opposite sense of that in the primary current region is formed, i.e., the parallel electric field is directed earthward. The conditions for how these different electric fields can be formed are discussed using satellite observations and numerical simulations.
Microwave Power Combiners for Signals of Arbitrary Amplitude
NASA Technical Reports Server (NTRS)
Conroy, Bruce; Hoppe, Daniel
2009-01-01
Schemes for combining power from coherent microwave sources of arbitrary (unequal or equal) amplitude have been proposed. Most prior microwave-power-combining schemes are limited to sources of equal amplitude. The basic principle of the schemes now proposed is to use quasi-optical components to manipulate the polarizations and phases of two arbitrary-amplitude input signals in such a way as to combine them into one output signal having a specified, fixed polarization. To combine power from more than two sources, one could use multiple powercombining stages based on this principle, feeding the outputs of lower-power stages as inputs to higher-power stages. Quasi-optical components suitable for implementing these schemes include grids of parallel wires, vane polarizers, and a variety of waveguide structures. For the sake of brevity, the remainder of this article illustrates the basic principle by focusing on one scheme in which a wire grid and two vane polarizers would be used. Wire grids are the key quasi-optical elements in many prior equal-power combiners. In somewhat oversimplified terms, a wire grid reflects an incident beam having an electric field parallel to the wires and passes an incident beam having an electric field perpendicular to the wires. In a typical prior equal-power combining scheme, one provides for two properly phased, equal-amplitude signals having mutually perpendicular linear polarizations to impinge from two mutually perpendicular directions on a wire grid in a plane oriented at an angle of 45 with respect to both beam axes. The wires in the grid are oriented to pass one of the incident beams straight through onto the output path and to reflect the other incident beam onto the output path along with the first-mentioned beam.
Kadlecek, Stephen; Hamedani, Hooman; Xu, Yinan; Emami, Kiarash; Xin, Yi; Ishii, Masaru; Rizi, Rahim
2013-10-01
Alveolar oxygen tension (Pao2) is sensitive to the interplay between local ventilation, perfusion, and alveolar-capillary membrane permeability, and thus reflects physiologic heterogeneity of healthy and diseased lung function. Several hyperpolarized helium ((3)He) magnetic resonance imaging (MRI)-based Pao2 mapping techniques have been reported, and considerable effort has gone toward reducing Pao2 measurement error. We present a new Pao2 imaging scheme, using parallel accelerated MRI, which significantly reduces measurement error. The proposed Pao2 mapping scheme was computer-simulated and was tested on both phantoms and five human subjects. Where possible, correspondence between actual local oxygen concentration and derived values was assessed for both bias (deviation from the true mean) and imaging artifact (deviation from the true spatial distribution). Phantom experiments demonstrated a significantly reduced coefficient of variation using the accelerated scheme. Simulation results support this observation and predict that correspondence between the true spatial distribution and the derived map is always superior using the accelerated scheme, although the improvement becomes less significant as the signal-to-noise ratio increases. Paired measurements in the human subjects, comparing accelerated and fully sampled schemes, show a reduced Pao2 distribution width for 41 of 46 slices. In contrast to proton MRI, acceleration of hyperpolarized imaging has no signal-to-noise penalty; its use in Pao2 measurement is therefore always beneficial. Comparison of multiple schemes shows that the benefit arises from a longer time-base during which oxygen-induced depolarization modifies the signal strength. Demonstration of the accelerated technique in human studies shows the feasibility of the method and suggests that measurement error is reduced here as well, particularly at low signal-to-noise levels. Copyright © 2013 AUR. Published by Elsevier Inc. All rights reserved.
A new parallel-vector finite element analysis software on distributed-memory computers
NASA Technical Reports Server (NTRS)
Qin, Jiangning; Nguyen, Duc T.
1993-01-01
A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.
Research Studies on Advanced Optical Module/Head Designs for Optical Data Storage
NASA Technical Reports Server (NTRS)
1992-01-01
Preprints are presented from the recent 1992 Optical Data Storage meeting in San Jose. The papers are divided into the following topical areas: Magneto-optical media (Modeling/design and fabrication/characterization/testing); Optical heads (holographic optical elements); and Optical heads (integrated optics). Some representative titles are as follow: Diffraction analysis and evaluation of several focus and track error detection schemes for magneto-optical disk systems; Proposal for massively parallel data storage system; Transfer function characteristics of super resolving systems; Modeling and measurement of a micro-optic beam deflector; Oxidation processes in magneto-optic and related materials; and A modal analysis of lamellar diffraction gratings in conical mountings.
Lu, Emily; Elizondo-Riojas, Miguel-Angel; Chang, Jeffrey T; Volk, David E
2014-06-10
Next-generation sequencing results from bead-based aptamer libraries have demonstrated that traditional DNA/RNA alignment software is insufficient. This is particularly true for X-aptamers containing specialty bases (W, X, Y, Z, ...) that are identified by special encoding. Thus, we sought an automated program that uses the inherent design scheme of bead-based X-aptamers to create a hypothetical reference library and Markov modeling techniques to provide improved alignments. Aptaligner provides this feature as well as length error and noise level cutoff features, is parallelized to run on multiple central processing units (cores), and sorts sequences from a single chip into projects and subprojects.
Kubař, Tomáš; Elstner, Marcus
2013-04-28
In this work, a fragment-orbital density functional theory-based method is combined with two different non-adiabatic schemes for the propagation of the electronic degrees of freedom. This allows us to perform unbiased simulations of electron transfer processes in complex media, and the computational scheme is applied to the transfer of a hole in solvated DNA. It turns out that the mean-field approach, where the wave function of the hole is driven into a superposition of adiabatic states, leads to over-delocalization of the hole charge. This problem is avoided using a surface hopping scheme, resulting in a smaller rate of hole transfer. The method is highly efficient due to the on-the-fly computation of the coarse-grained DFT Hamiltonian for the nucleobases, which is coupled to the environment using a QM/MM approach. The computational efficiency and partial parallel character of the methodology make it possible to simulate electron transfer in systems of relevant biochemical size on a nanosecond time scale. Since standard non-polarizable force fields are applied in the molecular-mechanics part of the calculation, a simple scaling scheme was introduced into the electrostatic potential in order to simulate the effect of electronic polarization. It is shown that electronic polarization has an important effect on the features of charge transfer. The methodology is applied to two kinds of DNA sequences, illustrating the features of transfer along a flat energy landscape as well as over an energy barrier. The performance and relative merit of the mean-field scheme and the surface hopping for this application are discussed.
NASA Astrophysics Data System (ADS)
Chen, R. J.; Wang, M.; Yan, X. L.; Yang, Q.; Lam, Y. H.; Yang, L.; Zhang, Y. H.
2017-12-01
The periodic signals tracking algorithm has been used to determine the revolution times of ions stored in storage rings in isochronous mass spectrometry (IMS) experiments. It has been a challenge to perform real-time data analysis by using the periodic signals tracking algorithm in the IMS experiments. In this paper, a parallelization scheme of the periodic signals tracking algorithm is introduced and a new program is developed. The computing time of data analysis can be reduced by a factor of ∼71 and of ∼346 by using our new program on Tesla C1060 GPU and Tesla K20c GPU, compared to using old program on Xeon E5540 CPU. We succeed in performing real-time data analysis for the IMS experiments by using the new program on Tesla K20c GPU.
Longitudinal train dynamics: an overview
NASA Astrophysics Data System (ADS)
Wu, Qing; Spiryagin, Maksym; Cole, Colin
2016-12-01
This paper discusses the evolution of longitudinal train dynamics (LTD) simulations, which covers numerical solvers, vehicle connection systems, air brake systems, wagon dumper systems and locomotives, resistance forces and gravitational components, vehicle in-train instabilities, and computing schemes. A number of potential research topics are suggested, such as modelling of friction, polymer, and transition characteristics for vehicle connection simulations, studies of wagon dumping operations, proper modelling of vehicle in-train instabilities, and computing schemes for LTD simulations. Evidence shows that LTD simulations have evolved with computing capabilities. Currently, advanced component models that directly describe the working principles of the operation of air brake systems, vehicle connection systems, and traction systems are available. Parallel computing is a good solution to combine and simulate all these advanced models. Parallel computing can also be used to conduct three-dimensional long train dynamics simulations.
Real-time computing platform for spiking neurons (RT-spike).
Ros, Eduardo; Ortigosa, Eva M; Agís, Rodrigo; Carrillo, Richard; Arnold, Michael
2006-07-01
A computing platform is described for simulating arbitrary networks of spiking neurons in real time. A hybrid computing scheme is adopted that uses both software and hardware components to manage the tradeoff between flexibility and computational power; the neuron model is implemented in hardware and the network model and the learning are implemented in software. The incremental transition of the software components into hardware is supported. We focus on a spike response model (SRM) for a neuron where the synapses are modeled as input-driven conductances. The temporal dynamics of the synaptic integration process are modeled with a synaptic time constant that results in a gradual injection of charge. This type of model is computationally expensive and is not easily amenable to existing software-based event-driven approaches. As an alternative we have designed an efficient time-based computing architecture in hardware, where the different stages of the neuron model are processed in parallel. Further improvements occur by computing multiple neurons in parallel using multiple processing units. This design is tested using reconfigurable hardware and its scalability and performance evaluated. Our overall goal is to investigate biologically realistic models for the real-time control of robots operating within closed action-perception loops, and so we evaluate the performance of the system on simulating a model of the cerebellum where the emulation of the temporal dynamics of the synaptic integration process is important.
Porting Gravitational Wave Signal Extraction to Parallel Virtual Machine (PVM)
NASA Technical Reports Server (NTRS)
Thirumalainambi, Rajkumar; Thompson, David E.; Redmon, Jeffery
2009-01-01
Laser Interferometer Space Antenna (LISA) is a planned NASA-ESA mission to be launched around 2012. The Gravitational Wave detection is fundamentally the determination of frequency, source parameters, and waveform amplitude derived in a specific order from the interferometric time-series of the rotating LISA spacecrafts. The LISA Science Team has developed a Mock LISA Data Challenge intended to promote the testing of complicated nested search algorithms to detect the 100-1 millihertz frequency signals at amplitudes of 10E-21. However, it has become clear that, sequential search of the parameters is very time consuming and ultra-sensitive; hence, a new strategy has been developed. Parallelization of existing sequential search algorithms of Gravitational Wave signal identification consists of decomposing sequential search loops, beginning with outermost loops and working inward. In this process, the main challenge is to detect interdependencies among loops and partitioning the loops so as to preserve concurrency. Existing parallel programs are based upon either shared memory or distributed memory paradigms. In PVM, master and node programs are used to execute parallelization and process spawning. The PVM can handle process management and process addressing schemes using a virtual machine configuration. The task scheduling and the messaging and signaling can be implemented efficiently for the LISA Gravitational Wave search process using a master and 6 nodes. This approach is accomplished using a server that is available at NASA Ames Research Center, and has been dedicated to the LISA Data Challenge Competition. Historically, gravitational wave and source identification parameters have taken around 7 days in this dedicated single thread Linux based server. Using PVM approach, the parameter extraction problem can be reduced to within a day. The low frequency computation and a proxy signal-to-noise ratio are calculated in separate nodes that are controlled by the master using message and vector of data passing. The message passing among nodes follows a pattern of synchronous and asynchronous send-and-receive protocols. The communication model and the message buffers are allocated dynamically to address rapid search of gravitational wave source information in the Mock LISA data sets.
Yau, Wei-Chuen; Phan, Raphael C-W
2013-12-01
Many authentication schemes have been proposed for telecare medicine information systems (TMIS) to ensure the privacy, integrity, and availability of patient records. These schemes are crucial for TMIS systems because otherwise patients' medical records become susceptible to tampering thus hampering diagnosis or private medical conditions of patients could be disclosed to parties who do not have a right to access such information. Very recently, Hao et al. proposed a chaotic map-based authentication scheme for telecare medicine information systems in a recent issue of Journal of Medical Systems. They claimed that the authentication scheme can withstand various attacks and it is secure to be used in TMIS. In this paper, we show that this authentication scheme is vulnerable to key-compromise impersonation attacks, off-line password guessing attacks upon compromising of a smart card, and parallel session attacks. We also exploit weaknesses in the password change phase of the scheme to mount a denial-of-service attack. Our results show that this scheme cannot be used to provide security in a telecare medicine information system.
NASA Astrophysics Data System (ADS)
Li, Hao; Xie, Lunguo
2013-03-01
The design of cache system for Chip Multiprocessor (CMP) face many challenges because future CMPs will have more cores and greater on-chip cache capacity. There are two base design schemes about L2 cache: private scheme in which each L2 slice is treated as a private L2 cache and shared scheme in which all L2 slices are treated as a large L2 cache shared by all cores. Private caches provide the lowest hit latency but reduce the total effective cache capacity. A shared L2 cache increases the effective cache capacity but has long hit latencies when data is on a remote tile. This paper present a new Controlled Replication (CR) policy to reduce the capacities occupied by redundant shared replicas. the new CR policy increases the effective capacity than victim replication scheme and has lower hit latency than shared scheme. We evaluate the various schemes using full-system simulation of parallel applications. Results show that CR reduces the average memory access latency of shared scheme by an average of 13%, providing better overall performance than victim replication and shared schemes.
A high-order language for a system of closely coupled processing elements
NASA Technical Reports Server (NTRS)
Feyock, S.; Collins, W. R.
1986-01-01
The research reported in this paper was occasioned by the requirements on part of the Real-Time Digital Simulator (RTDS) project under way at NASA Lewis Research Center. The RTDS simulation scheme employs a network of CPUs running lock-step cycles in the parallel computations of jet airplane simulations. Their need for a high order language (HOL) that would allow non-experts to write simulation applications and that could be implemented on a possibly varying network can best be fulfilled by using the programming language Ada. We describe how the simulation problems can be modeled in Ada, how to map a single, multi-processing Ada program into code for individual processors, regardless of network reconfiguration, and why some Ada language features are particulary well-suited to network simulations.
Nuclear Exodus: Herpesviruses Lead the Way
Bigalke, Janna M.; Heldwein, Ekaterina E.
2016-01-01
Most DNA viruses replicate in the nucleus and exit it either by passing through the nuclear pores or by rupturing the nuclear envelope. Unusually, herpesviruses have evolved a complex mechanism of nuclear escape whereby nascent capsids bud at the inner nuclear membrane to form perinuclear virions that subsequently fuse with the outer nuclear membrane, releasing capsids into the cytosol. Although this general scheme is accepted in the field, the players and their roles are still debated. Recent studies illuminated critical mechanistic features of this enigmatic process and uncovered surprising parallels with a novel cellular nuclear export process. This review summarizes our current understanding of nuclear egress in herpesviruses, examines the experimental evidence and models, and outlines outstanding questions with the goal of stimulating new research in this area. PMID:27482898
Nieder, Andreas; Miller, Earl K
2003-01-09
Whether cognitive representations are better conceived as language-based, symbolic representations or perceptually related, analog representations is a subject of debate. If cognitive processes parallel perceptual processes, then fundamental psychophysical laws should hold for each. To test this, we analyzed both behavioral and neuronal representations of numerosity in the prefrontal cortex of rhesus monkeys. The data were best described by a nonlinearly compressed scaling of numerical information, as postulated by the Weber-Fechner law or Stevens' law for psychophysical/sensory magnitudes. This nonlinear compression was observed on the neural level during the acquisition phase of the task and maintained through the memory phase with no further compression. These results suggest that certain cognitive and perceptual/sensory representations share the same fundamental mechanisms and neural coding schemes.
Linux Kernel Co-Scheduling and Bulk Synchronous Parallelism
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jones, Terry R
2012-01-01
This paper describes a kernel scheduling algorithm that is based on coscheduling principles and that is intended for parallel applications running on 1000 cores or more. Experimental results for a Linux implementation on a Cray XT5 machine are presented. The results indicate that Linux is a suitable operating system for this new scheduling scheme, and that this design provides a dramatic improvement in scaling performance for synchronizing collective operations at scale.
Design consideration in constructing high performance embedded Knowledge-Based Systems (KBS)
NASA Technical Reports Server (NTRS)
Dalton, Shelly D.; Daley, Philip C.
1988-01-01
As the hardware trends for artificial intelligence (AI) involve more and more complexity, the process of optimizing the computer system design for a particular problem will also increase in complexity. Space applications of knowledge based systems (KBS) will often require an ability to perform both numerically intensive vector computations and real time symbolic computations. Although parallel machines can theoretically achieve the speeds necessary for most of these problems, if the application itself is not highly parallel, the machine's power cannot be utilized. A scheme is presented which will provide the computer systems engineer with a tool for analyzing machines with various configurations of array, symbolic, scaler, and multiprocessors. High speed networks and interconnections make customized, distributed, intelligent systems feasible for the application of AI in space. The method presented can be used to optimize such AI system configurations and to make comparisons between existing computer systems. It is an open question whether or not, for a given mission requirement, a suitable computer system design can be constructed for any amount of money.
High Performance Radiation Transport Simulations on TITAN
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baker, Christopher G; Davidson, Gregory G; Evans, Thomas M
2012-01-01
In this paper we describe the Denovo code system. Denovo solves the six-dimensional, steady-state, linear Boltzmann transport equation, of central importance to nuclear technology applications such as reactor core analysis (neutronics), radiation shielding, nuclear forensics and radiation detection. The code features multiple spatial differencing schemes, state-of-the-art linear solvers, the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm for inverting the transport operator, a new multilevel energy decomposition method scaling to hundreds of thousands of processing cores, and a modern, novel code architecture that supports straightforward integration of new features. In this paper we discuss the performance of Denovo on the 10--20 petaflop ORNLmore » GPU-based system, Titan. We describe algorithms and techniques used to exploit the capabilities of Titan's heterogeneous compute node architecture and the challenges of obtaining good parallel performance for this sparse hyperbolic PDE solver containing inherently sequential computations. Numerical results demonstrating Denovo performance on early Titan hardware are presented.« less
Fourth order discretization of anisotropic heat conduction operator
NASA Astrophysics Data System (ADS)
Krasheninnikova, Natalia; Chacon, Luis
2008-11-01
In magnetized plasmas, heat conduction plays an important role in such processes as energy confinement, turbulence, and a number of instabilities. As a consequence of the presence of a magnetic field, heat transport is strongly anisotropic, with energy flowing preferentially along the magnetic field direction. This in turn results in parallel and perpendicular heat conduction coefficients being separated by orders of magnitude. The computational difficulties in treating such heat conduction anisotropies are significant, as perpendicular dynamics numerically is polluted by the parallel one. In this work, we report on progress of the implementation of a fourth order, conservative finite volume discretization scheme for the anisotropic heat conduction operator into the extended MHD code PIXIE3D [1]. We will demonstrate its spatial discretization accuracy and its effectiveness with two physical applications of interest, both of which feature a strong sensitivity to the heat conduction anisotropy: the thermal instability and the neoclassical tearing mode. [1] L. Chacon Phys. Plasmas 15, 056103 (2008)
Electromagnetically induced disintegration and polarization plane rotation of laser pulses
NASA Astrophysics Data System (ADS)
Parshkov, Oleg M.; Budyak, Victoria V.; Kochetkova, Anastasia E.
2017-04-01
The numerical simulation results of disintegration effect of linear polarized shot probe pulses of electromagnetically induced transparency in the counterintuitive superposed linear polarized control field are presented. It is shown, that this disintegration occurs, if linear polarizations of interacting pulses are not parallel or mutually perpendicular. In case of weak input probe field the polarization of one probe pulse in the medium is parallel, whereas the polarization of another probe pulse is perpendicular to polarization direction of input control radiation. The concerned effect is analogous to the effect, which must to take place when short laser pulse propagates along main axes of biaxial crystal because of group velocity of normal mod difference. The essential difference of probe pulse disintegration and linear process in biaxial crystal is that probe pulse preserves linear polarization in all stages of propagation. The numerical simulation is performed for scheme of degenerated quantum transitions between 3P0 , 3P01 and 3P2 energy levels of 208Pb isotope.
NASA Astrophysics Data System (ADS)
Urban, Matthias; Möller, Robert; Fritzsche, Wolfgang
2003-02-01
DNA analytics is a growing field based on the increasing knowledge about the genome with special implications for the understanding of molecular bases for diseases. Driven by the need for cost-effective and high-throughput methods for molecular detection, DNA chips are an interesting alternative to more traditional analytical methods in this field. The standard readout principle for DNA chips is fluorescence based. Fluorescence is highly sensitive and broadly established, but shows limitations regarding quantification (due to signal and/or dye instability) and the need for sophisticated (and therefore high-cost) equipment. This article introduces a readout system for an alternative detection scheme based on electrical detection of nanoparticle-labeled DNA. If labeled DNA is present in the analyte solution, it will bind on complementary capture DNA immobilized in a microelectrode gap. A subsequent metal enhancement step leads to a deposition of conductive material on the nanoparticles, and finally an electrical contact between the electrodes. This detection scheme offers the potential for a simple (low-cost as well as robust) and highly miniaturizable method, which could be well-suited for point-of-care applications in the context of lab-on-a-chip technologies. The demonstrated apparatus allows a parallel readout of an entire array of microstructured measurement sites. The readout is combined with data-processing by an embedded personal computer, resulting in an autonomous instrument that measures and presents the results. The design and realization of such a system is described, and first measurements are presented.
Fast Acceleration of 2D Wave Propagation Simulations Using Modern Computational Accelerators
Wang, Wei; Xu, Lifan; Cavazos, John; Huang, Howie H.; Kay, Matthew
2014-01-01
Recent developments in modern computational accelerators like Graphics Processing Units (GPUs) and coprocessors provide great opportunities for making scientific applications run faster than ever before. However, efficient parallelization of scientific code using new programming tools like CUDA requires a high level of expertise that is not available to many scientists. This, plus the fact that parallelized code is usually not portable to different architectures, creates major challenges for exploiting the full capabilities of modern computational accelerators. In this work, we sought to overcome these challenges by studying how to achieve both automated parallelization using OpenACC and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as Intel Many Integrated Core (MIC) coprocessor to reduce the run time of wave propagation simulations. We used a well-established 2D cardiac action potential model as a specific case-study. To the best of our knowledge, we are the first to study auto-parallelization of 2D cardiac wave propagation simulations using OpenACC. Our results identify several approaches that provide substantial speedups. The OpenACC-generated GPU code achieved more than speedup above the sequential implementation and required the addition of only a few OpenACC pragmas to the code. An OpenCL implementation provided speedups on GPUs of at least faster than the sequential implementation and faster than a parallelized OpenMP implementation. An implementation of OpenMP on Intel MIC coprocessor provided speedups of with only a few code changes to the sequential implementation. We highlight that OpenACC provides an automatic, efficient, and portable approach to achieve parallelization of 2D cardiac wave simulations on GPUs. Our approach of using OpenACC, OpenCL, and OpenMP to parallelize this particular model on modern computational accelerators should be applicable to other computational models of wave propagation in multi-dimensional media. PMID:24497950
Suppressing correlations in massively parallel simulations of lattice models
NASA Astrophysics Data System (ADS)
Kelling, Jeffrey; Ódor, Géza; Gemming, Sibylle
2017-11-01
For lattice Monte Carlo simulations parallelization is crucial to make studies of large systems and long simulation time feasible, while sequential simulations remain the gold-standard for correlation-free dynamics. Here, various domain decomposition schemes are compared, concluding with one which delivers virtually correlation-free simulations on GPUs. Extensive simulations of the octahedron model for 2 + 1 dimensional Kardar-Parisi-Zhang surface growth, which is very sensitive to correlation in the site-selection dynamics, were performed to show self-consistency of the parallel runs and agreement with the sequential algorithm. We present a GPU implementation providing a speedup of about 30 × over a parallel CPU implementation on a single socket and at least 180 × with respect to the sequential reference.
Parallel, adaptive finite element methods for conservation laws
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Devine, Karen D.; Flaherty, Joseph E.
1994-01-01
We construct parallel finite element methods for the solution of hyperbolic conservation laws in one and two dimensions. Spatial discretization is performed by a discontinuous Galerkin finite element method using a basis of piecewise Legendre polynomials. Temporal discretization utilizes a Runge-Kutta method. Dissipative fluxes and projection limiting prevent oscillations near solution discontinuities. A posteriori estimates of spatial errors are obtained by a p-refinement technique using superconvergence at Radau points. The resulting method is of high order and may be parallelized efficiently on MIMD computers. We compare results using different limiting schemes and demonstrate parallel efficiency through computations on an NCUBE/2 hypercube. We also present results using adaptive h- and p-refinement to reduce the computational cost of the method.
A comparative study of serial and parallel aeroelastic computations of wings
NASA Technical Reports Server (NTRS)
Byun, Chansup; Guruswamy, Guru P.
1994-01-01
A procedure for computing the aeroelasticity of wings on parallel multiple-instruction, multiple-data (MIMD) computers is presented. In this procedure, fluids are modeled using Euler equations, and structures are modeled using modal or finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. In the present parallel procedure, each computational domain is scalable. A parallel integration scheme is used to compute aeroelastic responses by solving fluid and structural equations concurrently. The computational efficiency issues of parallel integration of both fluid and structural equations are investigated in detail. This approach, which reduces the total computational time by a factor of almost 2, is demonstrated for a typical aeroelastic wing by using various numbers of processors on the Intel iPSC/860.
Implementation of parallel moment equations in NIMROD
NASA Astrophysics Data System (ADS)
Lee, Hankyu Q.; Held, Eric D.; Ji, Jeong-Young
2017-10-01
As collisionality is low (the Knudsen number is large) in many plasma applications, kinetic effects become important, particularly in parallel dynamics for magnetized plasmas. Fluid models can capture some kinetic effects when integral parallel closures are adopted. The adiabatic and linear approximations are used in solving general moment equations to obtain the integral closures. In this work, we present an effort to incorporate non-adiabatic (time-dependent) and nonlinear effects into parallel closures. Instead of analytically solving the approximate moment system, we implement exact parallel moment equations in the NIMROD fluid code. The moment code is expected to provide a natural convergence scheme by increasing the number of moments. Work in collaboration with the PSI Center and supported by the U.S. DOE under Grant Nos. DE-SC0014033, DE-SC0016256, and DE-FG02-04ER54746.
The upwind control volume scheme for unstructured triangular grids
NASA Technical Reports Server (NTRS)
Giles, Michael; Anderson, W. Kyle; Roberts, Thomas W.
1989-01-01
A new algorithm for the numerical solution of the Euler equations is presented. This algorithm is particularly suited to the use of unstructured triangular meshes, allowing geometric flexibility. Solutions are second-order accurate in the steady state. Implementation of the algorithm requires minimal grid connectivity information, resulting in modest storage requirements, and should enhance the implementation of the scheme on massively parallel computers. A novel form of upwind differencing is developed, and is shown to yield sharp resolution of shocks. Two new artificial viscosity models are introduced that enhance the performance of the new scheme. Numerical results for transonic airfoil flows are presented, which demonstrate the performance of the algorithm.
Experimental evaluation of multiprocessor cache-based error recovery
NASA Technical Reports Server (NTRS)
Janssens, Bob; Fuchs, W. K.
1991-01-01
Several variations of cache-based checkpointing for rollback error recovery in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, the performance effect of integrating the recovery schemes in the cache coherence protocol are evaluated. The results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead but uncontrollable high variability in the checkpoint interval.
NASA Astrophysics Data System (ADS)
Vaidya, Bhargav; Prasad, Deovrat; Mignone, Andrea; Sharma, Prateek; Rickler, Luca
2017-12-01
An important ingredient in numerical modelling of high temperature magnetized astrophysical plasmas is the anisotropic transport of heat along magnetic field lines from higher to lower temperatures. Magnetohydrodynamics typically involves solving the hyperbolic set of conservation equations along with the induction equation. Incorporating anisotropic thermal conduction requires to also treat parabolic terms arising from the diffusion operator. An explicit treatment of parabolic terms will considerably reduce the simulation time step due to its dependence on the square of the grid resolution (Δx) for stability. Although an implicit scheme relaxes the constraint on stability, it is difficult to distribute efficiently on a parallel architecture. Treating parabolic terms with accelerated super-time-stepping (STS) methods has been discussed in literature, but these methods suffer from poor accuracy (first order in time) and also have difficult-to-choose tuneable stability parameters. In this work, we highlight a second-order (in time) Runge-Kutta-Legendre (RKL) scheme (first described by Meyer, Balsara & Aslam 2012) that is robust, fast and accurate in treating parabolic terms alongside the hyperbolic conversation laws. We demonstrate its superiority over the first-order STS schemes with standard tests and astrophysical applications. We also show that explicit conduction is particularly robust in handling saturated thermal conduction. Parallel scaling of explicit conduction using RKL scheme is demonstrated up to more than 104 processors.
Analysis of fault-tolerant neurocontrol architectures
NASA Technical Reports Server (NTRS)
Troudet, T.; Merrill, W.
1992-01-01
The fault-tolerance of analog parallel distributed implementations of a multivariable aircraft neurocontroller is analyzed by simulating weight and neuron failures in a simplified scheme of analog processing based on the functional architecture of the ETANN chip (Electrically Trainable Artificial Neural Network). The neural information processing is found to be only partially distributed throughout the set of weights of the neurocontroller synthesized with the backpropagation algorithm. Although the degree of distribution of the neural processing, and consequently the fault-tolerance of the neurocontroller, could be enhanced using Locally Distributed Weight and Neuron Approaches, a satisfactory level of fault-tolerance could only be obtained by retraining the degrated VLSI neurocontroller. The possibility of maintaining neurocontrol performance and stability in the presence of single weight of neuron failures was demonstrated through an automated retraining procedure of the neurocontroller based on a pre-programmed choice and sequence of the training parameters.
NASA Astrophysics Data System (ADS)
Noh, S.; Tachikawa, Y.; Shiiba, M.; Kim, S.
2011-12-01
Applications of the sequential data assimilation methods have been increasing in hydrology to reduce uncertainty in the model prediction. In a distributed hydrologic model, there are many types of state variables and each variable interacts with each other based on different time scales. However, the framework to deal with the delayed response, which originates from different time scale of hydrologic processes, has not been thoroughly addressed in the hydrologic data assimilation. In this study, we propose the lagged filtering scheme to consider the lagged response of internal states in a distributed hydrologic model using two filtering schemes; particle filtering (PF) and ensemble Kalman filtering (EnKF). The EnKF is one of the widely used sub-optimal filters implementing an efficient computation with limited number of ensemble members, however, still based on Gaussian approximation. PF can be an alternative in which the propagation of all uncertainties is carried out by a suitable selection of randomly generated particles without any assumptions about the nature of the distributions involved. In case of PF, advanced particle regularization scheme is implemented together to preserve the diversity of the particle system. In case of EnKF, the ensemble square root filter (EnSRF) are implemented. Each filtering method is parallelized and implemented in the high performance computing system. A distributed hydrologic model, the water and energy transfer processes (WEP) model, is applied for the Katsura River catchment, Japan to demonstrate the applicability of proposed approaches. Forecasted results via PF and EnKF are compared and analyzed in terms of the prediction accuracy and the probabilistic adequacy. Discussions are focused on the prospects and limitations of each data assimilation method.
Protocol Processing for 100 Gbit/s and Beyond - A Soft Real-Time Approach in Hardware and Software
NASA Astrophysics Data System (ADS)
Büchner, Steffen; Lopacinski, Lukasz; Kraemer, Rolf; Nolte, Jörg
2017-09-01
100 Gbit/s wireless communication protocol processing stresses all parts of a communication system until the outermost. The efficient use of upcoming 100 Gbit/s and beyond transmission technology requires the rethinking of the way protocols are processed by the communication endpoints. This paper summarizes the achievements of the project End2End100. We will present a comprehensive soft real-time stream processing approach that allows the protocol designer to develop, analyze, and plan scalable protocols for ultra high data rates of 100 Gbit/s and beyond. Furthermore, we will present an ultra-low power, adaptable, and massively parallelized FEC (Forward Error Correction) scheme that detects and corrects bit errors at line rate with an energy consumption between 1 pJ/bit and 13 pJ/bit. The evaluation results discussed in this publication show that our comprehensive approach allows end-to-end communication with a very low protocol processing overhead.
NASA Technical Reports Server (NTRS)
Shu, Chi-Wang
1992-01-01
The nonlinear stability of compact schemes for shock calculations is investigated. In recent years compact schemes were used in various numerical simulations including direct numerical simulation of turbulence. However to apply them to problems containing shocks, one has to resolve the problem of spurious numerical oscillation and nonlinear instability. A framework to apply nonlinear limiting to a local mean is introduced. The resulting scheme can be proven total variation (1D) or maximum norm (multi D) stable and produces nice numerical results in the test cases. The result is summarized in the preprint entitled 'Nonlinearly Stable Compact Schemes for Shock Calculations', which was submitted to SIAM Journal on Numerical Analysis. Research was continued on issues related to two and three dimensional essentially non-oscillatory (ENO) schemes. The main research topics include: parallel implementation of ENO schemes on Connection Machines; boundary conditions; shock interaction with hydrogen bubbles, a preparation for the full combustion simulation; and direct numerical simulation of compressible sheared turbulence.
Position-based quantum cryptography over untrusted networks
NASA Astrophysics Data System (ADS)
Nadeem, Muhammad
2014-08-01
In this article, we propose quantum position verification (QPV) schemes where all the channels are untrusted except the position of the prover and distant reference stations of verifiers. We review and analyze the existing QPV schemes containing some pre-shared data between the prover and verifiers. Most of these schemes are based on non-cryptographic assumptions, i.e. quantum/classical channels between the verifiers are secure. It seems impractical in an environment fully controlled by adversaries and would lead to security compromise in practical implementations. However, our proposed formula for QPV is more robust, secure and according to the standard assumptions of cryptography. Furthermore, once the position of the prover is verified, our schemes establish secret keys in parallel and can be used for authentication and secret communication between the prover and verifiers.
Multi-thread parallel algorithm for reconstructing 3D large-scale porous structures
NASA Astrophysics Data System (ADS)
Ju, Yang; Huang, Yaohui; Zheng, Jiangtao; Qian, Xu; Xie, Heping; Zhao, Xi
2017-04-01
Geomaterials inherently contain many discontinuous, multi-scale, geometrically irregular pores, forming a complex porous structure that governs their mechanical and transport properties. The development of an efficient reconstruction method for representing porous structures can significantly contribute toward providing a better understanding of the governing effects of porous structures on the properties of porous materials. In order to improve the efficiency of reconstructing large-scale porous structures, a multi-thread parallel scheme was incorporated into the simulated annealing reconstruction method. In the method, four correlation functions, which include the two-point probability function, the linear-path functions for the pore phase and the solid phase, and the fractal system function for the solid phase, were employed for better reproduction of the complex well-connected porous structures. In addition, a random sphere packing method and a self-developed pre-conditioning method were incorporated to cast the initial reconstructed model and select independent interchanging pairs for parallel multi-thread calculation, respectively. The accuracy of the proposed algorithm was evaluated by examining the similarity between the reconstructed structure and a prototype in terms of their geometrical, topological, and mechanical properties. Comparisons of the reconstruction efficiency of porous models with various scales indicated that the parallel multi-thread scheme significantly shortened the execution time for reconstruction of a large-scale well-connected porous model compared to a sequential single-thread procedure.
Parareal in time 3D numerical solver for the LWR Benchmark neutron diffusion transient model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baudron, Anne-Marie, E-mail: anne-marie.baudron@cea.fr; CEA-DRN/DMT/SERMA, CEN-Saclay, 91191 Gif sur Yvette Cedex; Lautard, Jean-Jacques, E-mail: jean-jacques.lautard@cea.fr
2014-12-15
In this paper we present a time-parallel algorithm for the 3D neutrons calculation of a transient model in a nuclear reactor core. The neutrons calculation consists in numerically solving the time dependent diffusion approximation equation, which is a simplified transport equation. The numerical resolution is done with finite elements method based on a tetrahedral meshing of the computational domain, representing the reactor core, and time discretization is achieved using a θ-scheme. The transient model presents moving control rods during the time of the reaction. Therefore, cross-sections (piecewise constants) are taken into account by interpolations with respect to the velocity ofmore » the control rods. The parallelism across the time is achieved by an adequate use of the parareal in time algorithm to the handled problem. This parallel method is a predictor corrector scheme that iteratively combines the use of two kinds of numerical propagators, one coarse and one fine. Our method is made efficient by means of a coarse solver defined with large time step and fixed position control rods model, while the fine propagator is assumed to be a high order numerical approximation of the full model. The parallel implementation of our method provides a good scalability of the algorithm. Numerical results show the efficiency of the parareal method on large light water reactor transient model corresponding to the Langenbuch–Maurer–Werner benchmark.« less
Fractional Steps methods for transient problems on commodity computer architectures
NASA Astrophysics Data System (ADS)
Krotkiewski, M.; Dabrowski, M.; Podladchikov, Y. Y.
2008-12-01
Fractional Steps methods are suitable for modeling transient processes that are central to many geological applications. Low memory requirements and modest computational complexity facilitates calculations on high-resolution three-dimensional models. An efficient implementation of Alternating Direction Implicit/Locally One-Dimensional schemes for an Opteron-based shared memory system is presented. The memory bandwidth usage, the main bottleneck on modern computer architectures, is specially addressed. High efficiency of above 2 GFlops per CPU is sustained for problems of 1 billion degrees of freedom. The optimized sequential implementation of all 1D sweeps is comparable in execution time to copying the used data in the memory. Scalability of the parallel implementation on up to 8 CPUs is close to perfect. Performing one timestep of the Locally One-Dimensional scheme on a system of 1000 3 unknowns on 8 CPUs takes only 11 s. We validate the LOD scheme using a computational model of an isolated inclusion subject to a constant far field flux. Next, we study numerically the evolution of a diffusion front and the effective thermal conductivity of composites consisting of multiple inclusions and compare the results with predictions based on the differential effective medium approach. Finally, application of the developed parabolic solver is suggested for a real-world problem of fluid transport and reactions inside a reservoir.
A Method for Molecular Dynamics on Curved Surfaces
Paquay, Stefan; Kusters, Remy
2016-01-01
Dynamics simulations of constrained particles can greatly aid in understanding the temporal and spatial evolution of biological processes such as lateral transport along membranes and self-assembly of viruses. Most theoretical efforts in the field of diffusive transport have focused on solving the diffusion equation on curved surfaces, for which it is not tractable to incorporate particle interactions even though these play a crucial role in crowded systems. We show here that it is possible to take such interactions into account by combining standard constraint algorithms with the classical velocity Verlet scheme to perform molecular dynamics simulations of particles constrained to an arbitrarily curved surface. Furthermore, unlike Brownian dynamics schemes in local coordinates, our method is based on Cartesian coordinates, allowing for the reuse of many other standard tools without modifications, including parallelization through domain decomposition. We show that by applying the schemes to the Langevin equation for various surfaces, we obtain confined Brownian motion, which has direct applications to many biological and physical problems. Finally we present two practical examples that highlight the applicability of the method: 1) the influence of crowding and shape on the lateral diffusion of proteins in curved membranes; and 2) the self-assembly of a coarse-grained virus capsid protein model. PMID:27028633
Parallel computing using a Lagrangian formulation
NASA Technical Reports Server (NTRS)
Liou, May-Fun; Loh, Ching Yuen
1991-01-01
A new Lagrangian formulation of the Euler equation is adopted for the calculation of 2-D supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, a better than six times speed-up was achieved on a 8192-processor CM-2 over a single processor of a CRAY-2.
Parallel computing using a Lagrangian formulation
NASA Technical Reports Server (NTRS)
Liou, May-Fun; Loh, Ching-Yuen
1992-01-01
This paper adopts a new Lagrangian formulation of the Euler equation for the calculation of two dimensional supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, we have achieved better than six times speed-up on a 8192-processor CM-2 over a single processor of a CRAY-2.
Parallel implementation of approximate atomistic models of the AMOEBA polarizable model
NASA Astrophysics Data System (ADS)
Demerdash, Omar; Head-Gordon, Teresa
2016-11-01
In this work we present a replicated data hybrid OpenMP/MPI implementation of a hierarchical progression of approximate classical polarizable models that yields speedups of up to ∼10 compared to the standard OpenMP implementation of the exact parent AMOEBA polarizable model. In addition, our parallel implementation exhibits reasonable weak and strong scaling. The resulting parallel software will prove useful for those who are interested in how molecular properties converge in the condensed phase with respect to the MBE, it provides a fruitful test bed for exploring different electrostatic embedding schemes, and offers an interesting possibility for future exascale computing paradigms.
Parallel implementation of an adaptive scheme for 3D unstructured grids on the SP2
NASA Technical Reports Server (NTRS)
Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak
1996-01-01
Dynamic mesh adaption on unstructured grids is a powerful tool for computing unsteady flows that require local grid modifications to efficiently resolve solution features. For this work, we consider an edge-based adaption scheme that has shown good single-processor performance on the C90. We report on our experience parallelizing this code for the SP2. Results show a 47.0X speedup on 64 processors when 10 percent of the mesh is randomly refined. Performance deteriorates to 7.7X when the same number of edges are refined in a highly-localized region. This is because almost all the mesh adaption is confined to a single processor. However, this problem can be remedied by repartitioning the mesh immediately after targeting edges for refinement but before the actual adaption takes place. With this change, the speedup improves dramatically to 43.6X.
Parallel Implementation of an Adaptive Scheme for 3D Unstructured Grids on the SP2
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak; Strawn, Roger C.
1996-01-01
Dynamic mesh adaption on unstructured grids is a powerful tool for computing unsteady flows that require local grid modifications to efficiently resolve solution features. For this work, we consider an edge-based adaption scheme that has shown good single-processor performance on the C90. We report on our experience parallelizing this code for the SP2. Results show a 47.OX speedup on 64 processors when 10% of the mesh is randomly refined. Performance deteriorates to 7.7X when the same number of edges are refined in a highly-localized region. This is because almost all mesh adaption is confined to a single processor. However, this problem can be remedied by repartitioning the mesh immediately after targeting edges for refinement but before the actual adaption takes place. With this change, the speedup improves dramatically to 43.6X.
GANDALF - Graphical Astrophysics code for N-body Dynamics And Lagrangian Fluids
NASA Astrophysics Data System (ADS)
Hubber, D. A.; Rosotti, G. P.; Booth, R. A.
2018-01-01
GANDALF is a new hydrodynamics and N-body dynamics code designed for investigating planet formation, star formation and star cluster problems. GANDALF is written in C++, parallelized with both OPENMP and MPI and contains a PYTHON library for analysis and visualization. The code has been written with a fully object-oriented approach to easily allow user-defined implementations of physics modules or other algorithms. The code currently contains implementations of smoothed particle hydrodynamics, meshless finite-volume and collisional N-body schemes, but can easily be adapted to include additional particle schemes. We present in this paper the details of its implementation, results from the test suite, serial and parallel performance results and discuss the planned future development. The code is freely available as an open source project on the code-hosting website github at https://github.com/gandalfcode/gandalf and is available under the GPLv2 license.
Research on retailer data clustering algorithm based on Spark
NASA Astrophysics Data System (ADS)
Huang, Qiuman; Zhou, Feng
2017-03-01
Big data analysis is a hot topic in the IT field now. Spark is a high-reliability and high-performance distributed parallel computing framework for big data sets. K-means algorithm is one of the classical partition methods in clustering algorithm. In this paper, we study the k-means clustering algorithm on Spark. Firstly, the principle of the algorithm is analyzed, and then the clustering analysis is carried out on the supermarket customers through the experiment to find out the different shopping patterns. At the same time, this paper proposes the parallelization of k-means algorithm and the distributed computing framework of Spark, and gives the concrete design scheme and implementation scheme. This paper uses the two-year sales data of a supermarket to validate the proposed clustering algorithm and achieve the goal of subdividing customers, and then analyze the clustering results to help enterprises to take different marketing strategies for different customer groups to improve sales performance.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Man, Zhong-Xiao, E-mail: zxman@mail.qfnu.edu.cn; An, Nguyen Ba, E-mail: nban@iop.vast.ac.vn; Xia, Yun-Jie, E-mail: yjxia@mail.qfnu.edu.cn
In combination with the theories of open system and quantum recovering measurement, we propose a quantum state transfer scheme using spin chains by performing two sequential operations: a projective measurement on the spins of ‘environment’ followed by suitably designed quantum recovering measurements on the spins of interest. The scheme allows perfect transfer of arbitrary multispin states through multiple parallel spin chains with finite probability. Our scheme is universal in the sense that it is state-independent and applicable to any model possessing spin–spin interactions. We also present possible methods to implement the required measurements taking into account the current experimental technologies.more » As applications, we consider two typical models for which the probabilities of perfect state transfer are found to be reasonably high at optimally chosen moments during the time evolution. - Highlights: • Scheme that can achieve perfect quantum state transfer is devised. • The scheme is state-independent and applicable to any spin-interaction models. • The scheme allows perfect transfer of arbitrary multispin states. • Applications to two typical models are considered in detail.« less
Particle/Continuum Hybrid Simulation in a Parallel Computing Environment
NASA Technical Reports Server (NTRS)
Baganoff, Donald
1996-01-01
The objective of this study was to modify an existing parallel particle code based on the direct simulation Monte Carlo (DSMC) method to include a Navier-Stokes (NS) calculation so that a hybrid solution could be developed. In carrying out this work, it was determined that the following five issues had to be addressed before extensive program development of a three dimensional capability was pursued: (1) find a set of one-sided kinetic fluxes that are fully compatible with the DSMC method, (2) develop a finite volume scheme to make use of these one-sided kinetic fluxes, (3) make use of the one-sided kinetic fluxes together with DSMC type boundary conditions at a material surface so that velocity slip and temperature slip arise naturally for near-continuum conditions, (4) find a suitable sampling scheme so that the values of the one-sided fluxes predicted by the NS solution at an interface between the two domains can be converted into the correct distribution of particles to be introduced into the DSMC domain, (5) carry out a suitable number of tests to confirm that the developed concepts are valid, individually and in concert for a hybrid scheme.
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt
Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach and paper, the theoretical modeling and scalingmore » laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. Finally, these two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.« less
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes
Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt; ...
2017-11-27
Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach and paper, the theoretical modeling and scalingmore » laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. Finally, these two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.« less
GSHR-Tree: a spatial index tree based on dynamic spatial slot and hash table in grid environments
NASA Astrophysics Data System (ADS)
Chen, Zhanlong; Wu, Xin-cai; Wu, Liang
2008-12-01
Computation Grids enable the coordinated sharing of large-scale distributed heterogeneous computing resources that can be used to solve computationally intensive problems in science, engineering, and commerce. Grid spatial applications are made possible by high-speed networks and a new generation of Grid middleware that resides between networks and traditional GIS applications. The integration of the multi-sources and heterogeneous spatial information and the management of the distributed spatial resources and the sharing and cooperative of the spatial data and Grid services are the key problems to resolve in the development of the Grid GIS. The performance of the spatial index mechanism is the key technology of the Grid GIS and spatial database affects the holistic performance of the GIS in Grid Environments. In order to improve the efficiency of parallel processing of a spatial mass data under the distributed parallel computing grid environment, this paper presents a new grid slot hash parallel spatial index GSHR-Tree structure established in the parallel spatial indexing mechanism. Based on the hash table and dynamic spatial slot, this paper has improved the structure of the classical parallel R tree index. The GSHR-Tree index makes full use of the good qualities of R-Tree and hash data structure. This paper has constructed a new parallel spatial index that can meet the needs of parallel grid computing about the magnanimous spatial data in the distributed network. This arithmetic splits space in to multi-slots by multiplying and reverting and maps these slots to sites in distributed and parallel system. Each sites constructs the spatial objects in its spatial slot into an R tree. On the basis of this tree structure, the index data was distributed among multiple nodes in the grid networks by using large node R-tree method. The unbalance during process can be quickly adjusted by means of a dynamical adjusting algorithm. This tree structure has considered the distributed operation, reduplication operation transfer operation of spatial index in the grid environment. The design of GSHR-Tree has ensured the performance of the load balance in the parallel computation. This tree structure is fit for the parallel process of the spatial information in the distributed network environments. Instead of spatial object's recursive comparison where original R tree has been used, the algorithm builds the spatial index by applying binary code operation in which computer runs more efficiently, and extended dynamic hash code for bit comparison. In GSHR-Tree, a new server is assigned to the network whenever a split of a full node is required. We describe a more flexible allocation protocol which copes with a temporary shortage of storage resources. It uses a distributed balanced binary spatial tree that scales with insertions to potentially any number of storage servers through splits of the overloaded ones. The application manipulates the GSHR-Tree structure from a node in the grid environment. The node addresses the tree through its image that the splits can make outdated. This may generate addressing errors, solved by the forwarding among the servers. In this paper, a spatial index data distribution algorithm that limits the number of servers has been proposed. We improve the storage utilization at the cost of additional messages. The structure of GSHR-Tree is believed that the scheme of this grid spatial index should fit the needs of new applications using endlessly larger sets of spatial data. Our proposal constitutes a flexible storage allocation method for a distributed spatial index. The insertion policy can be tuned dynamically to cope with periods of storage shortage. In such cases storage balancing should be favored for better space utilization, at the price of extra message exchanges between servers. This structure makes a compromise in the updating of the duplicated index and the transformation of the spatial index data. Meeting the needs of the grid computing, GSHRTree has a flexible structure in order to satisfy new needs in the future. The GSHR-Tree provides the R-tree capabilities for large spatial datasets stored over interconnected servers. The analysis, including the experiments, confirmed the efficiency of our design choices. The scheme should fit the needs of new applications of spatial data, using endlessly larger datasets. Using the system response time of the parallel processing of spatial scope query algorithm as the performance evaluation factor, According to the result of the simulated the experiments, GSHR-Tree is performed to prove the reasonable design and the high performance of the indexing structure that the paper presented.
Dynamic performance of high speed solenoid valve with parallel coils
NASA Astrophysics Data System (ADS)
Kong, Xiaowu; Li, Shizhen
2014-07-01
The methods of improving the dynamic performance of high speed on/off solenoid valve include increasing the magnetic force of armature and the slew rate of coil current, decreasing the mass and stroke of moving parts. The increase of magnetic force usually leads to the decrease of current slew rate, which could increase the delay time of the dynamic response of solenoid valve. Using a high voltage to drive coil can solve this contradiction, but a high driving voltage can also lead to more cost and a decrease of safety and reliability. In this paper, a new scheme of parallel coils is investigated, in which the single coil of solenoid is replaced by parallel coils with same ampere turns. Based on the mathematic model of high speed solenoid valve, the theoretical formula for the delay time of solenoid valve is deduced. Both the theoretical analysis and the dynamic simulation show that the effect of dividing a single coil into N parallel sub-coils is close to that of driving the single coil with N times of the original driving voltage as far as the delay time of solenoid valve is concerned. A specific test bench is designed to measure the dynamic performance of high speed on/off solenoid valve. The experimental results also prove that both the delay time and switching time of the solenoid valves can be decreased greatly by adopting the parallel coil scheme. This research presents a simple and practical method to improve the dynamic performance of high speed on/off solenoid valve.
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-07-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310-323. doi: 10.1002/wcms.1220.
A Hybrid MPI/OpenMP Approach for Parallel Groundwater Model Calibration on Multicore Computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tang, Guoping; D'Azevedo, Ed F; Zhang, Fan
2010-01-01
Groundwater model calibration is becoming increasingly computationally time intensive. We describe a hybrid MPI/OpenMP approach to exploit two levels of parallelism in software and hardware to reduce calibration time on multicore computers with minimal parallelization effort. At first, HydroGeoChem 5.0 (HGC5) is parallelized using OpenMP for a uranium transport model with over a hundred species involving nearly a hundred reactions, and a field scale coupled flow and transport model. In the first application, a single parallelizable loop is identified to consume over 97% of the total computational time. With a few lines of OpenMP compiler directives inserted into the code,more » the computational time reduces about ten times on a compute node with 16 cores. The performance is further improved by selectively parallelizing a few more loops. For the field scale application, parallelizable loops in 15 of the 174 subroutines in HGC5 are identified to take more than 99% of the execution time. By adding the preconditioned conjugate gradient solver and BICGSTAB, and using a coloring scheme to separate the elements, nodes, and boundary sides, the subroutines for finite element assembly, soil property update, and boundary condition application are parallelized, resulting in a speedup of about 10 on a 16-core compute node. The Levenberg-Marquardt (LM) algorithm is added into HGC5 with the Jacobian calculation and lambda search parallelized using MPI. With this hybrid approach, compute nodes at the number of adjustable parameters (when the forward difference is used for Jacobian approximation), or twice that number (if the center difference is used), are used to reduce the calibration time from days and weeks to a few hours for the two applications. This approach can be extended to global optimization scheme and Monte Carol analysis where thousands of compute nodes can be efficiently utilized.« less
Zhang, Xiaohua; Wong, Sergio E; Lightstone, Felice C
2013-04-30
A mixed parallel scheme that combines message passing interface (MPI) and multithreading was implemented in the AutoDock Vina molecular docking program. The resulting program, named VinaLC, was tested on the petascale high performance computing (HPC) machines at Lawrence Livermore National Laboratory. To exploit the typical cluster-type supercomputers, thousands of docking calculations were dispatched by the master process to run simultaneously on thousands of slave processes, where each docking calculation takes one slave process on one node, and within the node each docking calculation runs via multithreading on multiple CPU cores and shared memory. Input and output of the program and the data handling within the program were carefully designed to deal with large databases and ultimately achieve HPC on a large number of CPU cores. Parallel performance analysis of the VinaLC program shows that the code scales up to more than 15K CPUs with a very low overhead cost of 3.94%. One million flexible compound docking calculations took only 1.4 h to finish on about 15K CPUs. The docking accuracy of VinaLC has been validated against the DUD data set by the re-docking of X-ray ligands and an enrichment study, 64.4% of the top scoring poses have RMSD values under 2.0 Å. The program has been demonstrated to have good enrichment performance on 70% of the targets in the DUD data set. An analysis of the enrichment factors calculated at various percentages of the screening database indicates VinaLC has very good early recovery of actives. Copyright © 2013 Wiley Periodicals, Inc.
2007-09-01
m (Cyt-m). We chose to study the oxidation of camphor to hydroxycamphor (Scheme 1) because it is the natural reaction for P450cam and there was...only one known reaction product. 10 O O HO camphor 5-exo-hydroxycamphor Scheme 1. The hydroxylation of camphor by P450cam, producing...phases, and 250 rpm. The oxidation of camphor to hydroxycamphor is 100% coupled with NADH oxidation, allowing for a direct correlation of NADH
NASA Astrophysics Data System (ADS)
Koldan, Jelena; Puzyrev, Vladimir; de la Puente, Josep; Houzeaux, Guillaume; Cela, José María
2014-06-01
We present an elaborate preconditioning scheme for Krylov subspace methods which has been developed to improve the performance and reduce the execution time of parallel node-based finite-element (FE) solvers for 3-D electromagnetic (EM) numerical modelling in exploration geophysics. This new preconditioner is based on algebraic multigrid (AMG) that uses different basic relaxation methods, such as Jacobi, symmetric successive over-relaxation (SSOR) and Gauss-Seidel, as smoothers and the wave front algorithm to create groups, which are used for a coarse-level generation. We have implemented and tested this new preconditioner within our parallel nodal FE solver for 3-D forward problems in EM induction geophysics. We have performed series of experiments for several models with different conductivity structures and characteristics to test the performance of our AMG preconditioning technique when combined with biconjugate gradient stabilized method. The results have shown that, the more challenging the problem is in terms of conductivity contrasts, ratio between the sizes of grid elements and/or frequency, the more benefit is obtained by using this preconditioner. Compared to other preconditioning schemes, such as diagonal, SSOR and truncated approximate inverse, the AMG preconditioner greatly improves the convergence of the iterative solver for all tested models. Also, when it comes to cases in which other preconditioners succeed to converge to a desired precision, AMG is able to considerably reduce the total execution time of the forward-problem code-up to an order of magnitude. Furthermore, the tests have confirmed that our AMG scheme ensures grid-independent rate of convergence, as well as improvement in convergence regardless of how big local mesh refinements are. In addition, AMG is designed to be a black-box preconditioner, which makes it easy to use and combine with different iterative methods. Finally, it has proved to be very practical and efficient in the parallel context.
Parallelized modelling and solution scheme for hierarchically scaled simulations
NASA Technical Reports Server (NTRS)
Padovan, Joe
1995-01-01
This two-part paper presents the results of a benchmarked analytical-numerical investigation into the operational characteristics of a unified parallel processing strategy for implicit fluid mechanics formulations. This hierarchical poly tree (HPT) strategy is based on multilevel substructural decomposition. The Tree morphology is chosen to minimize memory, communications and computational effort. The methodology is general enough to apply to existing finite difference (FD), finite element (FEM), finite volume (FV) or spectral element (SE) based computer programs without an extensive rewrite of code. In addition to finding large reductions in memory, communications, and computational effort associated with a parallel computing environment, substantial reductions are generated in the sequential mode of application. Such improvements grow with increasing problem size. Along with a theoretical development of general 2-D and 3-D HPT, several techniques for expanding the problem size that the current generation of computers are capable of solving, are presented and discussed. Among these techniques are several interpolative reduction methods. It was found that by combining several of these techniques that a relatively small interpolative reduction resulted in substantial performance gains. Several other unique features/benefits are discussed in this paper. Along with Part 1's theoretical development, Part 2 presents a numerical approach to the HPT along with four prototype CFD applications. These demonstrate the potential of the HPT strategy.
NASA Astrophysics Data System (ADS)
Wu, J.; Yang, Y.; Luo, Q.; Wu, J.
2012-12-01
This study presents a new hybrid multi-objective evolutionary algorithm, the niched Pareto tabu search combined with a genetic algorithm (NPTSGA), whereby the global search ability of niched Pareto tabu search (NPTS) is improved by the diversification of candidate solutions arose from the evolving nondominated sorting genetic algorithm II (NSGA-II) population. Also, the NPTSGA coupled with the commonly used groundwater flow and transport codes, MODFLOW and MT3DMS, is developed for multi-objective optimal design of groundwater remediation systems. The proposed methodology is then applied to a large-scale field groundwater remediation system for cleanup of large trichloroethylene (TCE) plume at the Massachusetts Military Reservation (MMR) in Cape Cod, Massachusetts. Furthermore, a master-slave (MS) parallelization scheme based on the Message Passing Interface (MPI) is incorporated into the NPTSGA to implement objective function evaluations in distributed processor environment, which can greatly improve the efficiency of the NPTSGA in finding Pareto-optimal solutions to the real-world application. This study shows that the MS parallel NPTSGA in comparison with the original NPTS and NSGA-II can balance the tradeoff between diversity and optimality of solutions during the search process and is an efficient and effective tool for optimizing the multi-objective design of groundwater remediation systems under complicated hydrogeologic conditions.
Proposed best modeling practices for assessing the effects of ecosystem restoration on fish
Rose, Kenneth A; Sable, Shaye; DeAngelis, Donald L.; Yurek, Simeon; Trexler, Joel C.; Graf, William L.; Reed, Denise J.
2015-01-01
Large-scale aquatic ecosystem restoration is increasing and is often controversial because of the economic costs involved, with the focus of the controversies gravitating to the modeling of fish responses. We present a scheme for best practices in selecting, implementing, interpreting, and reporting of fish modeling designed to assess the effects of restoration actions on fish populations and aquatic food webs. Previous best practice schemes that tended to be more general are summarized, and they form the foundation for our scheme that is specifically tailored for fish and restoration. We then present a 31-step scheme, with supporting text and narrative for each step, which goes from understanding how the results will be used through post-auditing to ensure the approach is used effectively in subsequent applications. We also describe 13 concepts that need to be considered in parallel to these best practice steps. Examples of these concepts include: life cycles and strategies; variability and uncertainty; nonequilibrium theory; biological, temporal, and spatial scaling; explicit versus implicit representation of processes; and model validation. These concepts are often not considered or not explicitly stated and casual treatment of them leads to mis-communication and mis-understandings, which in turn, often underlie the resulting controversies. We illustrate a subset of these steps, and their associated concepts, using the three case studies of Glen Canyon Dam on the Colorado River, the wetlands of coastal Louisiana, and the Everglades. Use of our proposed scheme will require investment of additional time and effort (and dollars) to be done effectively. We argue that such an investment is well worth it and will more than pay back in the long run in effective and efficient restoration actions and likely avoided controversies and legal proceedings.
Nyx: Adaptive mesh, massively-parallel, cosmological simulation code
NASA Astrophysics Data System (ADS)
Almgren, Ann; Beckner, Vince; Friesen, Brian; Lukic, Zarija; Zhang, Weiqun
2017-12-01
Nyx code solves equations of compressible hydrodynamics on an adaptive grid hierarchy coupled with an N-body treatment of dark matter. The gas dynamics in Nyx use a finite volume methodology on an adaptive set of 3-D Eulerian grids; dark matter is represented as discrete particles moving under the influence of gravity. Particles are evolved via a particle-mesh method, using Cloud-in-Cell deposition/interpolation scheme. Both baryonic and dark matter contribute to the gravitational field. In addition, Nyx includes physics for accurately modeling the intergalactic medium; in optically thin limits and assuming ionization equilibrium, the code calculates heating and cooling processes of the primordial-composition gas in an ionizing ultraviolet background radiation field.
HEP - A semaphore-synchronized multiprocessor with central control. [Heterogeneous Element Processor
NASA Technical Reports Server (NTRS)
Gilliland, M. C.; Smith, B. J.; Calvert, W.
1976-01-01
The paper describes the design concept of the Heterogeneous Element Processor (HEP), a system tailored to the special needs of scientific simulation. In order to achieve high-speed computation required by simulation, HEP features a hierarchy of processes executing in parallel on a number of processors, with synchronization being largely accomplished by hardware. A full-empty-reserve scheme of synchronization is realized by zero-one-valued hardware semaphores. A typical system has, besides the control computer and the scheduler, an algebraic module, a memory module, a first-in first-out (FIFO) module, an integrator module, and an I/O module. The architecture of the scheduler and the algebraic module is examined in detail.
Yang, L. H.; Brooks III, E. D.; Belak, J.
1992-01-01
A molecular dynamics algorithm for performing large-scale simulations using the Parallel C Preprocessor (PCP) programming paradigm on the BBN TC2000, a massively parallel computer, is discussed. The algorithm uses a linked-cell data structure to obtain the near neighbors of each atom as time evoles. Each processor is assigned to a geometric domain containing many subcells and the storage for that domain is private to the processor. Within this scheme, the interdomain (i.e., interprocessor) communication is minimized.
Linux Kernel Co-Scheduling For Bulk Synchronous Parallel Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jones, Terry R
2011-01-01
This paper describes a kernel scheduling algorithm that is based on co-scheduling principles and that is intended for parallel applications running on 1000 cores or more where inter-node scalability is key. Experimental results for a Linux implementation on a Cray XT5 machine are presented.1 The results indicate that Linux is a suitable operating system for this new scheduling scheme, and that this design provides a dramatic improvement in scaling performance for synchronizing collective operations at scale.
NASA Astrophysics Data System (ADS)
Chow, Sherman S. M.
Traceable signature scheme extends a group signature scheme with an enhanced anonymity management mechanism. The group manager can compute a tracing trapdoor which enables anyone to test if a signature is signed by a given misbehaving user, while the only way to do so for group signatures requires revealing the signer of all signatures. Nevertheless, it is not tracing in a strict sense. For all existing schemes, T tracing agents need to recollect all N' signatures ever produced and perform RN' “checks” for R revoked users. This involves a high volume of transfer and computations. Increasing T increases the degree of parallelism for tracing but also the probability of “missing” some signatures in case some of the agents are dishonest.
Differential pulse amplitude modulation for multiple-input single-output OWVLC
NASA Astrophysics Data System (ADS)
Yang, S. H.; Kwon, D. H.; Kim, S. J.; Son, Y. H.; Han, S. K.
2015-01-01
White light-emitting diodes (LEDs) are widely used for lighting due to their energy efficiency, eco-friendly, and small size than previously light sources such as incandescent, fluorescent bulbs and so on. Optical wireless visible light communication (OWVLC) based on LED merges lighting and communications in applications such as indoor lighting, traffic signals, vehicles, and underwater communications because LED can be easily modulated. However, physical bandwidth of LED is limited about several MHz by slow time constant of the phosphor and characteristics of device. Therefore, using the simplest modulation format which is non-return-zero on-off-keying (NRZ-OOK), the data rate reaches only to dozens Mbit/s. Thus, to improve the transmission capacity, optical filtering and pre-, post-equalizer are adapted. Also, high-speed wireless connectivity is implemented using spectrally efficient modulation methods: orthogonal frequency division multiplexing (OFDM) or discrete multi-tone (DMT). However, these modulation methods need additional digital signal processing such as FFT and IFFT, thus complexity of transmitter and receiver is increasing. To reduce the complexity of transmitter and receiver, we proposed a novel modulation scheme which is named differential pulse amplitude modulation. The proposed modulation scheme transmits different NRZ-OOK signals with same amplitude and unit time delay using each LED chip, respectively. The `N' parallel signals from LEDs are overlapped and directly detected at optical receiver. Received signal is demodulated by power difference between unit time slots. The proposed scheme can overcome the bandwidth limitation of LEDs and data rate can be improved according to number of LEDs without complex digital signal processing.
A 2D MTF approach to evaluate and guide dynamic imaging developments.
Chao, Tzu-Cheng; Chung, Hsiao-Wen; Hoge, W Scott; Madore, Bruno
2010-02-01
As the number and complexity of partially sampled dynamic imaging methods continue to increase, reliable strategies to evaluate performance may prove most useful. In the present work, an analytical framework to evaluate given reconstruction methods is presented. A perturbation algorithm allows the proposed evaluation scheme to perform robustly without requiring knowledge about the inner workings of the method being evaluated. A main output of the evaluation process consists of a two-dimensional modulation transfer function, an easy-to-interpret visual rendering of a method's ability to capture all combinations of spatial and temporal frequencies. Approaches to evaluate noise properties and artifact content at all spatial and temporal frequencies are also proposed. One fully sampled phantom and three fully sampled cardiac cine datasets were subsampled (R = 4 and 8) and reconstructed with the different methods tested here. A hybrid method, which combines the main advantageous features observed in our assessments, was proposed and tested in a cardiac cine application, with acceleration factors of 3.5 and 6.3 (skip factors of 4 and 8, respectively). This approach combines features from methods such as k-t sensitivity encoding, unaliasing by Fourier encoding the overlaps in the temporal dimension-sensitivity encoding, generalized autocalibrating partially parallel acquisition, sensitivity profiles from an array of coils for encoding and reconstruction in parallel, self, hybrid referencing with unaliasing by Fourier encoding the overlaps in the temporal dimension and generalized autocalibrating partially parallel acquisition, and generalized autocalibrating partially parallel acquisition-enhanced sensitivity maps for sensitivity encoding reconstructions.
Distributed Scene Analysis For Autonomous Road Vehicle Guidance
NASA Astrophysics Data System (ADS)
Mysliwetz, Birger D.; Dickmanns, E. D.
1987-01-01
An efficient distributed processing scheme has been developed for visual road boundary tracking by 'VaMoRs', a testbed vehicle for autonomous mobility and computer vision. Ongoing work described here is directed to improving the robustness of the road boundary detection process in the presence of shadows, ill-defined edges and other disturbing real world effects. The system structure and the techniques applied for real-time scene analysis are presented along with experimental results. All subfunctions of road boundary detection for vehicle guidance, such as edge extraction, feature aggregation and camera pointing control, are executed in parallel by an onboard multiprocessor system. On the image processing level local oriented edge extraction is performed in multiple 'windows', tighly controlled from a hierarchically higher, modelbased level. The interpretation process involving a geometric road model and the observer's relative position to the road boundaries is capable of coping with ambiguity in measurement data. By using only selected measurements to update the model parameters even high noise levels can be dealt with and misleading edges be rejected.
NASA Technical Reports Server (NTRS)
Greenberg, Albert G.; Lubachevsky, Boris D.; Nicol, David M.; Wright, Paul E.
1994-01-01
Fast, efficient parallel algorithms are presented for discrete event simulations of dynamic channel assignment schemes for wireless cellular communication networks. The driving events are call arrivals and departures, in continuous time, to cells geographically distributed across the service area. A dynamic channel assignment scheme decides which call arrivals to accept, and which channels to allocate to the accepted calls, attempting to minimize call blocking while ensuring co-channel interference is tolerably low. Specifically, the scheme ensures that the same channel is used concurrently at different cells only if the pairwise distances between those cells are sufficiently large. Much of the complexity of the system comes from ensuring this separation. The network is modeled as a system of interacting continuous time automata, each corresponding to a cell. To simulate the model, conservative methods are used; i.e., methods in which no errors occur in the course of the simulation and so no rollback or relaxation is needed. Implemented on a 16K processor MasPar MP-1, an elegant and simple technique provides speedups of about 15 times over an optimized serial simulation running on a high speed workstation. A drawback of this technique, typical of conservative methods, is that processor utilization is rather low. To overcome this, new methods were developed that exploit slackness in event dependencies over short intervals of time, thereby raising the utilization to above 50 percent and the speedup over the optimized serial code to about 120 times.
Quantum transport modelling of silicon nanobeams using heterogeneous computing scheme
DOE Office of Scientific and Technical Information (OSTI.GOV)
Harb, M., E-mail: harbm@physics.mcgill.ca; Michaud-Rioux, V., E-mail: vincentm@physics.mcgill.ca; Guo, H., E-mail: guo@physics.mcgill.ca
We report the development of a powerful method for quantum transport calculations of nanowire/nanobeam structures with large cross sectional area. Our approach to quantum transport is based on Green's functions and tight-binding potentials. A linear algebraic formulation allows us to harness the massively parallel nature of Graphics Processing Units (GPUs) and our implementation is based on a heterogeneous parallel computing scheme with traditional processors and GPUs working together. Using our software tool, the electronic and quantum transport properties of silicon nanobeams with a realistic cross sectional area of ∼22.7 nm{sup 2} and a length of ∼81.5 nm—comprising 105 000 Si atoms and 24 000more » passivating H atoms in the scattering region—are investigated. The method also allows us to perform significant averaging over impurity configurations—all possible configurations were considered in the case of single impurities. Finally, the effect of the position and number of vacancy defects on the transport properties was considered. It is found that the configurations with the vacancies lying closer to the local density of states (LDOS) maxima have lower transmission functions than the configurations with the vacancies located at LDOS minima or far away from LDOS maxima, suggesting both a qualitative method to tune or estimate optimal impurity configurations as well as a physical picture that accounts for device variability. Finally, we provide performance benchmarks for structures as large as ∼42.5 nm{sup 2} cross section and ∼81.5 nm length.« less
Fujita, Masahiko
2013-06-01
A new supervised learning theory is proposed for a hierarchical neural network with a single hidden layer of threshold units, which can approximate any continuous transformation, and applied to a cerebellar function to suppress the end-point variability of saccades. In motor systems, feedback control can reduce noise effects if the noise is added in a pathway from a motor center to a peripheral effector; however, it cannot reduce noise effects if the noise is generated in the motor center itself: a new control scheme is necessary for such noise. The cerebellar cortex is well known as a supervised learning system, and a novel theory of cerebellar cortical function developed in this study can explain the capability of the cerebellum to feedforwardly reduce noise effects, such as end-point variability of saccades. This theory assumes that a Golgi-granule cell system can encode the strength of a mossy fiber input as the state of neuronal activity of parallel fibers. By combining these parallel fiber signals with appropriate connection weights to produce a Purkinje cell output, an arbitrary continuous input-output relationship can be obtained. By incorporating such flexible computation and learning ability in a process of saccadic gain adaptation, a new control scheme in which the cerebellar cortex feedforwardly suppresses the end-point variability when it detects a variation in saccadic commands can be devised. Computer simulation confirmed the efficiency of such learning and showed a reduction in the variability of saccadic end points, similar to results obtained from experimental data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Boman, Erik G.
This LDRD project was a campus exec fellowship to fund (in part) Donald Nguyen’s PhD research at UT-Austin. His work has focused on parallel programming models, and scheduling irregular algorithms on shared-memory systems using the Galois framework. Galois provides a simple but powerful way for users and applications to automatically obtain good parallel performance using certain supported data containers. The naïve user can write serial code, while advanced users can optimize performance by advanced features, such as specifying the scheduling policy. Galois was used to parallelize two sparse matrix reordering schemes: RCM and Sloan. Such reordering is important in high-performancemore » computing to obtain better data locality and thus reduce run times.« less
NASA Astrophysics Data System (ADS)
Ishida, H.; Ota, Y.; Sekiguchi, M.; Sato, Y.
2016-12-01
A three-dimensional (3D) radiative transfer calculation scheme is developed to estimate horizontal transport of radiation energy in a very high resolution (with the order of 10 m in spatial grid) simulation of cloud evolution, especially for horizontally inhomogeneous clouds such as shallow cumulus and stratocumulus. Horizontal radiative transfer due to inhomogeneous clouds seems to cause local heating/cooling in an atmosphere with a fine spatial scale. It is, however, usually difficult to estimate the 3D effects, because the 3D radiative transfer often needs a large resource for computation compared to a plane-parallel approximation. This study attempts to incorporate a solution scheme that explicitly solves the 3D radiative transfer equation into a numerical simulation, because this scheme has an advantage in calculation for a sequence of time evolution (i.e., the scene at a time is little different from that at the previous time step). This scheme is also appropriate to calculation of radiation with strong absorption, such as the infrared regions. For efficient computation, this scheme utilizes several techniques, e.g., the multigrid method for iteration solution, and a correlated-k distribution method refined for efficient approximation of the wavelength integration. For a case study, the scheme is applied to an infrared broadband radiation calculation in a broken cloud field generated with a large eddy simulation model. The horizontal transport of infrared radiation, which cannot be estimated by the plane-parallel approximation, and its variation in time can be retrieved. The calculation result elucidates that the horizontal divergences and convergences of infrared radiation flux are not negligible, especially at the boundaries of clouds and within optically thin clouds, and the radiative cooling at lateral boundaries of clouds may reduce infrared radiative heating in clouds. In a future work, the 3D effects on radiative heating/cooling will be able to be included into atmospheric numerical models.
Gaussian mixtures on tensor fields for segmentation: applications to medical imaging.
de Luis-García, Rodrigo; Westin, Carl-Fredrik; Alberola-López, Carlos
2011-01-01
In this paper, we introduce a new approach for tensor field segmentation based on the definition of mixtures of Gaussians on tensors as a statistical model. Working over the well-known Geodesic Active Regions segmentation framework, this scheme presents several interesting advantages. First, it yields a more flexible model than the use of a single Gaussian distribution, which enables the method to better adapt to the complexity of the data. Second, it can work directly on tensor-valued images or, through a parallel scheme that processes independently the intensity and the local structure tensor, on scalar textured images. Two different applications have been considered to show the suitability of the proposed method for medical imaging segmentation. First, we address DT-MRI segmentation on a dataset of 32 volumes, showing a successful segmentation of the corpus callosum and favourable comparisons with related approaches in the literature. Second, the segmentation of bones from hand radiographs is studied, and a complete automatic-semiautomatic approach has been developed that makes use of anatomical prior knowledge to produce accurate segmentation results. Copyright © 2010 Elsevier Ltd. All rights reserved.
Addressing the Big-Earth-Data Variety Challenge with the Hierarchical Triangular Mesh
NASA Technical Reports Server (NTRS)
Rilee, Michael L.; Kuo, Kwo-Sen; Clune, Thomas; Oloso, Amidu; Brown, Paul G.; Yu, Honfeng
2016-01-01
We have implemented an updated Hierarchical Triangular Mesh (HTM) as the basis for a unified data model and an indexing scheme for geoscience data to address the variety challenge of Big Earth Data. We observe that, in the absence of variety, the volume challenge of Big Data is relatively easily addressable with parallel processing. The more important challenge in achieving optimal value with a Big Data solution for Earth Science (ES) data analysis, however, is being able to achieve good scalability with variety. With HTM unifying at least the three popular data models, i.e. Grid, Swath, and Point, used by current ES data products, data preparation time for integrative analysis of diverse datasets can be drastically reduced and better variety scaling can be achieved. In addition, since HTM is also an indexing scheme, when it is used to index all ES datasets, data placement alignment (or co-location) on the shared nothing architecture, which most Big Data systems are based on, is guaranteed and better performance is ensured. Moreover, our updated HTM encoding turns most geospatial set operations into integer interval operations, gaining further performance advantages.
NASA Astrophysics Data System (ADS)
Hayakawa, Hitoshi; Ogawa, Makoto; Shibata, Tadashi
2005-04-01
A very large scale integrated circuit (VLSI) architecture for a multiple-instruction-stream multiple-data-stream (MIMD) associative processor has been proposed. The processor employs an architecture that enables seamless switching from associative operations to arithmetic operations. The MIMD element is convertible to a regular central processing unit (CPU) while maintaining its high performance as an associative processor. Therefore, the MIMD associative processor can perform not only on-chip perception, i.e., searching for the vector most similar to an input vector throughout the on-chip cache memory, but also arithmetic and logic operations similar to those in ordinary CPUs, both simultaneously in parallel processing. Three key technologies have been developed to generate the MIMD element: associative-operation-and-arithmetic-operation switchable calculation units, a versatile register control scheme within the MIMD element for flexible operations, and a short instruction set for minimizing the memory size for program storage. Key circuit blocks were designed and fabricated using 0.18 μm complementary metal-oxide-semiconductor (CMOS) technology. As a result, the full-featured MIMD element is estimated to be 3 mm2, showing the feasibility of an 8-parallel-MIMD-element associative processor in a single chip of 5 mm× 5 mm.
A New Approach for Constructing Highly Stable High Order CESE Schemes
NASA Technical Reports Server (NTRS)
Chang, Sin-Chung
2010-01-01
A new approach is devised to construct high order CESE schemes which would avoid the common shortcomings of traditional high order schemes including: (a) susceptibility to computational instabilities; (b) computational inefficiency due to their local implicit nature (i.e., at each mesh points, need to solve a system of linear/nonlinear equations involving all the mesh variables associated with this mesh point); (c) use of large and elaborate stencils which complicates boundary treatments and also makes efficient parallel computing much harder; (d) difficulties in applications involving complex geometries; and (e) use of problem-specific techniques which are needed to overcome stability problems but often cause undesirable side effects. In fact it will be shown that, with the aid of a conceptual leap, one can build from a given 2nd-order CESE scheme its 4th-, 6th-, 8th-,... order versions which have the same stencil and same stability conditions of the 2nd-order scheme, and also retain all other advantages of the latter scheme. A sketch of multidimensional extensions will also be provided.
High Order Semi-Lagrangian Advection Scheme
NASA Astrophysics Data System (ADS)
Malaga, Carlos; Mandujano, Francisco; Becerra, Julian
2014-11-01
In most fluid phenomena, advection plays an important roll. A numerical scheme capable of making quantitative predictions and simulations must compute correctly the advection terms appearing in the equations governing fluid flow. Here we present a high order forward semi-Lagrangian numerical scheme specifically tailored to compute material derivatives. The scheme relies on the geometrical interpretation of material derivatives to compute the time evolution of fields on grids that deform with the material fluid domain, an interpolating procedure of arbitrary order that preserves the moments of the interpolated distributions, and a nonlinear mapping strategy to perform interpolations between undeformed and deformed grids. Additionally, a discontinuity criterion was implemented to deal with discontinuous fields and shocks. Tests of pure advection, shock formation and nonlinear phenomena are presented to show performance and convergence of the scheme. The high computational cost is considerably reduced when implemented on massively parallel architectures found in graphic cards. The authors acknowledge funding from Fondo Sectorial CONACYT-SENER Grant Number 42536 (DGAJ-SPI-34-170412-217).
ITG-TEM turbulence simulation with bounce-averaged kinetic electrons in tokamak geometry
NASA Astrophysics Data System (ADS)
Kwon, Jae-Min; Qi, Lei; Yi, S.; Hahm, T. S.
2017-06-01
We develop a novel numerical scheme to simulate electrostatic turbulence with kinetic electron responses in magnetically confined toroidal plasmas. Focusing on ion gyro-radius scale turbulences with slower frequencies than the time scales for electron parallel motions, we employ and adapt the bounce-averaged kinetic equation to model trapped electrons for nonlinear turbulence simulation with Coulomb collisions. Ions are modeled by employing the gyrokinetic equation. The newly developed scheme is implemented on a global δf particle in cell code gKPSP. By performing linear and nonlinear simulations, it is demonstrated that the new scheme can reproduce key physical properties of Ion Temperature Gradient (ITG) and Trapped Electron Mode (TEM) instabilities, and resulting turbulent transport. The overall computational cost of kinetic electrons using this novel scheme is limited to 200%-300% of the cost for simulations with adiabatic electrons. Therefore the new scheme allows us to perform kinetic simulations with trapped electrons very efficiently in magnetized plasmas.
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-01-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310–323. doi: 10.1002/wcms.1220 PMID:26753008
An Integrated Approach to Parameter Learning in Infinite-Dimensional Space
DOE Office of Scientific and Technical Information (OSTI.GOV)
Boyd, Zachary M.; Wendelberger, Joanne Roth
The availability of sophisticated modern physics codes has greatly extended the ability of domain scientists to understand the processes underlying their observations of complicated processes, but it has also introduced the curse of dimensionality via the many user-set parameters available to tune. Many of these parameters are naturally expressed as functional data, such as initial temperature distributions, equations of state, and controls. Thus, when attempting to find parameters that match observed data, being able to navigate parameter-space becomes highly non-trivial, especially considering that accurate simulations can be expensive both in terms of time and money. Existing solutions include batch-parallel simulations,more » high-dimensional, derivative-free optimization, and expert guessing, all of which make some contribution to solving the problem but do not completely resolve the issue. In this work, we explore the possibility of coupling together all three of the techniques just described by designing user-guided, batch-parallel optimization schemes. Our motivating example is a neutron diffusion partial differential equation where the time-varying multiplication factor serves as the unknown control parameter to be learned. We find that a simple, batch-parallelizable, random-walk scheme is able to make some progress on the problem but does not by itself produce satisfactory results. After reducing the dimensionality of the problem using functional principal component analysis (fPCA), we are able to track the progress of the solver in a visually simple way as well as viewing the associated principle components. This allows a human to make reasonable guesses about which points in the state space the random walker should try next. Thus, by combining the random walker's ability to find descent directions with the human's understanding of the underlying physics, it is possible to use expensive simulations more efficiently and more quickly arrive at the desired parameter set.« less
Convergence issues in domain decomposition parallel computation of hovering rotor
NASA Astrophysics Data System (ADS)
Xiao, Zhongyun; Liu, Gang; Mou, Bin; Jiang, Xiong
2018-05-01
Implicit LU-SGS time integration algorithm has been widely used in parallel computation in spite of its lack of information from adjacent domains. When applied to parallel computation of hovering rotor flows in a rotating frame, it brings about convergence issues. To remedy the problem, three LU factorization-based implicit schemes (consisting of LU-SGS, DP-LUR and HLU-SGS) are investigated comparatively. A test case of pure grid rotation is designed to verify these algorithms, which show that LU-SGS algorithm introduces errors on boundary cells. When partition boundaries are circumferential, errors arise in proportion to grid speed, accumulating along with the rotation, and leading to computational failure in the end. Meanwhile, DP-LUR and HLU-SGS methods show good convergence owing to boundary treatment which are desirable in domain decomposition parallel computations.
Li, Bingyi; Chen, Liang; Wei, Chunpeng; Xie, Yizhuang; Chen, He; Yu, Wenyue
2017-01-01
With the development of satellite load technology and very large scale integrated (VLSI) circuit technology, onboard real-time synthetic aperture radar (SAR) imaging systems have become a solution for allowing rapid response to disasters. A key goal of the onboard SAR imaging system design is to achieve high real-time processing performance with severe size, weight, and power consumption constraints. In this paper, we analyse the computational burden of the commonly used chirp scaling (CS) SAR imaging algorithm. To reduce the system hardware cost, we propose a partial fixed-point processing scheme. The fast Fourier transform (FFT), which is the most computation-sensitive operation in the CS algorithm, is processed with fixed-point, while other operations are processed with single precision floating-point. With the proposed fixed-point processing error propagation model, the fixed-point processing word length is determined. The fidelity and accuracy relative to conventional ground-based software processors is verified by evaluating both the point target imaging quality and the actual scene imaging quality. As a proof of concept, a field- programmable gate array—application-specific integrated circuit (FPGA-ASIC) hybrid heterogeneous parallel accelerating architecture is designed and realized. The customized fixed-point FFT is implemented using the 130 nm complementary metal oxide semiconductor (CMOS) technology as a co-processor of the Xilinx xc6vlx760t FPGA. A single processing board requires 12 s and consumes 21 W to focus a 50-km swath width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384. PMID:28672813
Yang, Chen; Li, Bingyi; Chen, Liang; Wei, Chunpeng; Xie, Yizhuang; Chen, He; Yu, Wenyue
2017-06-24
With the development of satellite load technology and very large scale integrated (VLSI) circuit technology, onboard real-time synthetic aperture radar (SAR) imaging systems have become a solution for allowing rapid response to disasters. A key goal of the onboard SAR imaging system design is to achieve high real-time processing performance with severe size, weight, and power consumption constraints. In this paper, we analyse the computational burden of the commonly used chirp scaling (CS) SAR imaging algorithm. To reduce the system hardware cost, we propose a partial fixed-point processing scheme. The fast Fourier transform (FFT), which is the most computation-sensitive operation in the CS algorithm, is processed with fixed-point, while other operations are processed with single precision floating-point. With the proposed fixed-point processing error propagation model, the fixed-point processing word length is determined. The fidelity and accuracy relative to conventional ground-based software processors is verified by evaluating both the point target imaging quality and the actual scene imaging quality. As a proof of concept, a field- programmable gate array-application-specific integrated circuit (FPGA-ASIC) hybrid heterogeneous parallel accelerating architecture is designed and realized. The customized fixed-point FFT is implemented using the 130 nm complementary metal oxide semiconductor (CMOS) technology as a co-processor of the Xilinx xc6vlx760t FPGA. A single processing board requires 12 s and consumes 21 W to focus a 50-km swath width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384.
ParaView visualization of Abaqus output on the mechanical deformation of complex microstructures
NASA Astrophysics Data System (ADS)
Liu, Qingbin; Li, Jiang; Liu, Jie
2017-02-01
Abaqus® is a popular software suite for finite element analysis. It delivers linear and nonlinear analyses of mechanical and fluid dynamics, includes multi-body system and multi-physics coupling. However, the visualization capability of Abaqus using its CAE module is limited. Models from microtomography have extremely complicated structures, and datasets of Abaqus output are huge, requiring a visualization tool more powerful than Abaqus/CAE. We convert Abaqus output into the XML-based VTK format by developing a Python script and then using ParaView to visualize the results. Such capabilities as volume rendering, tensor glyphs, superior animation and other filters allow ParaView to offer excellent visualizing manifestations. ParaView's parallel visualization makes it possible to visualize very big data. To support full parallel visualization, the Python script achieves data partitioning by reorganizing all nodes, elements and the corresponding results on those nodes and elements. The data partition scheme minimizes data redundancy and works efficiently. Given its good readability and extendibility, the script can be extended to the processing of more different problems in Abaqus. We share the script with Abaqus users on GitHub.
A dual-processor multi-frequency implementation of the FINDS algorithm
NASA Technical Reports Server (NTRS)
Godiwala, Pankaj M.; Caglayan, Alper K.
1987-01-01
This report presents a parallel processing implementation of the FINDS (Fault Inferring Nonlinear Detection System) algorithm on a dual processor configured target flight computer. First, a filter initialization scheme is presented which allows the no-fail filter (NFF) states to be initialized using the first iteration of the flight data. A modified failure isolation strategy, compatible with the new failure detection strategy reported earlier, is discussed and the performance of the new FDI algorithm is analyzed using flight recorded data from the NASA ATOPS B-737 aircraft in a Microwave Landing System (MLS) environment. The results show that low level MLS, IMU, and IAS sensor failures are detected and isolated instantaneously, while accelerometer and rate gyro failures continue to take comparatively longer to detect and isolate. The parallel implementation is accomplished by partitioning the FINDS algorithm into two parts: one based on the translational dynamics and the other based on the rotational kinematics. Finally, a multi-rate implementation of the algorithm is presented yielding significantly low execution times with acceptable estimation and FDI performance.
Kockmann, Tobias; Trachsel, Christian; Panse, Christian; Wahlander, Asa; Selevsek, Nathalie; Grossmann, Jonas; Wolski, Witold E; Schlapbach, Ralph
2016-08-01
Quantitative mass spectrometry is a rapidly evolving methodology applied in a large number of omics-type research projects. During the past years, new designs of mass spectrometers have been developed and launched as commercial systems while in parallel new data acquisition schemes and data analysis paradigms have been introduced. Core facilities provide access to such technologies, but also actively support the researchers in finding and applying the best-suited analytical approach. In order to implement a solid fundament for this decision making process, core facilities need to constantly compare and benchmark the various approaches. In this article we compare the quantitative accuracy and precision of current state of the art targeted proteomics approaches single reaction monitoring (SRM), parallel reaction monitoring (PRM) and data independent acquisition (DIA) across multiple liquid chromatography mass spectrometry (LC-MS) platforms, using a readily available commercial standard sample. All workflows are able to reproducibly generate accurate quantitative data. However, SRM and PRM workflows show higher accuracy and precision compared to DIA approaches, especially when analyzing low concentrated analytes. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
NASA Astrophysics Data System (ADS)
Sadovskaya, Oxana; Sadovskii, Vladimir
2017-04-01
Under modeling the wave propagation processes in geomaterials (granular and porous media, soils and rocks) it is necessary to take into account the structural inhomogeneity of these materials. Parallel program systems for numerical solution of 2D and 3D problems of the dynamics of deformable media with constitutive relationships of rather general form on the basis of universal mathematical model describing small strains of elastic, elastic-plastic, granular and porous materials are worked out. In the case of an elastic material, the model is reduced to the system of equations, hyperbolic by Friedrichs, written in terms of velocities and stresses in a symmetric form. In the case of an elastic-plastic material, the model is a special formulation of the Prandtl-Reuss theory in the form of variational inequality with one-sided constraints on the stress tensor. Generalization of the model to describe granularity and the collapse of pores is obtained by means of the rheological approach, taking into account different resistance of materials to tension and compression. Rotational motion of particles in the material microstructure is considered within the framework of a mathematical model of the Cosserat continuum. Computational domain may have a blocky structure, composed of an arbitrary number of layers, strips in a layer and blocks in a strip from different materials with self-consistent curvilinear interfaces. At the external boundaries of computational domain the main types of dissipative boundary conditions in terms of velocities, stresses or mixed boundary conditions can be given. Shock-capturing algorithm is proposed for implementation of the model on supercomputers with cluster architecture. It is based on the two-cyclic splitting method with respect to spatial variables and the special procedures of the stresses correction to take into account plasticity, granularity or porosity of a material. An explicit monotone ENO-scheme is applied for solving one-dimensional systems of equations at the stages of splitting method. The parallelizing of computations is carried out using the MPI library and the SPMD technology. The data exchange between processors occurs at step "predictor" of the finite-difference scheme. Program systems allow simulate the propagation of waves produced by external mechanical effects in a medium, aggregated of arbitrary number of heterogeneous blocks. Some computations of dynamic problems with and without taking into account the moment properties of a material were performed on clusters of ICM SB RAS (Krasnoyarsk) and JSCC RAS (Moscow). Parallel program systems 2Dyn_Granular, 3Dyn_Granular, 2Dyn_Cosserat, 3Dyn_Cosserat and 2Dyn_Blocks_MPI for numerical solution of 2D and 3D elastic-plastic problems of the dynamics of granular media and problems of the Cosserat elasticity theory, as well as for modeling of the dynamic processes in multi-blocky media with pliant viscoelastic, porous and fluid-saturated interlayers on cluster systems were registered by Rospatent.
NASA Astrophysics Data System (ADS)
Raeli, Alice; Bergmann, Michel; Iollo, Angelo
2018-02-01
We consider problems governed by a linear elliptic equation with varying coefficients across internal interfaces. The solution and its normal derivative can undergo significant variations through these internal boundaries. We present a compact finite-difference scheme on a tree-based adaptive grid that can be efficiently solved using a natively parallel data structure. The main idea is to optimize the truncation error of the discretization scheme as a function of the local grid configuration to achieve second-order accuracy. Numerical illustrations are presented in two and three-dimensional configurations.
Programming scheme based optimization of hybrid 4T-2R OxRAM NVSRAM
NASA Astrophysics Data System (ADS)
Majumdar, Swatilekha; Kingra, Sandeep Kaur; Suri, Manan
2017-09-01
In this paper, we present a novel single-cycle programming scheme for 4T-2R NVSRAM, exploiting pulse engineered input signals. OxRAM devices based on 3 nm thick bi-layer active switching oxide and 90 nm CMOS technology node were used for all simulations. The cell design is implemented for real-time non-volatility rather than last-bit, or power-down non-volatility. Detailed analysis of the proposed single-cycle, parallel RRAM device programming scheme is presented in comparison to the two-cycle sequential RRAM programming used for similar 4T-2R NVSRAM bit-cells. The proposed single-cycle programming scheme coupled with the 4T-2R architecture leads to several benefits such as- possibility of unconventional transistor sizing, 50% lower latency, 20% improvement in SNM and ∼20× reduced energy requirements, when compared against two-cycle programming approach.
NASA Technical Reports Server (NTRS)
Hussaini, M. Y. (Editor); Kumar, A. (Editor); Salas, M. D. (Editor)
1993-01-01
The purpose here is to assess the state of the art in the areas of numerical analysis that are particularly relevant to computational fluid dynamics (CFD), to identify promising new developments in various areas of numerical analysis that will impact CFD, and to establish a long-term perspective focusing on opportunities and needs. Overviews are given of discretization schemes, computational fluid dynamics, algorithmic trends in CFD for aerospace flow field calculations, simulation of compressible viscous flow, and massively parallel computation. Also discussed are accerelation methods, spectral and high-order methods, multi-resolution and subcell resolution schemes, and inherently multidimensional schemes.
A conjugate gradient method for solving the non-LTE line radiation transfer problem
NASA Astrophysics Data System (ADS)
Paletou, F.; Anterrieu, E.
2009-12-01
This study concerns the fast and accurate solution of the line radiation transfer problem, under non-LTE conditions. We propose and evaluate an alternative iterative scheme to the classical ALI-Jacobi method, and to the more recently proposed Gauss-Seidel and successive over-relaxation (GS/SOR) schemes. Our study is indeed based on applying a preconditioned bi-conjugate gradient method (BiCG-P). Standard tests, in 1D plane parallel geometry and in the frame of the two-level atom model with monochromatic scattering are discussed. Rates of convergence between the previously mentioned iterative schemes are compared, as are their respective timing properties. The smoothing capability of the BiCG-P method is also demonstrated.
Distributed polar-coded OFDM based on Plotkin's construction for half duplex wireless communication
NASA Astrophysics Data System (ADS)
Umar, Rahim; Yang, Fengfan; Mughal, Shoaib; Xu, HongJun
2018-07-01
A Plotkin-based polar-coded orthogonal frequency division multiplexing (P-PC-OFDM) scheme is proposed and its bit error rate (BER) performance over additive white gaussian noise (AWGN), frequency selective Rayleigh, Rician and Nakagami-m fading channels has been evaluated. The considered Plotkin's construction possesses a parallel split in its structure, which motivated us to extend the proposed P-PC-OFDM scheme in a coded cooperative scenario. As the relay's effective collaboration has always been pivotal in the design of cooperative communication therefore, an efficient selection criterion for choosing the information bits has been inculcated at the relay node. To assess the BER performance of the proposed cooperative scheme, we have also upgraded conventional polar-coded cooperative scheme in the context of OFDM as an appropriate bench marker. The Monte Carlo simulated results revealed that the proposed Plotkin-based polar-coded cooperative OFDM scheme convincingly outperforms the conventional polar-coded cooperative OFDM scheme by 0.5 0.6 dBs over AWGN channel. This prominent gain in BER performance is made possible due to the bit-selection criteria and the joint successive cancellation decoding adopted at the relay and the destination nodes, respectively. Furthermore, the proposed coded cooperative schemes outperform their corresponding non-cooperative schemes by a gain of 1 dB under an identical condition.
Large-Scale Parallel Viscous Flow Computations using an Unstructured Multigrid Algorithm
NASA Technical Reports Server (NTRS)
Mavriplis, Dimitri J.
1999-01-01
The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS (Full Approximation Scheme) approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI Origin 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.
A Novel Passive Tracking Scheme Exploiting Geometric and Intercept Theorems
Zhou, Biao; Sun, Chao; Ahn, Deockhyeon; Kim, Youngok
2018-01-01
Passive tracking aims to track targets without assistant devices, that is, device-free targets. Passive tracking based on Radio Frequency (RF) Tomography in wireless sensor networks has recently been addressed as an emerging field. The passive tracking scheme using geometric theorems (GTs) is one of the most popular RF Tomography schemes, because the GT-based method can effectively mitigate the demand for a high density of wireless nodes. In the GT-based tracking scheme, the tracking scenario is considered as a two-dimensional geometric topology and then geometric theorems are applied to estimate crossing points (CPs) of the device-free target on line-of-sight links (LOSLs), which reveal the target’s trajectory information in a discrete form. In this paper, we review existing GT-based tracking schemes, and then propose a novel passive tracking scheme by exploiting the Intercept Theorem (IT). To create an IT-based CP estimation scheme available in the noisy non-parallel LOSL situation, we develop the equal-ratio traverse (ERT) method. Finally, we analyze properties of three GT-based tracking algorithms and the performance of these schemes is evaluated experimentally under various trajectories, node densities, and noisy topologies. Analysis of experimental results shows that tracking schemes exploiting geometric theorems can achieve remarkable positioning accuracy even under rather a low density of wireless nodes. Moreover, the proposed IT scheme can provide generally finer tracking accuracy under even lower node density and noisier topologies, in comparison to other schemes. PMID:29562621
Ghosh, Arup; Qin, Shiming; Lee, Jooyeoun; Wang, Gi-Nam
2016-01-01
Operational faults and behavioural anomalies associated with PLC control processes take place often in a manufacturing system. Real time identification of these operational faults and behavioural anomalies is necessary in the manufacturing industry. In this paper, we present an automated tool, called PLC Log-Data Analysis Tool (PLAT) that can detect them by using log-data records of the PLC signals. PLAT automatically creates a nominal model of the PLC control process and employs a novel hash table based indexing and searching scheme to satisfy those purposes. Our experiments show that PLAT is significantly fast, provides real time identification of operational faults and behavioural anomalies, and can execute within a small memory footprint. In addition, PLAT can easily handle a large manufacturing system with a reasonable computing configuration and can be installed in parallel to the data logging system to identify operational faults and behavioural anomalies effectively.
Ghosh, Arup; Qin, Shiming; Lee, Jooyeoun
2016-01-01
Operational faults and behavioural anomalies associated with PLC control processes take place often in a manufacturing system. Real time identification of these operational faults and behavioural anomalies is necessary in the manufacturing industry. In this paper, we present an automated tool, called PLC Log-Data Analysis Tool (PLAT) that can detect them by using log-data records of the PLC signals. PLAT automatically creates a nominal model of the PLC control process and employs a novel hash table based indexing and searching scheme to satisfy those purposes. Our experiments show that PLAT is significantly fast, provides real time identification of operational faults and behavioural anomalies, and can execute within a small memory footprint. In addition, PLAT can easily handle a large manufacturing system with a reasonable computing configuration and can be installed in parallel to the data logging system to identify operational faults and behavioural anomalies effectively. PMID:27974882
unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance
Fiske, Ian J.; Chandler, Richard B.
2011-01-01
Ecological research uses data collection techniques that are prone to substantial and unique types of measurement error to address scientific questions about species abundance and distribution. These data collection schemes include a number of survey methods in which unmarked individuals are counted, or determined to be present, at spatially- referenced sites. Examples include site occupancy sampling, repeated counts, distance sampling, removal sampling, and double observer sampling. To appropriately analyze these data, hierarchical models have been developed to separately model explanatory variables of both a latent abundance or occurrence process and a conditional detection process. Because these models have a straightforward interpretation paralleling mechanisms under which the data arose, they have recently gained immense popularity. The common hierarchical structure of these models is well-suited for a unified modeling interface. The R package unmarked provides such a unified modeling framework, including tools for data exploration, model fitting, model criticism, post-hoc analysis, and model comparison.
CMOS serial link for fully duplexed data communication
NASA Astrophysics Data System (ADS)
Lee, Kyeongho; Kim, Sungjoon; Ahn, Gijung; Jeong, Deog-Kyoon
1995-04-01
This paper describes a CMOS serial link allowing fully duplexed 500 Mbaud serial data communication. The CMOS serial link is a robust and low-cost solution to high data rate requirements. A central charge pump PLL for generating multiphase clocks for oversampling is shared by several serial link channels. Fully duplexed serial data communication is realized in the bidirectional bridge by separating incoming data from the mixed signal on the cable end. The digital PLL accomplishes process-independent data recovery by using a low-ratio oversampling, a majority voting, and a parallel data recovery scheme. Mostly, digital approach could extend its bandwidth further with scaled CMOS technology. A single channel serial link and a charge pump PLL are integrated in a test chip using 1.2 micron CMOS process technology. The test chip confirms upto 500 Mbaud unidirectional mode operation and 320 Mbaud fully duplexed mode operation with pseudo random data patterns.
Gaussian process surrogates for failure detection: A Bayesian experimental design approach
NASA Astrophysics Data System (ADS)
Wang, Hongqiao; Lin, Guang; Li, Jinglai
2016-05-01
An important task of uncertainty quantification is to identify the probability of undesired events, in particular, system failures, caused by various sources of uncertainties. In this work we consider the construction of Gaussian process surrogates for failure detection and failure probability estimation. In particular, we consider the situation that the underlying computer models are extremely expensive, and in this setting, determining the sampling points in the state space is of essential importance. We formulate the problem as an optimal experimental design for Bayesian inferences of the limit state (i.e., the failure boundary) and propose an efficient numerical scheme to solve the resulting optimization problem. In particular, the proposed limit-state inference method is capable of determining multiple sampling points at a time, and thus it is well suited for problems where multiple computer simulations can be performed in parallel. The accuracy and performance of the proposed method is demonstrated by both academic and practical examples.
Kinetics and evolved gas analysis for pyrolysis of food processing wastes using TGA/MS/FT-IR.
Özsin, Gamzenur; Pütün, Ayşe Eren
2017-06-01
The objective of this study was to identify the pyrolysis of different bio-waste produced by food processing industry in a comprehensible manner. For this purpose, pyrolysis behaviors of chestnut shells (CNS), cherry stones (CS) and grape seeds (GS) were investigated by thermogravimetric analysis (TGA) combined with a Fourier-transform infrared (FT-IR) spectrometer and a mass spectrometer (MS). In order to make available theoretical groundwork for biomass pyrolysis, activation energies were calculated with the help of four different model-free kinetic methods. The results are attributed to the complex reaction schemes which imply parallel, competitive and complex reactions during pyrolysis. During pyrolysis, the evolution of volatiles was also characterized by FT-IR and MS. The main evolved gases were determined as H 2 O, CO 2 and hydrocarbons such as CH 4 and temperature dependent profiles of the species were obtained. Copyright © 2017 Elsevier Ltd. All rights reserved.
NASA Technical Reports Server (NTRS)
Rajagopalan, J.; Xing, K.; Guo, Y.; Lee, F. C.; Manners, Bruce
1996-01-01
A simple, application-oriented, transfer function model of paralleled converters employing Master-Slave Current-sharing (MSC) control is developed. Dynamically, the Master converter retains its original design characteristics; all the Slave converters are forced to depart significantly from their original design characteristics into current-controlled current sources. Five distinct loop gains to assess system stability and performance are identified and their physical significance is described. A design methodology for the current share compensator is presented. The effect of this current sharing scheme on 'system output impedance' is analyzed.
NASA Astrophysics Data System (ADS)
Li, Xinhua; Song, Zhenyu; Zhan, Yongjie; Wu, Qiongzhi
2009-12-01
Since the system capacity is severely limited, reducing the multiple access interfere (MAI) is necessary in the multiuser direct-sequence code division multiple access (DS-CDMA) system which is used in the telecommunication terminals data-transferred link system. In this paper, we adopt an adaptive multistage parallel interference cancellation structure in the demodulator based on the least mean square (LMS) algorithm to eliminate the MAI on the basis of overviewing various of multiuser dectection schemes. Neither a training sequence nor a pilot signal is needed in the proposed scheme, and its implementation complexity can be greatly reduced by a LMS approximate algorithm. The algorithm and its FPGA implementation is then derived. Simulation results of the proposed adaptive PIC can outperform some of the existing interference cancellation methods in AWGN channels. The hardware setup of mutiuser demodulator is described, and the experimental results based on it demonstrate that the simulation results shows large performance gains over the conventional single-user demodulator.
Compile-time estimation of communication costs in multicomputers
NASA Technical Reports Server (NTRS)
Gupta, Manish; Banerjee, Prithviraj
1991-01-01
An important problem facing numerous research projects on parallelizing compilers for distributed memory machines is that of automatically determining a suitable data partitioning scheme for a program. Any strategy for automatic data partitioning needs a mechanism for estimating the performance of a program under a given partitioning scheme, the most crucial part of which involves determining the communication costs incurred by the program. A methodology is described for estimating the communication costs at compile-time as functions of the numbers of processors over which various arrays are distributed. A strategy is described along with its theoretical basis, for making program transformations that expose opportunities for combining of messages, leading to considerable savings in the communication costs. For certain loops with regular dependences, the compiler can detect the possibility of pipelining, and thus estimate communication costs more accurately than it could otherwise. These results are of great significance to any parallelization system supporting numeric applications on multicomputers. In particular, they lay down a framework for effective synthesis of communication on multicomputers from sequential program references.
Evaluation of Hamaker coefficients using Diffusion Monte Carlo method
NASA Astrophysics Data System (ADS)
Maezono, Ryo; Hongo, Kenta
We evaluated the Hamaker's constant for Cyclohexasilane to investigate its wettability, which is used as an ink of 'liquid silicon' in 'printed electronics'. Taking three representative geometries of the dimer coalescence (parallel, lined, and T-shaped), we evaluated these binding curves using diffusion Monte Carlo method. The parallel geometry gave the most long-ranged exponent, ~ 1 /r6 , in its asymptotic behavior. Evaluated binding lengths are fairly consistent with the experimental density of the molecule. The fitting of the asymptotic curve gave an estimation of Hamaker's constant being around 100 [zJ]. We also performed a CCSD(T) evaluation and got almost similar result. To check its justification, we applied the same scheme to Benzene and compared the estimation with those by other established methods, Lifshitz theory and SAPT (Symmetry-adopted perturbation theory). The result by the fitting scheme turned to be twice larger than those by Lifshitz and SAPT, both of which coincide with each other. It is hence implied that the present evaluation for Cyclohexasilane would be overestimated.
New scheme for image edge detection using the switching mechanism of nonlinear optical material
NASA Astrophysics Data System (ADS)
Pahari, Nirmalya; Mukhopadhyay, Sourangshu
2006-03-01
The limitations of electronics in conducting parallel arithmetic, algebraic, and logic processing are well known. Very high-speed (terahertz) performance cannot be expected in conventional electronic mechanisms. To achieve such performance we can introduce optics instead of electronics for information processing, computing, and data handling. Nonlinear optical material (NOM) is a successful candidate in this regard to play a major role in the domain of optically controlled switching systems. The character of some NOMs is such as to reflect the probe beam in the presence of two read beams (or pump beams) exciting the material from opposite directions, using the principle of four-wave mixing. In image processing, edge extraction from an image is an important and essential task. Several optical methods of digital image processing are used for properly evaluating the image edges. We propose here a new method of image edge detection, extraction, and enhancement by use of AND-based switching operations with NOM. In this process we have used the optically inverted image of a supplied image. This can be obtained by the EXOR switching operation of the NOM.
Lee, Kenneth K C; Mariampillai, Adrian; Yu, Joe X Z; Cadotte, David W; Wilson, Brian C; Standish, Beau A; Yang, Victor X D
2012-07-01
Advances in swept source laser technology continues to increase the imaging speed of swept-source optical coherence tomography (SS-OCT) systems. These fast imaging speeds are ideal for microvascular detection schemes, such as speckle variance (SV), where interframe motion can cause severe imaging artifacts and loss of vascular contrast. However, full utilization of the laser scan speed has been hindered by the computationally intensive signal processing required by SS-OCT and SV calculations. Using a commercial graphics processing unit that has been optimized for parallel data processing, we report a complete high-speed SS-OCT platform capable of real-time data acquisition, processing, display, and saving at 108,000 lines per second. Subpixel image registration of structural images was performed in real-time prior to SV calculations in order to reduce decorrelation from stationary structures induced by the bulk tissue motion. The viability of the system was successfully demonstrated in a high bulk tissue motion scenario of human fingernail root imaging where SV images (512 × 512 pixels, n = 4) were displayed at 54 frames per second.
Toluene in sewage and sludge in wastewater treatment plants.
Mrowiec, Bozena
2014-01-01
Toluene is a compound that often occurs in municipal wastewater ranging from detectable levels up to 237 μg/L. Before the year 2000, the presence of the aromatic hydrocarbons was assigned only to external sources. The Enhanced Biological Nutrients Removal Processes (EBNRP) work according to many different schemes and technologies. For high-efficiency biological denitrification and dephosphatation processes, the presence of volatile fatty acids (VFAs) in sewage is required. VFAs are the main product of organic matter hydrolysis from sewage sludge. However, no attention has been given to other products of the process. It has been found that in parallel to VFA production, toluene formation occurred. The formation of toluene in municipal anaerobic sludge digestion processes was investigated. Experiments were performed on a laboratory scale using sludge from primary and secondary settling tanks of municipal treatment plants. The concentration of toluene in the digested sludge from primary settling tanks was found to be about 42,000 μg/L. The digested sludge supernatant liquor returned to the biological dephosphatation and denitrification processes for sewage enrichment can contain up to 16,500 μg/L of toluene.
Tough2{_}MP: A parallel version of TOUGH2
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Keni; Wu, Yu-Shu; Ding, Chris
2003-04-09
TOUGH2{_}MP is a massively parallel version of TOUGH2. It was developed for running on distributed-memory parallel computers to simulate large simulation problems that may not be solved by the standard, single-CPU TOUGH2 code. The new code implements an efficient massively parallel scheme, while preserving the full capacity and flexibility of the original TOUGH2 code. The new software uses the METIS software package for grid partitioning and AZTEC software package for linear-equation solving. The standard message-passing interface is adopted for communication among processors. Numerical performance of the current version code has been tested on CRAY-T3E and IBM RS/6000 SP platforms. Inmore » addition, the parallel code has been successfully applied to real field problems of multi-million-cell simulations for three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport. In this paper, we will review the development of the TOUGH2{_}MP, and discuss the basic features, modules, and their applications.« less
Kasahara, Kota; Ma, Benson; Goto, Kota; Dasgupta, Bhaskar; Higo, Junichi; Fukuda, Ikuo; Mashimo, Tadaaki; Akiyama, Yutaka; Nakamura, Haruki
2016-01-01
Molecular dynamics (MD) is a promising computational approach to investigate dynamical behavior of molecular systems at the atomic level. Here, we present a new MD simulation engine named "myPresto/omegagene" that is tailored for enhanced conformational sampling methods with a non-Ewald electrostatic potential scheme. Our enhanced conformational sampling methods, e.g. , the virtual-system-coupled multi-canonical MD (V-McMD) method, replace a multi-process parallelized run with multiple independent runs to avoid inter-node communication overhead. In addition, adopting the non-Ewald-based zero-multipole summation method (ZMM) makes it possible to eliminate the Fourier space calculations altogether. The combination of these state-of-the-art techniques realizes efficient and accurate calculations of the conformational ensemble at an equilibrium state. By taking these advantages, myPresto/omegagene is specialized for the single process execution with Graphics Processing Unit (GPU). We performed benchmark simulations for the 20-mer peptide, Trp-cage, with explicit solvent. One of the most thermodynamically stable conformations generated by the V-McMD simulation is very similar to an experimentally solved native conformation. Furthermore, the computation speed is four-times faster than that of our previous simulation engine, myPresto/psygene-G. The new simulator, myPresto/omegagene, is freely available at the following URLs: http://www.protein.osaka-u.ac.jp/rcsfp/pi/omegagene/ and http://presto.protein.osaka-u.ac.jp/myPresto4/.
Beam delivery system with a non-digitized diffractive beam splitter for laser-drilling of silicon
NASA Astrophysics Data System (ADS)
Amako, J.; Fujii, E.
2016-02-01
We report a beam-delivery system consisting of a non-digitized diffractive beam splitter and a Fourier transform lens. The system is applied to the deep-drilling of silicon using a nanosecond pulse laser in the manufacture of inkjet printer heads. In this process, a circularly polarized pulse beam is divided into an array of uniform beams, which are then delivered precisely to the process points. To meet these requirements, the splitter was designed to be polarization-independent with an efficiency>95%. The optical elements were assembled so as to allow the fine tuning of the effective overall focal length by adjusting the wavefront curvature of the beam. Using the system, a beam alignment accuracy of<5 μm was achieved for a 12-mm-wide beam array and the throughput was substantially improved (10,000 points on a silicon wafer drilled in ~1 min). This beam-delivery scheme works for a variety of laser applications that require parallel processing.
Enhanced High Performance Power Compensation Methodology by IPFC Using PIGBT-IDVR
Arumugom, Subramanian; Rajaram, Marimuthu
2015-01-01
Currently, power systems are involuntarily controlled without high speed control and are frequently initiated, therefore resulting in a slow process when compared with static electronic devices. Among various power interruptions in power supply systems, voltage dips play a central role in causing disruption. The dynamic voltage restorer (DVR) is a process based on voltage control that compensates for line transients in the distributed system. To overcome these issues and to achieve a higher speed, a new methodology called the Parallel IGBT-Based Interline Dynamic Voltage Restorer (PIGBT-IDVR) method has been proposed, which mainly spotlights the dynamic processing of energy reloads in common dc-linked energy storage with less adaptive transition. The interline power flow controller (IPFC) scheme has been employed to manage the power transmission between the lines and the restorer method for controlling the reactive power in the individual lines. By employing the proposed methodology, the failure of a distributed system has been avoided and provides better performance than the existing methodologies. PMID:26613101
NASA Astrophysics Data System (ADS)
Hassan Asemani, Mohammad; Johari Majd, Vahid
2015-12-01
This paper addresses a robust H∞ fuzzy observer-based tracking design problem for uncertain Takagi-Sugeno fuzzy systems with external disturbances. To have a practical observer-based controller, the premise variables of the system are assumed to be not measurable in general, which leads to a more complex design process. The tracker is synthesised based on a fuzzy Lyapunov function approach and non-parallel distributed compensation (non-PDC) scheme. Using the descriptor redundancy approach, the robust stability conditions are derived in the form of strict linear matrix inequalities (LMIs) even in the presence of uncertainties in the system, input, and output matrices simultaneously. Numerical simulations are provided to show the effectiveness of the proposed method.
High-performance parallel analysis of coupled problems for aircraft propulsion
NASA Technical Reports Server (NTRS)
Felippa, C. A.; Farhat, C.; Lanteri, S.; Gumaste, U.; Ronaghi, M.
1994-01-01
Applications are described of high-performance parallel, computation for the analysis of complete jet engines, considering its multi-discipline coupled problem. The coupled problem involves interaction of structures with gas dynamics, heat conduction and heat transfer in aircraft engines. The methodology issues addressed include: consistent discrete formulation of coupled problems with emphasis on coupling phenomena; effect of partitioning strategies, augmentation and temporal solution procedures; sensitivity of response to problem parameters; and methods for interfacing multiscale discretizations in different single fields. The computer implementation issues addressed include: parallel treatment of coupled systems; domain decomposition and mesh partitioning strategies; data representation in object-oriented form and mapping to hardware driven representation, and tradeoff studies between partitioning schemes and fully coupled treatment.
Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Sohn, Andrew
1996-01-01
Dynamic mesh adaption on unstructured grids is a powerful tool for efficiently computing unsteady problems to resolve solution features of interest. Unfortunately, this causes load imbalance among processors on a parallel machine. This paper describes the parallel implementation of a tetrahedral mesh adaption scheme and a new global load balancing method. A heuristic remapping algorithm is presented that assigns partitions to processors such that the redistribution cost is minimized. Results indicate that the parallel performance of the mesh adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64 processors of an SP2 when 35% of the mesh is randomly adapted. For large-scale scientific computations, our load balancing strategy gives almost a sixfold reduction in solver execution times over non-balanced loads. Furthermore, our heuristic remapper yields processor assignments that are less than 3% off the optimal solutions but requires only 1% of the computational time.
Fast parallel 3D profilometer with DMD technology
NASA Astrophysics Data System (ADS)
Hou, Wenmei; Zhang, Yunbo
2011-12-01
Confocal microscope has been a powerful tool for three-dimensional profile analysis. Single mode confocal microscope is limited by scanning speed. This paper presents a 3D profilometer prototype of parallel confocal microscope based on DMD (Digital Micromirror Device). In this system the DMD takes the place of Nipkow Disk which is a classical parallel scanning scheme to realize parallel lateral scanning technique. Operated with certain pattern, the DMD generates a virtual pinholes array which separates the light into multi-beams. The key parameters that affect the measurement (pinhole size and the lateral scanning distance) can be configured conveniently by different patterns sent to DMD chip. To avoid disturbance between two virtual pinholes working at the same time, a scanning strategy is adopted. Depth response curve both axial and abaxial were extract. Measurement experiments have been carried out on silicon structured sample, and axial resolution of 55nm is achieved.
A highly parallel multigrid-like method for the solution of the Euler equations
NASA Technical Reports Server (NTRS)
Tuminaro, Ray S.
1989-01-01
We consider a highly parallel multigrid-like method for the solution of the two dimensional steady Euler equations. The new method, introduced as filtering multigrid, is similar to a standard multigrid scheme in that convergence on the finest grid is accelerated by iterations on coarser grids. In the filtering method, however, additional fine grid subproblems are processed concurrently with coarse grid computations to further accelerate convergence. These additional problems are obtained by splitting the residual into a smooth and an oscillatory component. The smooth component is then used to form a coarse grid problem (similar to standard multigrid) while the oscillatory component is used for a fine grid subproblem. The primary advantage in the filtering approach is that fewer iterations are required and that most of the additional work per iteration can be performed in parallel with the standard coarse grid computations. We generalize the filtering algorithm to a version suitable for nonlinear problems. We emphasize that this generalization is conceptually straight-forward and relatively easy to implement. In particular, no explicit linearization (e.g., formation of Jacobians) needs to be performed (similar to the FAS multigrid approach). We illustrate the nonlinear version by applying it to the Euler equations, and presenting numerical results. Finally, a performance evaluation is made based on execution time models and convergence information obtained from numerical experiments.
NASA Astrophysics Data System (ADS)
Zhang, Jilin; Sha, Chaoqun; Wu, Yusen; Wan, Jian; Zhou, Li; Ren, Yongjian; Si, Huayou; Yin, Yuyu; Jing, Ya
2017-02-01
GPU not only is used in the field of graphic technology but also has been widely used in areas needing a large number of numerical calculations. In the energy industry, because of low carbon, high energy density, high duration and other characteristics, the development of nuclear energy cannot easily be replaced by other energy sources. Management of core fuel is one of the major areas of concern in a nuclear power plant, and it is directly related to the economic benefits and cost of nuclear power. The large-scale reactor core expansion equation is large and complicated, so the calculation of the diffusion equation is crucial in the core fuel management process. In this paper, we use CUDA programming technology on a GPU cluster to run the LU-SGS parallel iterative calculation against the background of the diffusion equation of the reactor. We divide one-dimensional and two-dimensional mesh into a plurality of domains, with each domain evenly distributed on the GPU blocks. A parallel collision scheme is put forward that defines the virtual boundary of the grid exchange information and data transmission by non-stop collision. Compared with the serial program, the experiment shows that GPU greatly improves the efficiency of program execution and verifies that GPU is playing a much more important role in the field of numerical calculations.
NASA Astrophysics Data System (ADS)
Rerucha, Simon; Sarbort, Martin; Hola, Miroslava; Cizek, Martin; Hucl, Vaclav; Cip, Ondrej; Lazar, Josef
2016-12-01
The homodyne detection with only a single detector represents a promising approach in the interferometric application which enables a significant reduction of the optical system complexity while preserving the fundamental resolution and dynamic range of the single frequency laser interferometers. We present the design, implementation and analysis of algorithmic methods for computational processing of the single-detector interference signal based on parallel pipelined processing suitable for real time implementation on a programmable hardware platform (e.g. the FPGA - Field Programmable Gate Arrays or the SoC - System on Chip). The algorithmic methods incorporate (a) the single detector signal (sine) scaling, filtering, demodulations and mixing necessary for the second (cosine) quadrature signal reconstruction followed by a conic section projection in Cartesian plane as well as (a) the phase unwrapping together with the goniometric and linear transformations needed for the scale linearization and periodic error correction. The digital computing scheme was designed for bandwidths up to tens of megahertz which would allow to measure the displacements at the velocities around half metre per second. The algorithmic methods were tested in real-time operation with a PC-based reference implementation that employed the advantage pipelined processing by balancing the computational load among multiple processor cores. The results indicate that the algorithmic methods are suitable for a wide range of applications [3] and that they are bringing the fringe counting interferometry closer to the industrial applications due to their optical setup simplicity and robustness, computational stability, scalability and also a cost-effectiveness.
NASA Astrophysics Data System (ADS)
Ouyang, Lizhi
A systematic improvement and extension of the orthogonalized linear combinations of atomic orbitals method was carried out using a combined computational and theoretical approach. For high performance parallel computing, a Beowulf class personal computer cluster was constructed. It also served as a parallel program development platform that helped us to port the programs of the method to the national supercomputer facilities. The program, received a language upgrade from Fortran 77 to Fortran 90, and a dynamic memory allocation feature. A preliminary parallel High Performance Fortran version of the program has been developed as well. To be of more benefit though, scalability improvements are needed. In order to circumvent the difficulties of the analytical force calculation in the method, we developed a geometry optimization scheme using the finite difference approximation based on the total energy calculation. The implementation of this scheme was facilitated by the powerful general utility lattice program, which offers many desired features such as multiple optimization schemes and usage of space group symmetry. So far, many ceramic oxides have been tested with the geometry optimization program. Their optimized geometries were in excellent agreement with the experimental data. For nine ceramic oxide crystals, the optimized cell parameters differ from the experimental ones within 0.5%. Moreover, the geometry optimization was recently used to predict a new phase of TiNx. The method has also been used to investigate a complex Vitamin B12-derivative, the OHCbl crystals. In order to overcome the prohibitive disk I/O demand, an on-demand version of the method was developed. Based on the electronic structure calculation of the OHCbl crystal, a partial density of states analysis and a bond order analysis were carried out. The calculated bonding of the corrin ring of OHCbl model was coincident with the big open-ring pi bond. One interesting find of the calculation was that the Co-OH bond was weak. This, together with the ongoing projects studying different Vitamin B12 derivatives, might help us to answer questions about the Co-C cleavage of the B12 coenzyme, which is involved in many important B12 enzymatic reactions.
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TRIANGULATED SURFACES*
Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M.; Whitaker, Ross T.
2012-01-01
This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton–Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512–2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers. PMID:22641200
Parallel Aircraft Trajectory Optimization with Analytic Derivatives
NASA Technical Reports Server (NTRS)
Falck, Robert D.; Gray, Justin S.; Naylor, Bret
2016-01-01
Trajectory optimization is an integral component for the design of aerospace vehicles, but emerging aircraft technologies have introduced new demands on trajectory analysis that current tools are not well suited to address. Designing aircraft with technologies such as hybrid electric propulsion and morphing wings requires consideration of the operational behavior as well as the physical design characteristics of the aircraft. The addition of operational variables can dramatically increase the number of design variables which motivates the use of gradient based optimization with analytic derivatives to solve the larger optimization problems. In this work we develop an aircraft trajectory analysis tool using a Legendre-Gauss-Lobatto based collocation scheme, providing analytic derivatives via the OpenMDAO multidisciplinary optimization framework. This collocation method uses an implicit time integration scheme that provides a high degree of sparsity and thus several potential options for parallelization. The performance of the new implementation was investigated via a series of single and multi-trajectory optimizations using a combination of parallel computing and constraint aggregation. The computational performance results show that in order to take full advantage of the sparsity in the problem it is vital to parallelize both the non-linear analysis evaluations and the derivative computations themselves. The constraint aggregation results showed a significant numerical challenge due to difficulty in achieving tight convergence tolerances. Overall, the results demonstrate the value of applying analytic derivatives to trajectory optimization problems and lay the foundation for future application of this collocation based method to the design of aircraft with where operational scheduling of technologies is key to achieving good performance.
A High-Resolution Capability for Large-Eddy Simulation of Jet Flows
NASA Technical Reports Server (NTRS)
DeBonis, James R.
2011-01-01
A large-eddy simulation (LES) code that utilizes high-resolution numerical schemes is described and applied to a compressible jet flow. The code is written in a general manner such that the accuracy/resolution of the simulation can be selected by the user. Time discretization is performed using a family of low-dispersion Runge-Kutta schemes, selectable from first- to fourth-order. Spatial discretization is performed using central differencing schemes. Both standard schemes, second- to twelfth-order (3 to 13 point stencils) and Dispersion Relation Preserving schemes from 7 to 13 point stencils are available. The code is written in Fortran 90 and uses hybrid MPI/OpenMP parallelization. The code is applied to the simulation of a Mach 0.9 jet flow. Four-stage third-order Runge-Kutta time stepping and the 13 point DRP spatial discretization scheme of Bogey and Bailly are used. The high resolution numerics used allows for the use of relatively sparse grids. Three levels of grid resolution are examined, 3.5, 6.5, and 9.2 million points. Mean flow, first-order turbulent statistics and turbulent spectra are reported. Good agreement with experimental data for mean flow and first-order turbulent statistics is shown.
Research to Assembly Scheme for Satellite Deck Based on Robot Flexibility Control Principle
NASA Astrophysics Data System (ADS)
Guo, Tao; Hu, Ruiqin; Xiao, Zhengyi; Zhao, Jingjing; Fang, Zhikai
2018-03-01
Deck assembly is critical quality control point in final satellite assembly process, and cable extrusion and structure collision problems in assembly process will affect development quality and progress of satellite directly. Aimed at problems existing in deck assembly process, assembly project scheme for satellite deck based on robot flexibility control principle is proposed in this paper. Scheme is introduced firstly; secondly, key technologies on end force perception and flexible docking control in the scheme are studied; then, implementation process of assembly scheme for satellite deck is described in detail; finally, actual application case of assembly scheme is given. Result shows that compared with traditional assembly scheme, assembly scheme for satellite deck based on robot flexibility control principle has obvious advantages in work efficiency, reliability and universality aspects etc.
Accelerating NBODY6 with graphics processing units
NASA Astrophysics Data System (ADS)
Nitadori, Keigo; Aarseth, Sverre J.
2012-07-01
We describe the use of graphics processing units (GPUs) for speeding up the code NBODY6 which is widely used for direct N-body simulations. Over the years, the N2 nature of the direct force calculation has proved a barrier for extending the particle number. Following an early introduction of force polynomials and individual time steps, the calculation cost was first reduced by the introduction of a neighbour scheme. After a decade of GRAPE computers which speeded up the force calculation further, we are now in the era of GPUs where relatively small hardware systems are highly cost effective. A significant gain in efficiency is achieved by employing the GPU to obtain the so-called regular force which typically involves some 99 per cent of the particles, while the remaining local forces are evaluated on the host. However, the latter operation is performed up to 20 times more frequently and may still account for a significant cost. This effort is reduced by parallel SSE/AVX procedures where each interaction term is calculated using mainly single precision. We also discuss further strategies connected with coordinate and velocity prediction required by the integration scheme. This leaves hard binaries and multiple close encounters which are treated by several regularization methods. The present NBODY6-GPU code is well balanced for simulations in the particle range 104-2 × 105 for a dual-GPU system attached to a standard PC.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gong, Zhenhuan; Boyuka, David; Zou, X
Download Citation Email Print Request Permissions Save to Project The size and scope of cutting-edge scientific simulations are growing much faster than the I/O and storage capabilities of their run-time environments. The growing gap is exacerbated by exploratory, data-intensive analytics, such as querying simulation data with multivariate, spatio-temporal constraints, which induces heterogeneous access patterns that stress the performance of the underlying storage system. Previous work addresses data layout and indexing techniques to improve query performance for a single access pattern, which is not sufficient for complex analytics jobs. We present PARLO a parallel run-time layout optimization framework, to achieve multi-levelmore » data layout optimization for scientific applications at run-time before data is written to storage. The layout schemes optimize for heterogeneous access patterns with user-specified priorities. PARLO is integrated with ADIOS, a high-performance parallel I/O middleware for large-scale HPC applications, to achieve user-transparent, light-weight layout optimization for scientific datasets. It offers simple XML-based configuration for users to achieve flexible layout optimization without the need to modify or recompile application codes. Experiments show that PARLO improves performance by 2 to 26 times for queries with heterogeneous access patterns compared to state-of-the-art scientific database management systems. Compared to traditional post-processing approaches, its underlying run-time layout optimization achieves a 56% savings in processing time and a reduction in storage overhead of up to 50%. PARLO also exhibits a low run-time resource requirement, while also limiting the performance impact on running applications to a reasonable level.« less
Parallel heterogeneous architectures for efficient OMP compressive sensing reconstruction
NASA Astrophysics Data System (ADS)
Kulkarni, Amey; Stanislaus, Jerome L.; Mohsenin, Tinoosh
2014-05-01
Compressive Sensing (CS) is a novel scheme, in which a signal that is sparse in a known transform domain can be reconstructed using fewer samples. The signal reconstruction techniques are computationally intensive and have sluggish performance, which make them impractical for real-time processing applications . The paper presents novel architectures for Orthogonal Matching Pursuit algorithm, one of the popular CS reconstruction algorithms. We show the implementation results of proposed architectures on FPGA, ASIC and on a custom many-core platform. For FPGA and ASIC implementation, a novel thresholding method is used to reduce the processing time for the optimization problem by at least 25%. Whereas, for the custom many-core platform, efficient parallelization techniques are applied, to reconstruct signals with variant signal lengths of N and sparsity of m. The algorithm is divided into three kernels. Each kernel is parallelized to reduce execution time, whereas efficient reuse of the matrix operators allows us to reduce area. Matrix operations are efficiently paralellized by taking advantage of blocked algorithms. For demonstration purpose, all architectures reconstruct a 256-length signal with maximum sparsity of 8 using 64 measurements. Implementation on Xilinx Virtex-5 FPGA, requires 27.14 μs to reconstruct the signal using basic OMP. Whereas, with thresholding method it requires 18 μs. ASIC implementation reconstructs the signal in 13 μs. However, our custom many-core, operating at 1.18 GHz, takes 18.28 μs to complete. Our results show that compared to the previous published work of the same algorithm and matrix size, proposed architectures for FPGA and ASIC implementations perform 1.3x and 1.8x respectively faster. Also, the proposed many-core implementation performs 3000x faster than the CPU and 2000x faster than the GPU.
A simplified filterless photonic frequency octupling scheme based on cascaded modulators
NASA Astrophysics Data System (ADS)
Zhang, Wu; Wen, Aijun; Gao, Yongsheng; Zheng, Hanxiao; Chen, Wei; He, Hongye
2017-04-01
A simplified filterless frequency octupling scheme by connecting an intensity modulator (IM) with a dual-parallel Mach-Zehnder (DPMZM) in series is proposed in this paper. The LO signal is distributed into two parts, and one part is used to drive the IM and the other part is applied to drive the DPMZM's upper sub-modulator, both at the peak point. The lower sub-modulator is only driven by dc bias, and the parent modulator works at null point. By properly adjusting dc bias of the lower sub-modulator, only ±4th-order optical sidebands dominate at the output of the DPMZM. The approach is verified by experiments, and 32-GHz and 40-GHz millimetre waves (mm-waves) are generated using 4-GHz and 5-GHz LO signals, respectively. We acquire a 15-dB electrical spurious suppression ratio (ESSR) and a relatively good phase noise of the signal. Compared with other schemes, the scheme is simple in configuration because only an IM and a DPMZM are needed. What's more, the scheme is tunable in frequency as no filter is used.
GALLAUDET'S NEW HEARING AND SPEECH CENTER.
ERIC Educational Resources Information Center
FRISINA, D. ROBERT
THIS REPROT DESCRIBES THE DESIGN OF A NEW SPEECH AND HEARING CENTER AND ITS INTEGRATION INTO THE OVERALL ARCHITECTURAL SCHEME OF THE CAMPUS. THE CIRCULAR SHAPE WAS SELECTED TO COMPLEMENT THE SURROUNDING STRUCTURES AND COMPENSATE FOR DIFFERENCES IN SITE, WHILE PROVIDING THE ACOUSTICAL ADVANTAGES OF NON-PARALLEL WALLS, AND FACILITATING TRAFFIC FLOW.…
Sun, Xiaole; Djordjevic, Ivan B; Neifeld, Mark A
2016-11-28
We investigate a multiple spatial modes based quantum key distribution (QKD) scheme that employs multiple independent parallel beams through a marine free-space optical channel over open ocean. This approach provides the potential to increase secret key rate (SKR) linearly with the number of channels. To improve the SKR performance, we describe a back-propagation mode (BPM) method to mitigate the atmospheric turbulence effects. Our simulation results indicate that the secret key rate can be improved significantly by employing the proposed BPM-based multi-channel QKD scheme.
NASA Astrophysics Data System (ADS)
Hirai, K.; Katoh, Y.; Terada, N.; Kawai, S.
2016-12-01
In accretion disks, magneto-rotational instability (MRI; Balbus & Hawley, 1991) makes the disk gas in the magnetic turbulent state and drives efficient mass accretion into a central star. MRI drives turbulence through the evolution of the parasitic instability (PI; Goodman & Xu, 1994), which is related to both Kelvin-Helmholtz (K-H) instability and magnetic reconnection. The wave number vector of PI is strongly affected by both magnetic diffusivity and fluid viscosity (Pessah, 2010). This fact makes MHD simulation of MRI difficult, because we need to employ the numerical diffusivity for treating discontinuities in compressible MHD simulation schemes. Therefore, it is necessary to use an MHD scheme that has both high-order accuracy so as to resolve MRI driven turbulence and small numerical diffusivity enough to treat discontinuities. We have originally developed an MHD code by employing the scheme proposed by Kawai (2013). This scheme focuses on resolving turbulence accurately by using a high-order compact difference scheme (Lele, 1992), and meanwhile, the scheme treats discontinuities by using the localized artificial diffusivity method (Kawai, 2013). Our code also employs the pipeline algorithm (Matsuura & Kato, 2007) for MPI parallelization without diminishing the accuracy of the compact difference scheme. We carry out a 3-dimensional ideal MHD simulation with a net vertical magnetic field in the local shearing box disk model. We use 256x256x128 grids. Simulation results show that the spatially averaged turbulent stress induced by MRI linearly grows until around 2.8 orbital periods, and decreases after the saturation. We confirm the strong enhancement of the K-H mode PI at a timing just before the saturation, identified by the enhancement of its anisotropic wavenumber spectra in the 2-dimensional wavenumber space. The wave number of the maximum growth of PI reproduced in the simulation result is larger than the linear analysis. This discrepancy is explained by the simulation result that a shear flow created by MRI locally becomes thinner and faster due to interactions between antiparallel vortices induced by K-H mode PI, and this structure induces small scale waves which break the shear flow itself. We report the results of the simulation, and discuss how the saturation amplitude of MRI is determined.
Yu, Yinan; Diamantaras, Konstantinos I; McKelvey, Tomas; Kung, Sun-Yuan
2018-02-01
In kernel-based classification models, given limited computational power and storage capacity, operations over the full kernel matrix becomes prohibitive. In this paper, we propose a new supervised learning framework using kernel models for sequential data processing. The framework is based on two components that both aim at enhancing the classification capability with a subset selection scheme. The first part is a subspace projection technique in the reproducing kernel Hilbert space using a CLAss-specific Subspace Kernel representation for kernel approximation. In the second part, we propose a novel structural risk minimization algorithm called the adaptive margin slack minimization to iteratively improve the classification accuracy by an adaptive data selection. We motivate each part separately, and then integrate them into learning frameworks for large scale data. We propose two such frameworks: the memory efficient sequential processing for sequential data processing and the parallelized sequential processing for distributed computing with sequential data acquisition. We test our methods on several benchmark data sets and compared with the state-of-the-art techniques to verify the validity of the proposed techniques.
Reaction behaviors of decomposition of monocrotophos in aqueous solution by UV and UV/O processes.
Ku, Y; Wang, W; Shen, Y S
2000-02-01
The decomposition of monocrotophos (cis-3-dimethoxyphosphinyloxy-N-methyl-crotonamide) in aqueous solution by UV and UV/O(3) processes was studied. The experiments were carried out under various solution pH values to investigate the decomposition efficiencies of the reactant and organic intermediates in order to determine the completeness of decomposition. The photolytic decomposition rate of monocrotophos was increased with increasing solution pH because the solution pH affects the distribution and light absorbance of monocrotophos species. The combination of O(3) with UV light apparently promoted the decomposition and mineralization of monocrotophos in aqueous solution. For the UV/O(3) process, the breakage of the >C=C< bond of monocrotophos by ozone molecules was found to occur first, followed by mineralization by hydroxyl radicals to generate CO(3)(2-), PO4(3-), and NO(3)(-) anions in sequence. The quasi-global kinetics based on a simplified consecutive-parallel reaction scheme was developed to describe the temporal behavior of monocrotophos decomposition in aqueous solution by the UV/O(3) process.
Parallel processing considerations for image recognition tasks
NASA Astrophysics Data System (ADS)
Simske, Steven J.
2011-01-01
Many image recognition tasks are well-suited to parallel processing. The most obvious example is that many imaging tasks require the analysis of multiple images. From this standpoint, then, parallel processing need be no more complicated than assigning individual images to individual processors. However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. Parallel processing by task allows the assignment of multiple workflows-as diverse as optical character recognition [OCR], document classification and barcode reading-to parallel pipelines. This can substantially decrease time to completion for the document tasks. For this approach, each parallel pipeline is generally performing a different task. Parallel processing by image region allows a larger imaging task to be sub-divided into a set of parallel pipelines, each performing the same task but on a different data set. This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. Finally, parallel processing by meta-algorithm allows different algorithms to be deployed on the same image simultaneously. This approach may result in improved accuracy.
31 CFR 592.301 - Controlled through the Kimberley Process Certification Scheme.
Code of Federal Regulations, 2011 CFR
2011-07-01
... Process Certification Scheme. 592.301 Section 592.301 Money and Finance: Treasury Regulations Relating to... Certification Scheme. (a) Except as otherwise provided in paragraph (b) of this section, the term controlled through the Kimberley Process Certification Scheme refers to the following requirements that apply, as...
31 CFR 592.301 - Controlled through the Kimberley Process Certification Scheme.
Code of Federal Regulations, 2013 CFR
2013-07-01
... Process Certification Scheme. 592.301 Section 592.301 Money and Finance: Treasury Regulations Relating to... Certification Scheme. (a) Except as otherwise provided in paragraph (b) of this section, the term controlled through the Kimberley Process Certification Scheme refers to the following requirements that apply, as...
31 CFR 592.301 - Controlled through the Kimberley Process Certification Scheme.
Code of Federal Regulations, 2014 CFR
2014-07-01
... Process Certification Scheme. 592.301 Section 592.301 Money and Finance: Treasury Regulations Relating to... Certification Scheme. (a) Except as otherwise provided in paragraph (b) of this section, the term controlled through the Kimberley Process Certification Scheme refers to the following requirements that apply, as...
31 CFR 592.301 - Controlled through the Kimberley Process Certification Scheme.
Code of Federal Regulations, 2010 CFR
2010-07-01
... Process Certification Scheme. 592.301 Section 592.301 Money and Finance: Treasury Regulations Relating to... Certification Scheme. (a) Except as otherwise provided in paragraph (b) of this section, the term controlled through the Kimberley Process Certification Scheme refers to the following requirements that apply, as...
31 CFR 592.301 - Controlled through the Kimberley Process Certification Scheme.
Code of Federal Regulations, 2012 CFR
2012-07-01
... Process Certification Scheme. 592.301 Section 592.301 Money and Finance: Treasury Regulations Relating to... Certification Scheme. (a) Except as otherwise provided in paragraph (b) of this section, the term controlled through the Kimberley Process Certification Scheme refers to the following requirements that apply, as...
Parallelization and checkpointing of GPU applications through program transformation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Solano-Quinde, Lizandro Damian
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that makes writing general-purpose applications for running on GPUs tractable have consolidated GPUs as an alternative for accelerating general purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solvemore » the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime, however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems present new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and to develop support for application-level fault tolerance in applications using multiple GPUs. Our techniques reduce the burden of enhancing single-GPU applications to support these features. To achieve our goal, this work designs and implements a framework for enhancing a single-GPU OpenCL application through application transformation.« less
Feedback control of plasma instabilities with charged particle beams and study of plasma turbulence
NASA Technical Reports Server (NTRS)
Tham, Philip Kin-Wah
1994-01-01
A new non-perturbing technique for feedback control of plasma instabilities has been developed in the Columbia Linear Machine (CLM). The feedback control scheme involves the injection of a feedback modulated ion beam as a remote suppressor. The ion beam was obtained from a compact ion beam source which was developed for this purpose. A Langmuir probe was used as the feedback sensor. The feedback controller consisted of a phase-shifter and amplifiers. This technique was demonstrated by stabilizing various plasma instabilities to the background noise level, like the trapped particle instability, the ExB instability and the ion-temperature-gradient (ITG) driven instability. An important feature of this scheme is that the injected ion beam is non-perturbing to the plasma equilibrium parameters. The robustness of this feedback stabilization scheme was also investigated. The principal result is that the scheme is fairly robust, tolerating about 100% variation about the nominal parameter values. Next, this scheme is extended to the unsolved general problem of controlling multimode plasma instabilities simultaneously with a single sensor-suppressor pair. A single sensor-suppressor pair of feedback probes is desirable to reduce the perturbation caused by the probes. Two plasma instabilities the ExB and the ITG modes, were simultaneously stabilized. A simple 'state' feedback type method was used where more state information was generated from the single sensor Langmuir probe by appropriate signal processing, in this case, by differentiation. This proof-of-principle experiment demonstrated for the first time that by designing a more sophisticated electronic feedback controller, many plasma instabilities may be simultaneously controlled. Simple theoretical models showed generally good agreement with the feedback experimental results. On a parallel research front, a better understanding of the saturated state of a plasma instability was sought partly with the help of feedback. A plasma instability is usually observed in its saturated state and appears as a single feature in the frequency spectrum with a single azimuthal and parallel wavenumbers. The physics of the non-zero spectral width was investigated in detail because the finite spectral width can cause "turbulent" transport. One aspect of the "turbulence" was investigated by obtaining the scaling of the linear growth rate of the instabilities with the fluctuation levels. The linear growth rates were measured with the established gated feedback technique. The research showed that the ExB instability evolves into a quasi-coherent state when the fluctuation level is high. The coherent aspects were studied with a bispectral analysis. Moreover, the single spectral feature was discovered to be actually composed of a few radial harmonics. The radial harmonics play a role in the nonlinear saturation of the instability via three-wave coupling.
QoS support for end users of I/O-intensive applications using shared storage systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Davis, Marion Kei; Zhang, Xuechen; Jiang, Song
2011-01-19
I/O-intensive applications are becoming increasingly common on today's high-performance computing systems. While performance of compute-bound applications can be effectively guaranteed with techniques such as space sharing or QoS-aware process scheduling, it remains a challenge to meet QoS requirements for end users of I/O-intensive applications using shared storage systems because it is difficult to differentiate I/O services for different applications with individual quality requirements. Furthermore, it is difficult for end users to accurately specify performance goals to the storage system using I/O-related metrics such as request latency or throughput. As access patterns, request rates, and the system workload change in time,more » a fixed I/O performance goal, such as bounds on throughput or latency, can be expensive to achieve and may not lead to a meaningful performance guarantees such as bounded program execution time. We propose a scheme supporting end-users QoS goals, specified in terms of program execution time, in shared storage environments. We automatically translate the users performance goals into instantaneous I/O throughput bounds using a machine learning technique, and use dynamically determined service time windows to efficiently meet the throughput bounds. We have implemented this scheme in the PVFS2 parallel file system and have conducted an extensive evaluation. Our results show that this scheme can satisfy realistic end-user QoS requirements by making highly efficient use of the I/O resources. The scheme seeks to balance programs attainment of QoS requirements, and saves as much of the remaining I/O capacity as possible for best-effort programs.« less
Wang, Qian; Hisatomi, Takashi; Suzuki, Yohichi; Pan, Zhenhua; Seo, Jeongsuk; Katayama, Masao; Minegishi, Tsutomu; Nishiyama, Hiroshi; Takata, Tsuyoshi; Seki, Kazuhiko; Kudo, Akihiko; Yamada, Taro; Domen, Kazunari
2017-02-01
Development of sunlight-driven water splitting systems with high efficiency, scalability, and cost-competitiveness is a central issue for mass production of solar hydrogen as a renewable and storable energy carrier. Photocatalyst sheets comprising a particulate hydrogen evolution photocatalyst (HEP) and an oxygen evolution photocatalyst (OEP) embedded in a conductive thin film can realize efficient and scalable solar hydrogen production using Z-scheme water splitting. However, the use of expensive precious metal thin films that also promote reverse reactions is a major obstacle to developing a cost-effective process at ambient pressure. In this study, we present a standalone particulate photocatalyst sheet based on an earth-abundant, relatively inert, and conductive carbon film for efficient Z-scheme water splitting at ambient pressure. A SrTiO 3 :La,Rh/C/BiVO 4 :Mo sheet is shown to achieve unassisted pure-water (pH 6.8) splitting with a solar-to-hydrogen energy conversion efficiency (STH) of 1.2% at 331 K and 10 kPa, while retaining 80% of this efficiency at 91 kPa. The STH value of 1.0% is the highest among Z-scheme pure water splitting operating at ambient pressure. The working mechanism of the photocatalyst sheet is discussed on the basis of band diagram simulation. In addition, the photocatalyst sheet split pure water more efficiently than conventional powder suspension systems and photoelectrochemical parallel cells because H + and OH - concentration overpotentials and an IR drop between the HEP and OEP were effectively suppressed. The proposed carbon-based photocatalyst sheet, which can be used at ambient pressure, is an important alternative to (photo)electrochemical systems for practical solar hydrogen production.
Efficient algorithms and implementations of entropy-based moment closures for rarefied gases
NASA Astrophysics Data System (ADS)
Schaerer, Roman Pascal; Bansal, Pratyuksh; Torrilhon, Manuel
2017-07-01
We present efficient algorithms and implementations of the 35-moment system equipped with the maximum-entropy closure in the context of rarefied gases. While closures based on the principle of entropy maximization have been shown to yield very promising results for moderately rarefied gas flows, the computational cost of these closures is in general much higher than for closure theories with explicit closed-form expressions of the closing fluxes, such as Grad's classical closure. Following a similar approach as Garrett et al. (2015) [13], we investigate efficient implementations of the computationally expensive numerical quadrature method used for the moment evaluations of the maximum-entropy distribution by exploiting its inherent fine-grained parallelism with the parallelism offered by multi-core processors and graphics cards. We show that using a single graphics card as an accelerator allows speed-ups of two orders of magnitude when compared to a serial CPU implementation. To accelerate the time-to-solution for steady-state problems, we propose a new semi-implicit time discretization scheme. The resulting nonlinear system of equations is solved with a Newton type method in the Lagrange multipliers of the dual optimization problem in order to reduce the computational cost. Additionally, fully explicit time-stepping schemes of first and second order accuracy are presented. We investigate the accuracy and efficiency of the numerical schemes for several numerical test cases, including a steady-state shock-structure problem.
Application of the θ-method to a telegraphic model of fluid flow in a dual-porosity medium
NASA Astrophysics Data System (ADS)
González-Calderón, Alfredo; Vivas-Cruz, Luis X.; Herrera-Hernández, Erik César
2018-01-01
This work focuses mainly on the study of numerical solutions, which are obtained using the θ-method, of a generalized Warren and Root model that includes a second-order wave-like equation in its formulation. The solutions approximately describe the single-phase hydraulic head in fractures by considering the finite velocity of propagation by means of a Cattaneo-like equation. The corresponding discretized model is obtained by utilizing a non-uniform grid and a non-uniform time step. A simple relationship is proposed to give the time-step distribution. Convergence is analyzed by comparing results from explicit, fully implicit, and Crank-Nicolson schemes with exact solutions: a telegraphic model of fluid flow in a single-porosity reservoir with relaxation dynamics, the Warren and Root model, and our studied model, which is solved with the inverse Laplace transform. We find that the flux and the hydraulic head have spurious oscillations that most often appear in small-time solutions but are attenuated as the solution time progresses. Furthermore, we show that the finite difference method is unable to reproduce the exact flux at time zero. Obtaining results for oilfield production times, which are in the order of months in real units, is only feasible using parallel implicit schemes. In addition, we propose simple parallel algorithms for the memory flux and for the explicit scheme.
NASA Astrophysics Data System (ADS)
Wang, Feiyan; Morten, Jan Petter; Spitzer, Klaus
2018-05-01
In this paper, we present a recently developed anisotropic 3-D inversion framework for interpreting controlled-source electromagnetic (CSEM) data in the frequency domain. The framework integrates a high-order finite-element forward operator and a Gauss-Newton inversion algorithm. Conductivity constraints are applied using a parameter transformation. We discretize the continuous forward and inverse problems on unstructured grids for a flexible treatment of arbitrarily complex geometries. Moreover, an unstructured mesh is more desirable in comparison to a single rectilinear mesh for multisource problems because local grid refinement will not significantly influence the mesh density outside the region of interest. The non-uniform spatial discretization facilitates parametrization of the inversion domain at a suitable scale. For a rapid simulation of multisource EM data, we opt to use a parallel direct solver. We further accelerate the inversion process by decomposing the entire data set into subsets with respect to frequencies (and transmitters if memory requirement is affordable). The computational tasks associated with each data subset are distributed to different processes and run in parallel. We validate the scheme using a synthetic marine CSEM model with rough bathymetry, and finally, apply it to an industrial-size 3-D data set from the Troll field oil province in the North Sea acquired in 2008 to examine its robustness and practical applicability.
NASA Technical Reports Server (NTRS)
Rajagopalan, J.; Xing, K.; Guo, Y.; Lee, F. C.; Manners, Bruce
1996-01-01
A simple, application-oriented, transfer function model of paralleled converters employing Master-Slave Current-sharing (MSC) control is developed. Dynamically, the Master converter retains its original design characteristics; all the Slave converters are forced to depart significantly from their original design characteristics into current-controlled current sources. Five distinct loop gains to assess system stability and performance are identified and their physical significance is described. A design methodology for the current share compensator is presented. The effect of this current sharing scheme on 'system output impedance' is analyzed.
Modified signed-digit arithmetic based on redundant bit representation.
Huang, H; Itoh, M; Yatagai, T
1994-09-10
Fully parallel modified signed-digit arithmetic operations are realized based on redundant bit representation of the digits proposed. A new truth-table minimizing technique is presented based on redundant-bitrepresentation coding. It is shown that only 34 minterms are enough for implementing one-step modified signed-digit addition and subtraction with this new representation. Two optical implementation schemes, correlation and matrix multiplication, are described. Experimental demonstrations of the correlation architecture are presented. Both architectures use fixed minterm masks for arbitrary-length operands, taking full advantage of the parallelism of the modified signed-digit number system and optics.
Multiplexed EFPI sensors with ultra-high resolution
NASA Astrophysics Data System (ADS)
Ushakov, Nikolai; Liokumovich, Leonid
2014-05-01
An investigation of performance of multiplexed displacement sensors based on extrinsic Fabry-Perot interferometers has been carried out. We have considered serial and parallel configurations and analyzed the issues and advantages of the both. We have also extended the previously developed baseline demodulation algorithm for the case of a system of multiplexed sensors. Serial and parallel multiplexing schemes have been experimentally implemented with 3 and 4 sensing elements, respectively. For both configurations the achieved baseline standard deviations were between 30 and 200 pm, which is, to the best of our knowledge, more than an order less than any other multiplexed EFPI resolution ever reported.
Atlan, Michael; Desbiolles, Pierre; Gross, Michel; Coppey-Moisan, Maïté
2010-03-01
We developed a microscope intended to probe, using a parallel heterodyne receiver, the fluctuation spectrum of light quasi-elastically scattered by gold nanoparticles diffusing in viscous fluids. The cutoff frequencies of the recorded spectra scale up linearly with those expected from single-scattering formalism in a wide range of dynamic viscosities (1 to 15 times water viscosity at room temperature). Our scheme enables ensemble-averaged optical fluctuations measurements over multispeckle recordings in low light, at temporal frequencies up to 10 kHz, with a 12 Hz framerate array detector.
Dual-thread parallel control strategy for ophthalmic adaptive optics.
Yu, Yongxin; Zhang, Yuhua
To improve ophthalmic adaptive optics speed and compensate for ocular wavefront aberration of high temporal frequency, the adaptive optics wavefront correction has been implemented with a control scheme including 2 parallel threads; one is dedicated to wavefront detection and the other conducts wavefront reconstruction and compensation. With a custom Shack-Hartmann wavefront sensor that measures the ocular wave aberration with 193 subapertures across the pupil, adaptive optics has achieved a closed loop updating frequency up to 110 Hz, and demonstrated robust compensation for ocular wave aberration up to 50 Hz in an adaptive optics scanning laser ophthalmoscope.
Dual-thread parallel control strategy for ophthalmic adaptive optics
Yu, Yongxin; Zhang, Yuhua
2015-01-01
To improve ophthalmic adaptive optics speed and compensate for ocular wavefront aberration of high temporal frequency, the adaptive optics wavefront correction has been implemented with a control scheme including 2 parallel threads; one is dedicated to wavefront detection and the other conducts wavefront reconstruction and compensation. With a custom Shack-Hartmann wavefront sensor that measures the ocular wave aberration with 193 subapertures across the pupil, adaptive optics has achieved a closed loop updating frequency up to 110 Hz, and demonstrated robust compensation for ocular wave aberration up to 50 Hz in an adaptive optics scanning laser ophthalmoscope. PMID:25866498
NASA Technical Reports Server (NTRS)
Kemeny, Sabrina E.
1994-01-01
Electronic and optoelectronic hardware implementations of highly parallel computing architectures address several ill-defined and/or computation-intensive problems not easily solved by conventional computing techniques. The concurrent processing architectures developed are derived from a variety of advanced computing paradigms including neural network models, fuzzy logic, and cellular automata. Hardware implementation technologies range from state-of-the-art digital/analog custom-VLSI to advanced optoelectronic devices such as computer-generated holograms and e-beam fabricated Dammann gratings. JPL's concurrent processing devices group has developed a broad technology base in hardware implementable parallel algorithms, low-power and high-speed VLSI designs and building block VLSI chips, leading to application-specific high-performance embeddable processors. Application areas include high throughput map-data classification using feedforward neural networks, terrain based tactical movement planner using cellular automata, resource optimization (weapon-target assignment) using a multidimensional feedback network with lateral inhibition, and classification of rocks using an inner-product scheme on thematic mapper data. In addition to addressing specific functional needs of DOD and NASA, the JPL-developed concurrent processing device technology is also being customized for a variety of commercial applications (in collaboration with industrial partners), and is being transferred to U.S. industries. This viewgraph p resentation focuses on two application-specific processors which solve the computation intensive tasks of resource allocation (weapon-target assignment) and terrain based tactical movement planning using two extremely different topologies. Resource allocation is implemented as an asynchronous analog competitive assignment architecture inspired by the Hopfield network. Hardware realization leads to a two to four order of magnitude speed-up over conventional techniques and enables multiple assignments, (many to many), not achievable with standard statistical approaches. Tactical movement planning (finding the best path from A to B) is accomplished with a digital two-dimensional concurrent processor array. By exploiting the natural parallel decomposition of the problem in silicon, a four order of magnitude speed-up over optimized software approaches has been demonstrated.
Development of iterative techniques for the solution of unsteady compressible viscous flows
NASA Technical Reports Server (NTRS)
Sankar, Lakshmi N.; Hixon, Duane
1991-01-01
Efficient iterative solution methods are being developed for the numerical solution of two- and three-dimensional compressible Navier-Stokes equations. Iterative time marching methods have several advantages over classical multi-step explicit time marching schemes, and non-iterative implicit time marching schemes. Iterative schemes have better stability characteristics than non-iterative explicit and implicit schemes. Thus, the extra work required by iterative schemes can also be designed to perform efficiently on current and future generation scalable, missively parallel machines. An obvious candidate for iteratively solving the system of coupled nonlinear algebraic equations arising in CFD applications is the Newton method. Newton's method was implemented in existing finite difference and finite volume methods. Depending on the complexity of the problem, the number of Newton iterations needed per step to solve the discretized system of equations can, however, vary dramatically from a few to several hundred. Another popular approach based on the classical conjugate gradient method, known as the GMRES (Generalized Minimum Residual) algorithm is investigated. The GMRES algorithm was used in the past by a number of researchers for solving steady viscous and inviscid flow problems with considerable success. Here, the suitability of this algorithm is investigated for solving the system of nonlinear equations that arise in unsteady Navier-Stokes solvers at each time step. Unlike the Newton method which attempts to drive the error in the solution at each and every node down to zero, the GMRES algorithm only seeks to minimize the L2 norm of the error. In the GMRES algorithm the changes in the flow properties from one time step to the next are assumed to be the sum of a set of orthogonal vectors. By choosing the number of vectors to a reasonably small value N (between 5 and 20) the work required for advancing the solution from one time step to the next may be kept to (N+1) times that of a noniterative scheme. Many of the operations required by the GMRES algorithm such as matrix-vector multiplies, matrix additions and subtractions can all be vectorized and parallelized efficiently.
The scheme machine: A case study in progress in design derivation at system levels
NASA Technical Reports Server (NTRS)
Johnson, Steven D.
1995-01-01
The Scheme Machine is one of several design projects of the Digital Design Derivation group at Indiana University. It differs from the other projects in its focus on issues of system design and its connection to surrounding research in programming language semantics, compiler construction, and programming methodology underway at Indiana and elsewhere. The genesis of the project dates to the early 1980's, when digital design derivation research branched from the surrounding research effort in programming languages. Both branches have continued to develop in parallel, with this particular project serving as a bridge. However, by 1990 there remained little real interaction between the branches and recently we have undertaken to reintegrate them. On the software side, researchers have refined a mathematically rigorous (but not mechanized) treatment starting with the fully abstract semantic definition of Scheme and resulting in an efficient implementation consisting of a compiler and virtual machine model, the latter typically realized with a general purpose microprocessor. The derivation includes a number of sophisticated factorizations and representations and is also deep example of the underlying engineering methodology. The hardware research has created a mechanized algebra supporting the tedious and massive transformations often seen at lower levels of design. This work has progressed to the point that large scale devices, such as processors, can be derived from first-order finite state machine specifications. This is roughly where the language oriented research stops; thus, together, the two efforts establish a thread from the highest levels of abstract specification to detailed digital implementation. The Scheme Machine project challenges hardware derivation research in several ways, although the individual components of the system are of a similar scale to those we have worked with before. The machine has a custom dual-ported memory to support garbage collection. It consists of four tightly coupled processes--processor, collector, allocator, memory--with a very non-trivial synchronization relationship. Finally, there are deep issues of representation for the run-time objects of a symbolic processing language. The research centers on verification through integrated formal reasoning systems, but is also involved with modeling and prototyping environments. Since the derivation algebra is basd on an executable modeling language, there is opportunity to incorporate design animation in the design process. We are looking for ways to move smoothly and incrementally from executable specifications into hardware realization. For example, we can run the garbage collector specification, a Scheme program, directly against the physical memory prototype, and similarly, the instruction processor model against the heap implementation.
Simulation of Hypervelocity Impact on Aluminum-Nextel-Kevlar Orbital Debris Shields
NASA Technical Reports Server (NTRS)
Fahrenthold, Eric P.
2000-01-01
An improved hybrid particle-finite element method has been developed for hypervelocity impact simulation. The method combines the general contact-impact capabilities of particle codes with the true Lagrangian kinematics of large strain finite element formulations. Unlike some alternative schemes which couple Lagrangian finite element models with smooth particle hydrodynamics, the present formulation makes no use of slidelines or penalty forces. The method has been implemented in a parallel, three dimensional computer code. Simulations of three dimensional orbital debris impact problems using this parallel hybrid particle-finite element code, show good agreement with experiment and good speedup in parallel computation. The simulations included single and multi-plate shields as well as aluminum and composite shielding materials. at an impact velocity of eleven kilometers per second.
A GPU-paralleled implementation of an enhanced face recognition algorithm
NASA Astrophysics Data System (ADS)
Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo
2013-03-01
Face recognition algorithm based on compressed sensing and sparse representation is hotly argued in these years. The scheme of this algorithm increases recognition rate as well as anti-noise capability. However, the computational cost is expensive and has become a main restricting factor for real world applications. In this paper, we introduce a GPU-accelerated hybrid variant of face recognition algorithm named parallel face recognition algorithm (pFRA). We describe here how to carry out parallel optimization design to take full advantage of many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample size. Finally, Our pFRA, implemented with NVIDIA GPU and Computer Unified Device Architecture (CUDA) programming model, achieves a significant speedup over the traditional CPU implementations.
Boosting flood warning schemes with fast emulator of detailed hydrodynamic models
NASA Astrophysics Data System (ADS)
Bellos, V.; Carbajal, J. P.; Leitao, J. P.
2017-12-01
Floods are among the most destructive catastrophic events and their frequency has incremented over the last decades. To reduce flood impact and risks, flood warning schemes are installed in flood prone areas. Frequently, these schemes are based on numerical models which quickly provide predictions of water levels and other relevant observables. However, the high complexity of flood wave propagation in the real world and the need of accurate predictions in urban environments or in floodplains hinders the use of detailed simulators. This sets the difficulty, we need fast predictions that meet the accuracy requirements. Most physics based detailed simulators although accurate, will not fulfill the speed demand. Even if High Performance Computing techniques are used (the magnitude of required simulation time is minutes/hours). As a consequence, most flood warning schemes are based in coarse ad-hoc approximations that cannot take advantage a detailed hydrodynamic simulation. In this work, we present a methodology for developing a flood warning scheme using an Gaussian Processes based emulator of a detailed hydrodynamic model. The methodology consists of two main stages: 1) offline stage to build the emulator; 2) online stage using the emulator to predict and generate warnings. The offline stage consists of the following steps: a) definition of the critical sites of the area under study, and the specification of the observables to predict at those sites, e.g. water depth, flow velocity, etc.; b) generation of a detailed simulation dataset to train the emulator; c) calibration of the required parameters (if measurements are available). The online stage is carried on using the emulator to predict the relevant observables quickly, and the detailed simulator is used in parallel to verify key predictions of the emulator. The speed gain given by the emulator allows also to quantify uncertainty in predictions using ensemble methods. The above methodology is applied in real world scenario.
A summary of image segmentation techniques
NASA Technical Reports Server (NTRS)
Spirkovska, Lilly
1993-01-01
Machine vision systems are often considered to be composed of two subsystems: low-level vision and high-level vision. Low level vision consists primarily of image processing operations performed on the input image to produce another image with more favorable characteristics. These operations may yield images with reduced noise or cause certain features of the image to be emphasized (such as edges). High-level vision includes object recognition and, at the highest level, scene interpretation. The bridge between these two subsystems is the segmentation system. Through segmentation, the enhanced input image is mapped into a description involving regions with common features which can be used by the higher level vision tasks. There is no theory on image segmentation. Instead, image segmentation techniques are basically ad hoc and differ mostly in the way they emphasize one or more of the desired properties of an ideal segmenter and in the way they balance and compromise one desired property against another. These techniques can be categorized in a number of different groups including local vs. global, parallel vs. sequential, contextual vs. noncontextual, interactive vs. automatic. In this paper, we categorize the schemes into three main groups: pixel-based, edge-based, and region-based. Pixel-based segmentation schemes classify pixels based solely on their gray levels. Edge-based schemes first detect local discontinuities (edges) and then use that information to separate the image into regions. Finally, region-based schemes start with a seed pixel (or group of pixels) and then grow or split the seed until the original image is composed of only homogeneous regions. Because there are a number of survey papers available, we will not discuss all segmentation schemes. Rather than a survey, we take the approach of a detailed overview. We focus only on the more common approaches in order to give the reader a flavor for the variety of techniques available yet present enough details to facilitate implementation and experimentation.
NASA Astrophysics Data System (ADS)
Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin
2016-06-01
CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
Combining image-processing and image compression schemes
NASA Technical Reports Server (NTRS)
Greenspan, H.; Lee, M.-C.
1995-01-01
An investigation into the combining of image-processing schemes, specifically an image enhancement scheme, with existing compression schemes is discussed. Results are presented on the pyramid coding scheme, the subband coding scheme, and progressive transmission. Encouraging results are demonstrated for the combination of image enhancement and pyramid image coding schemes, especially at low bit rates. Adding the enhancement scheme to progressive image transmission allows enhanced visual perception at low resolutions. In addition, further progressing of the transmitted images, such as edge detection schemes, can gain from the added image resolution via the enhancement.
Jiang, Chao; Zhang, Hongyan; Wang, Jia; Wang, Yaru; He, Heng; Liu, Rui; Zhou, Fangyuan; Deng, Jialiang; Li, Pengcheng; Luo, Qingming
2011-11-01
Laser speckle imaging (LSI) is a noninvasive and full-field optical imaging technique which produces two-dimensional blood flow maps of tissues from the raw laser speckle images captured by a CCD camera without scanning. We present a hardware-friendly algorithm for the real-time processing of laser speckle imaging. The algorithm is developed and optimized specifically for LSI processing in the field programmable gate array (FPGA). Based on this algorithm, we designed a dedicated hardware processor for real-time LSI in FPGA. The pipeline processing scheme and parallel computing architecture are introduced into the design of this LSI hardware processor. When the LSI hardware processor is implemented in the FPGA running at the maximum frequency of 130 MHz, up to 85 raw images with the resolution of 640×480 pixels can be processed per second. Meanwhile, we also present a system on chip (SOC) solution for LSI processing by integrating the CCD controller, memory controller, LSI hardware processor, and LCD display controller into a single FPGA chip. This SOC solution also can be used to produce an application specific integrated circuit for LSI processing.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kotalczyk, G., E-mail: Gregor.Kotalczyk@uni-due.de; Kruis, F.E.
Monte Carlo simulations based on weighted simulation particles can solve a variety of population balance problems and allow thus to formulate a solution-framework for many chemical engineering processes. This study presents a novel concept for the calculation of coagulation rates of weighted Monte Carlo particles by introducing a family of transformations to non-weighted Monte Carlo particles. The tuning of the accuracy (named ‘stochastic resolution’ in this paper) of those transformations allows the construction of a constant-number coagulation scheme. Furthermore, a parallel algorithm for the inclusion of newly formed Monte Carlo particles due to nucleation is presented in the scope ofmore » a constant-number scheme: the low-weight merging. This technique is found to create significantly less statistical simulation noise than the conventional technique (named ‘random removal’ in this paper). Both concepts are combined into a single GPU-based simulation method which is validated by comparison with the discrete-sectional simulation technique. Two test models describing a constant-rate nucleation coupled to a simultaneous coagulation in 1) the free-molecular regime or 2) the continuum regime are simulated for this purpose.« less
A fast combination calibration of foreground and background for pipelined ADCs
NASA Astrophysics Data System (ADS)
Kexu, Sun; Lenian, He
2012-06-01
This paper describes a fast digital calibration scheme for pipelined analog-to-digital converters (ADCs). The proposed method corrects the nonlinearity caused by finite opamp gain and capacitor mismatch in multiplying digital-to-analog converters (MDACs). The considered calibration technique takes the advantages of both foreground and background calibration schemes. In this combination calibration algorithm, a novel parallel background calibration with signal-shifted correlation is proposed, and its calibration cycle is very short. The details of this technique are described in the example of a 14-bit 100 Msample/s pipelined ADC. The high convergence speed of this background calibration is achieved by three means. First, a modified 1.5-bit stage is proposed in order to allow the injection of a large pseudo-random dithering without missing code. Second, before correlating the signal, it is shifted according to the input signal so that the correlation error converges quickly. Finally, the front pipeline stages are calibrated simultaneously rather than stage by stage to reduce the calibration tracking constants. Simulation results confirm that the combination calibration has a fast startup process and a short background calibration cycle of 2 × 221 conversions.
Simplified models of the symmetric single-pass parallel-plate counterflow heat exchanger: a tutorial
Abraham-Shrauner, Barbara
2018-01-01
The heat exchanger is important in practical thermal processes, especially those of (i) the molten-salt storage schemes, (ii) compressed air energy storage schemes and (iii) other load-shifting thermal storage presumed to undergird a Smart Grid. Such devices, although central to the utilization of energy from sustainable (but intermittent) renewable sources, will be unfamiliar to many scientists, who nevertheless need a working knowledge of them. This tutorial paper provides a largely self-contained conceptual introduction for such persons. It begins by modelling a novel quantized exchanger,1 impractical as a device, but useful for comprehending the underlying thermophysics. It then reviews the one-dimensional steady-state idealization which demonstrates that effectiveness of heat transfer increases monotonically with (device length)/(device throughput). Next, it presents a two-dimensional steady-state idealization for plug flow and from it derives a novel formula for effectiveness of transfer; this formula is then shown to agree well with a finite-difference time-domain solution of the two-dimensional idealization under Hagen–Poiseuille flow. These results are consistent with a conclusion that effectiveness of heat exchange can approach unity, but may involve unwelcome trade-offs among device cost, size and throughput. PMID:29657769
Pickard, William F; Abraham-Shrauner, Barbara
2018-03-01
The heat exchanger is important in practical thermal processes, especially those of (i) the molten-salt storage schemes, (ii) compressed air energy storage schemes and (iii) other load-shifting thermal storage presumed to undergird a Smart Grid. Such devices, although central to the utilization of energy from sustainable (but intermittent) renewable sources, will be unfamiliar to many scientists, who nevertheless need a working knowledge of them. This tutorial paper provides a largely self-contained conceptual introduction for such persons. It begins by modelling a novel quantized exchanger, impractical as a device, but useful for comprehending the underlying thermophysics. It then reviews the one-dimensional steady-state idealization which demonstrates that effectiveness of heat transfer increases monotonically with (device length)/(device throughput). Next, it presents a two-dimensional steady-state idealization for plug flow and from it derives a novel formula for effectiveness of transfer; this formula is then shown to agree well with a finite-difference time-domain solution of the two-dimensional idealization under Hagen-Poiseuille flow. These results are consistent with a conclusion that effectiveness of heat exchange can approach unity, but may involve unwelcome trade-offs among device cost, size and throughput.