Computationally efficient multibody simulations
NASA Technical Reports Server (NTRS)
Ramakrishnan, Jayant; Kumar, Manoj
1994-01-01
Computationally efficient approaches to the solution of the dynamics of multibody systems are presented in this work. The computational efficiency is derived from both the algorithmic and implementational standpoint. Order(n) approaches provide a new formulation of the equations of motion eliminating the assembly and numerical inversion of a system mass matrix as required by conventional algorithms. Computational efficiency is also gained in the implementation phase by the symbolic processing and parallel implementation of these equations. Comparison of this algorithm with existing multibody simulation programs illustrates the increased computational efficiency.
Parallel implementation of geometrical shock dynamics for two dimensional converging shock waves
NASA Astrophysics Data System (ADS)
Qiu, Shi; Liu, Kuang; Eliasson, Veronica
2016-10-01
Geometrical shock dynamics (GSD) theory is an appealing method to predict the shock motion in the sense that it is more computationally efficient than solving the traditional Euler equations, especially for converging shock waves. However, to solve and optimize large scale configurations, the main bottleneck is the computational cost. Among the existing numerical GSD schemes, there is only one that has been implemented on parallel computers, with the purpose to analyze detonation waves. To extend the computational advantage of the GSD theory to more general applications such as converging shock waves, a numerical implementation using a spatial decomposition method has been coupled with a front tracking approach on parallel computers. In addition, an efficient tridiagonal system solver for massively parallel computers has been applied to resolve the most expensive function in this implementation, resulting in an efficiency of 0.93 while using 32 HPCC cores. Moreover, symmetric boundary conditions have been developed to further reduce the computational cost, achieving a speedup of 19.26 for a 12-sided polygonal converging shock.
Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY
2012-01-10
The present in invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY
2008-01-01
The present in invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
Simulation of synthetic discriminant function optical implementation
NASA Astrophysics Data System (ADS)
Riggins, J.; Butler, S.
1984-12-01
The optical implementation of geometrical shape and synthetic discriminant function matched filters is computer modeled. The filter implementation utilizes the Allebach-Keegan computer-generated hologram algorithm. Signal-to-noise and efficiency measurements were made on the resultant correlation planes.
Application of microarray analysis on computer cluster and cloud platforms.
Bernau, C; Boulesteix, A-L; Knaus, J
2013-01-01
Analysis of recent high-dimensional biological data tends to be computationally intensive as many common approaches such as resampling or permutation tests require the basic statistical analysis to be repeated many times. A crucial advantage of these methods is that they can be easily parallelized due to the computational independence of the resampling or permutation iterations, which has induced many statistics departments to establish their own computer clusters. An alternative is to rent computing resources in the cloud, e.g. at Amazon Web Services. In this article we analyze whether a selection of statistical projects, recently implemented at our department, can be efficiently realized on these cloud resources. Moreover, we illustrate an opportunity to combine computer cluster and cloud resources. In order to compare the efficiency of computer cluster and cloud implementations and their respective parallelizations we use microarray analysis procedures and compare their runtimes on the different platforms. Amazon Web Services provide various instance types which meet the particular needs of the different statistical projects we analyzed in this paper. Moreover, the network capacity is sufficient and the parallelization is comparable in efficiency to standard computer cluster implementations. Our results suggest that many statistical projects can be efficiently realized on cloud resources. It is important to mention, however, that workflows can change substantially as a result of a shift from computer cluster to cloud computing.
Efficiently modeling neural networks on massively parallel computers
NASA Technical Reports Server (NTRS)
Farber, Robert M.
1993-01-01
Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors)). We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper will consider the simulation of only feed forward neural network although this method is extendable to recurrent networks.
NASA Astrophysics Data System (ADS)
Jimenez, Edward S.; Goodman, Eric L.; Park, Ryeojin; Orr, Laurel J.; Thompson, Kyle R.
2014-09-01
This paper will investigate energy-efficiency for various real-world industrial computed-tomography reconstruction algorithms, both CPU- and GPU-based implementations. This work shows that the energy required for a given reconstruction is based on performance and problem size. There are many ways to describe performance and energy efficiency, thus this work will investigate multiple metrics including performance-per-watt, energy-delay product, and energy consumption. This work found that irregular GPU-based approaches1 realized tremendous savings in energy consumption when compared to CPU implementations while also significantly improving the performance-per- watt and energy-delay product metrics. Additional energy savings and other metric improvement was realized on the GPU-based reconstructions by improving storage I/O by implementing a parallel MIMD-like modularization of the compute and I/O tasks.
Adly, Amr A.; Abd-El-Hafiz, Salwa K.
2012-01-01
Incorporation of hysteresis models in electromagnetic analysis approaches is indispensable to accurate field computation in complex magnetic media. Throughout those computations, vector nature and computational efficiency of such models become especially crucial when sophisticated geometries requiring massive sub-region discretization are involved. Recently, an efficient vector Preisach-type hysteresis model constructed from only two scalar models having orthogonally coupled elementary operators has been proposed. This paper presents a novel Hopfield neural network approach for the implementation of Stoner–Wohlfarth-like operators that could lead to a significant enhancement in the computational efficiency of the aforementioned model. Advantages of this approach stem from the non-rectangular nature of these operators that substantially minimizes the number of operators needed to achieve an accurate vector hysteresis model. Details of the proposed approach, its identification and experimental testing are presented in the paper. PMID:25685446
Implementation of GAMMON - An efficient load balancing strategy for a local computer system
NASA Technical Reports Server (NTRS)
Baumgartner, Katherine M.; Kling, Ralph M.; Wah, Benjamin W.
1989-01-01
GAMMON (Global Allocation from Maximum to Minimum in cONstant time), an efficient load-balancing algorithm, is described. GAMMON uses the available broadcast capability of multiaccess networks to implement an efficient search technique for finding hosts with maximal and minimal loads. The search technique has an average overhead which is independent of the number of participating stations. The transition from the theoretical concept to a practical, reliable, and efficient implementation is described.
Multiple multicontrol unitary operations: Implementation and applications
NASA Astrophysics Data System (ADS)
Lin, Qing
2018-04-01
The efficient implementation of computational tasks is critical to quantum computations. In quantum circuits, multicontrol unitary operations are important components. Here, we present an extremely efficient and direct approach to multiple multicontrol unitary operations without decomposition to CNOT and single-photon gates. With the proposed approach, the necessary two-photon operations could be reduced from O( n 3) with the traditional decomposition approach to O( n), which will greatly relax the requirements and make large-scale quantum computation feasible. Moreover, we propose the potential application to the ( n- k)-uniform hypergraph state.
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak
1999-01-01
The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2OOO, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2OOO, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
NASA Astrophysics Data System (ADS)
Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.
2017-12-01
This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
CFD Analysis and Design Optimization Using Parallel Computers
NASA Technical Reports Server (NTRS)
Martinelli, Luigi; Alonso, Juan Jose; Jameson, Antony; Reuther, James
1997-01-01
A versatile and efficient multi-block method is presented for the simulation of both steady and unsteady flow, as well as aerodynamic design optimization of complete aircraft configurations. The compressible Euler and Reynolds Averaged Navier-Stokes (RANS) equations are discretized using a high resolution scheme on body-fitted structured meshes. An efficient multigrid implicit scheme is implemented for time-accurate flow calculations. Optimum aerodynamic shape design is achieved at very low cost using an adjoint formulation. The method is implemented on parallel computing systems using the MPI message passing interface standard to ensure portability. The results demonstrate that, by combining highly efficient algorithms with parallel computing, it is possible to perform detailed steady and unsteady analysis as well as automatic design for complex configurations using the present generation of parallel computers.
Scemama, Anthony; Caffarel, Michel; Oseret, Emmanuel; Jalby, William
2013-04-30
Various strategies to implement efficiently quantum Monte Carlo (QMC) simulations for large chemical systems are presented. These include: (i) the introduction of an efficient algorithm to calculate the computationally expensive Slater matrices. This novel scheme is based on the use of the highly localized character of atomic Gaussian basis functions (not the molecular orbitals as usually done), (ii) the possibility of keeping the memory footprint minimal, (iii) the important enhancement of single-core performance when efficient optimization tools are used, and (iv) the definition of a universal, dynamic, fault-tolerant, and load-balanced framework adapted to all kinds of computational platforms (massively parallel machines, clusters, or distributed grids). These strategies have been implemented in the QMC=Chem code developed at Toulouse and illustrated with numerical applications on small peptides of increasing sizes (158, 434, 1056, and 1731 electrons). Using 10-80 k computing cores of the Curie machine (GENCI-TGCC-CEA, France), QMC=Chem has been shown to be capable of running at the petascale level, thus demonstrating that for this machine a large part of the peak performance can be achieved. Implementation of large-scale QMC simulations for future exascale platforms with a comparable level of efficiency is expected to be feasible. Copyright © 2013 Wiley Periodicals, Inc.
Multiresolution molecular mechanics: Implementation and efficiency
NASA Astrophysics Data System (ADS)
Biyikli, Emre; To, Albert C.
2017-01-01
Atomistic/continuum coupling methods combine accurate atomistic methods and efficient continuum methods to simulate the behavior of highly ordered crystalline systems. Coupled methods utilize the advantages of both approaches to simulate systems at a lower computational cost, while retaining the accuracy associated with atomistic methods. Many concurrent atomistic/continuum coupling methods have been proposed in the past; however, their true computational efficiency has not been demonstrated. The present work presents an efficient implementation of a concurrent coupling method called the Multiresolution Molecular Mechanics (MMM) for serial, parallel, and adaptive analysis. First, we present the features of the software implemented along with the associated technologies. The scalability of the software implementation is demonstrated, and the competing effects of multiscale modeling and parallelization are discussed. Then, the algorithms contributing to the efficiency of the software are presented. These include algorithms for eliminating latent ghost atoms from calculations and measurement-based dynamic balancing of parallel workload. The efficiency improvements made by these algorithms are demonstrated by benchmark tests. The efficiency of the software is found to be on par with LAMMPS, a state-of-the-art Molecular Dynamics (MD) simulation code, when performing full atomistic simulations. Speed-up of the MMM method is shown to be directly proportional to the reduction of the number of the atoms visited in force computation. Finally, an adaptive MMM analysis on a nanoindentation problem, containing over a million atoms, is performed, yielding an improvement of 6.3-8.5 times in efficiency, over the full atomistic MD method. For the first time, the efficiency of a concurrent atomistic/continuum coupling method is comprehensively investigated and demonstrated.
Efficient parallelization of analytic bond-order potentials for large-scale atomistic simulations
NASA Astrophysics Data System (ADS)
Teijeiro, C.; Hammerschmidt, T.; Drautz, R.; Sutmann, G.
2016-07-01
Analytic bond-order potentials (BOPs) provide a way to compute atomistic properties with controllable accuracy. For large-scale computations of heterogeneous compounds at the atomistic level, both the computational efficiency and memory demand of BOP implementations have to be optimized. Since the evaluation of BOPs is a local operation within a finite environment, the parallelization concepts known from short-range interacting particle simulations can be applied to improve the performance of these simulations. In this work, several efficient parallelization methods for BOPs that use three-dimensional domain decomposition schemes are described. The schemes are implemented into the bond-order potential code BOPfox, and their performance is measured in a series of benchmarks. Systems of up to several millions of atoms are simulated on a high performance computing system, and parallel scaling is demonstrated for up to thousands of processors.
Information processing using a single dynamical node as complex system
Appeltant, L.; Soriano, M.C.; Van der Sande, G.; Danckaert, J.; Massar, S.; Dambre, J.; Schrauwen, B.; Mirasso, C.R.; Fischer, I.
2011-01-01
Novel methods for information processing are highly desired in our information-driven society. Inspired by the brain's ability to process information, the recently introduced paradigm known as 'reservoir computing' shows that complex networks can efficiently perform computation. Here we introduce a novel architecture that reduces the usually required large number of elements to a single nonlinear node with delayed feedback. Through an electronic implementation, we experimentally and numerically demonstrate excellent performance in a speech recognition benchmark. Complementary numerical studies also show excellent performance for a time series prediction benchmark. These results prove that delay-dynamical systems, even in their simplest manifestation, can perform efficient information processing. This finding paves the way to feasible and resource-efficient technological implementations of reservoir computing. PMID:21915110
Improving the Efficiency of Free Energy Calculations in the Amber Molecular Dynamics Package.
Kaus, Joseph W; Pierce, Levi T; Walker, Ross C; McCammont, J Andrew
2013-09-10
Alchemical transformations are widely used methods to calculate free energies. Amber has traditionally included support for alchemical transformations as part of the sander molecular dynamics (MD) engine. Here we describe the implementation of a more efficient approach to alchemical transformations in the Amber MD package. Specifically we have implemented this new approach within the more computational efficient and scalable pmemd MD engine that is included with the Amber MD package. The majority of the gain in efficiency comes from the improved design of the calculation, which includes better parallel scaling and reduction in the calculation of redundant terms. This new implementation is able to reproduce results from equivalent simulations run with the existing functionality, but at 2.5 times greater computational efficiency. This new implementation is also able to run softcore simulations at the λ end states making direct calculation of free energies more accurate, compared to the extrapolation required in the existing implementation. The updated alchemical transformation functionality will be included in the next major release of Amber (scheduled for release in Q1 2014) and will be available at http://ambermd.org, under the Amber license.
Improving the Efficiency of Free Energy Calculations in the Amber Molecular Dynamics Package
Pierce, Levi T.; Walker, Ross C.; McCammont, J. Andrew
2013-01-01
Alchemical transformations are widely used methods to calculate free energies. Amber has traditionally included support for alchemical transformations as part of the sander molecular dynamics (MD) engine. Here we describe the implementation of a more efficient approach to alchemical transformations in the Amber MD package. Specifically we have implemented this new approach within the more computational efficient and scalable pmemd MD engine that is included with the Amber MD package. The majority of the gain in efficiency comes from the improved design of the calculation, which includes better parallel scaling and reduction in the calculation of redundant terms. This new implementation is able to reproduce results from equivalent simulations run with the existing functionality, but at 2.5 times greater computational efficiency. This new implementation is also able to run softcore simulations at the λ end states making direct calculation of free energies more accurate, compared to the extrapolation required in the existing implementation. The updated alchemical transformation functionality will be included in the next major release of Amber (scheduled for release in Q1 2014) and will be available at http://ambermd.org, under the Amber license. PMID:24185531
Efficient FIR Filter Implementations for Multichannel BCIs Using Xilinx System Generator.
Ghani, Usman; Wasim, Muhammad; Khan, Umar Shahbaz; Mubasher Saleem, Muhammad; Hassan, Ali; Rashid, Nasir; Islam Tiwana, Mohsin; Hamza, Amir; Kashif, Amir
2018-01-01
Background . Brain computer interface (BCI) is a combination of software and hardware communication protocols that allow brain to control external devices. Main purpose of BCI controlled external devices is to provide communication medium for disabled persons. Now these devices are considered as a new way to rehabilitate patients with impunities. There are certain potentials present in electroencephalogram (EEG) that correspond to specific event. Main issue is to detect such event related potentials online in such a low signal to noise ratio (SNR). In this paper we propose a method that will facilitate the concept of online processing by providing an efficient filtering implementation in a hardware friendly environment by switching to finite impulse response (FIR). Main focus of this research is to minimize latency and computational delay of preprocessing related to any BCI application. Four different finite impulse response (FIR) implementations along with large Laplacian filter are implemented in Xilinx System Generator. Efficiency of 25% is achieved in terms of reduced number of coefficients and multiplications which in turn reduce computational delays accordingly.
Computing rank-revealing QR factorizations of dense matrices.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bischof, C. H.; Quintana-Orti, G.; Mathematics and Computer Science
1998-06-01
We develop algorithms and implementations for computing rank-revealing QR (RRQR) factorizations of dense matrices. First, we develop an efficient block algorithm for approximating an RRQR factorization, employing a windowed version of the commonly used Golub pivoting strategy, aided by incremental condition estimation. Second, we develop efficiently implementable variants of guaranteed reliable RRQR algorithms for triangular matrices originally suggested by Chandrasekaran and Ipsen and by Pan and Tang. We suggest algorithmic improvements with respect to condition estimation, termination criteria, and Givens updating. By combining the block algorithm with one of the triangular postprocessing steps, we arrive at an efficient and reliablemore » algorithm for computing an RRQR factorization of a dense matrix. Experimental results on IBM RS/6000 SGI R8000 platforms show that this approach performs up to three times faster that the less reliable QR factorization with column pivoting as it is currently implemented in LAPACK, and comes within 15% of the performance of the LAPACK block algorithm for computing a QR factorization without any column exchanges. Thus, we expect this routine to be useful in may circumstances where numerical rank deficiency cannot be ruled out, but currently has been ignored because of the computational cost of dealing with it.« less
Analyzing Team Based Engineering Design Process in Computer Supported Collaborative Learning
ERIC Educational Resources Information Center
Lee, Dong-Kuk; Lee, Eun-Sang
2016-01-01
The engineering design process has been largely implemented in a collaborative project format. Recently, technological advancement has helped collaborative problem solving processes such as engineering design to have efficient implementation using computers or online technology. In this study, we investigated college students' interaction and…
NASA Technical Reports Server (NTRS)
Miura, H.; Schmit, L. A., Jr.
1976-01-01
The program documentation and user's guide for the ACCESS-1 computer program is presented. ACCESS-1 is a research oriented program which implements a collection of approximation concepts to achieve excellent efficiency in structural synthesis. The finite element method is used for structural analysis and general mathematical programming algorithms are applied in the design optimization procedure. Implementation of the computer program, preparation of input data and basic program structure are described, and three illustrative examples are given.
I/O-Efficient Scientific Computation Using TPIE
NASA Technical Reports Server (NTRS)
Vengroff, Darren Erik; Vitter, Jeffrey Scott
1996-01-01
In recent years, input/output (I/O)-efficient algorithms for a wide variety of problems have appeared in the literature. However, systems specifically designed to assist programmers in implementing such algorithms have remained scarce. TPIE is a system designed to support I/O-efficient paradigms for problems from a variety of domains, including computational geometry, graph algorithms, and scientific computation. The TPIE interface frees programmers from having to deal not only with explicit read and write calls, but also the complex memory management that must be performed for I/O-efficient computation. In this paper we discuss applications of TPIE to problems in scientific computation. We discuss algorithmic issues underlying the design and implementation of the relevant components of TPIE and present performance results of programs written to solve a series of benchmark problems using our current TPIE prototype. Some of the benchmarks we present are based on the NAS parallel benchmarks while others are of our own creation. We demonstrate that the central processing unit (CPU) overhead required to manage I/O is small and that even with just a single disk, the I/O overhead of I/O-efficient computation ranges from negligible to the same order of magnitude as CPU time. We conjecture that if we use a number of disks in parallel this overhead can be all but eliminated.
FastMag: Fast micromagnetic simulator for complex magnetic structures (invited)
NASA Astrophysics Data System (ADS)
Chang, R.; Li, S.; Lubarda, M. V.; Livshitz, B.; Lomakin, V.
2011-04-01
A fast micromagnetic simulator (FastMag) for general problems is presented. FastMag solves the Landau-Lifshitz-Gilbert equation and can handle multiscale problems with a high computational efficiency. The simulator derives its high performance from efficient methods for evaluating the effective field and from implementations on massively parallel graphics processing unit (GPU) architectures. FastMag discretizes the computational domain into tetrahedral elements and therefore is highly flexible for general problems. The magnetostatic field is computed via the superposition principle for both volume and surface parts of the computational domain. This is accomplished by implementing efficient quadrature rules and analytical integration for overlapping elements in which the integral kernel is singular. Thus, discretized superposition integrals are computed using a nonuniform grid interpolation method, which evaluates the field from N sources at N collocated observers in O(N) operations. This approach allows handling objects of arbitrary shape, allows easily calculating of the field outside the magnetized domains, does not require solving a linear system of equations, and requires little memory. FastMag is implemented on GPUs with ?> GPU-central processing unit speed-ups of 2 orders of magnitude. Simulations are shown of a large array of magnetic dots and a recording head fully discretized down to the exchange length, with over a hundred million tetrahedral elements on an inexpensive desktop computer.
Optoelectronic Reservoir Computing
Paquot, Y.; Duport, F.; Smerieri, A.; Dambre, J.; Schrauwen, B.; Haelterman, M.; Massar, S.
2012-01-01
Reservoir computing is a recently introduced, highly efficient bio-inspired approach for processing time dependent data. The basic scheme of reservoir computing consists of a non linear recurrent dynamical system coupled to a single input layer and a single output layer. Within these constraints many implementations are possible. Here we report an optoelectronic implementation of reservoir computing based on a recently proposed architecture consisting of a single non linear node and a delay line. Our implementation is sufficiently fast for real time information processing. We illustrate its performance on tasks of practical importance such as nonlinear channel equalization and speech recognition, and obtain results comparable to state of the art digital implementations. PMID:22371825
Efficient quantum walk on a quantum processor
Qiang, Xiaogang; Loke, Thomas; Montanaro, Ashley; Aungskunsiri, Kanin; Zhou, Xiaoqi; O'Brien, Jeremy L.; Wang, Jingbo B.; Matthews, Jonathan C. F.
2016-01-01
The random walk formalism is used across a wide range of applications, from modelling share prices to predicting population genetics. Likewise, quantum walks have shown much potential as a framework for developing new quantum algorithms. Here we present explicit efficient quantum circuits for implementing continuous-time quantum walks on the circulant class of graphs. These circuits allow us to sample from the output probability distributions of quantum walks on circulant graphs efficiently. We also show that solving the same sampling problem for arbitrary circulant quantum circuits is intractable for a classical computer, assuming conjectures from computational complexity theory. This is a new link between continuous-time quantum walks and computational complexity theory and it indicates a family of tasks that could ultimately demonstrate quantum supremacy over classical computers. As a proof of principle, we experimentally implement the proposed quantum circuit on an example circulant graph using a two-qubit photonics quantum processor. PMID:27146471
Implementing Parquet equations using HPX
NASA Astrophysics Data System (ADS)
Kellar, Samuel; Wagle, Bibek; Yang, Shuxiang; Tam, Ka-Ming; Kaiser, Hartmut; Moreno, Juana; Jarrell, Mark
A new C++ runtime system (HPX) enables simulations of complex systems to run more efficiently on parallel and heterogeneous systems. This increased efficiency allows for solutions to larger simulations of the parquet approximation for a system with impurities. The relevancy of the parquet equations depends upon the ability to solve systems which require long runs and large amounts of memory. These limitations, in addition to numerical complications arising from stability of the solutions, necessitate running on large distributed systems. As the computational resources trend towards the exascale and the limitations arising from computational resources vanish efficiency of large scale simulations becomes a focus. HPX facilitates efficient simulations through intelligent overlapping of computation and communication. Simulations such as the parquet equations which require the transfer of large amounts of data should benefit from HPX implementations. Supported by the the NSF EPSCoR Cooperative Agreement No. EPS-1003897 with additional support from the Louisiana Board of Regents.
Computationally efficient method for optical simulation of solar cells and their applications
NASA Astrophysics Data System (ADS)
Semenikhin, I.; Zanuccoli, M.; Fiegna, C.; Vyurkov, V.; Sangiorgi, E.
2013-01-01
This paper presents two novel implementations of the Differential method to solve the Maxwell equations in nanostructured optoelectronic solid state devices. The first proposed implementation is based on an improved and computationally efficient T-matrix formulation that adopts multiple-precision arithmetic to tackle the numerical instability problem which arises due to evanescent modes. The second implementation adopts the iterative approach that allows to achieve low computational complexity O(N logN) or better. The proposed algorithms may work with structures with arbitrary spatial variation of the permittivity. The developed two-dimensional numerical simulator is applied to analyze the dependence of the absorption characteristics of a thin silicon slab on the morphology of the front interface and on the angle of incidence of the radiation with respect to the device surface.
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
New Media Resistance: Barriers to Implementation of Computer Video Games in the Classroom
ERIC Educational Resources Information Center
Rice, John W.
2007-01-01
Computer video games are an emerging instructional medium offering strong degrees of cognitive efficiencies for experiential learning, team building, and greater understanding of abstract concepts. As with other new media adopted for use by instructional technologists for pedagogical purposes, barriers to classroom implementation have manifested…
Efficient implementation of a 3-dimensional ADI method on the iPSC/860
DOE Office of Scientific and Technical Information (OSTI.GOV)
Van der Wijngaart, R.F.
1993-12-31
A comparison is made between several domain decomposition strategies for the solution of three-dimensional partial differential equations on a MIMD distributed memory parallel computer. The grids used are structured, and the numerical algorithm is ADI. Important implementation issues regarding load balancing, storage requirements, network latency, and overlap of computations and communications are discussed. Results of the solution of the three-dimensional heat equation on the Intel iPSC/860 are presented for the three most viable methods. It is found that the Bruno-Cappello decomposition delivers optimal computational speed through an almost complete elimination of processor idle time, while providing good memory efficiency.
Automated Development of Accurate Algorithms and Efficient Codes for Computational Aeroacoustics
NASA Technical Reports Server (NTRS)
Goodrich, John W.; Dyson, Rodger W.
1999-01-01
The simulation of sound generation and propagation in three space dimensions with realistic aircraft components is a very large time dependent computation with fine details. Simulations in open domains with embedded objects require accurate and robust algorithms for propagation, for artificial inflow and outflow boundaries, and for the definition of geometrically complex objects. The development, implementation, and validation of methods for solving these demanding problems is being done to support the NASA pillar goals for reducing aircraft noise levels. Our goal is to provide algorithms which are sufficiently accurate and efficient to produce usable results rapidly enough to allow design engineers to study the effects on sound levels of design changes in propulsion systems, and in the integration of propulsion systems with airframes. There is a lack of design tools for these purposes at this time. Our technical approach to this problem combines the development of new, algorithms with the use of Mathematica and Unix utilities to automate the algorithm development, code implementation, and validation. We use explicit methods to ensure effective implementation by domain decomposition for SPMD parallel computing. There are several orders of magnitude difference in the computational efficiencies of the algorithms which we have considered. We currently have new artificial inflow and outflow boundary conditions that are stable, accurate, and unobtrusive, with implementations that match the accuracy and efficiency of the propagation methods. The artificial numerical boundary treatments have been proven to have solutions which converge to the full open domain problems, so that the error from the boundary treatments can be driven as low as is required. The purpose of this paper is to briefly present a method for developing highly accurate algorithms for computational aeroacoustics, the use of computer automation in this process, and a brief survey of the algorithms that have resulted from this work. A review of computational aeroacoustics has recently been given by Lele.
NASA Technical Reports Server (NTRS)
Saleeb, A. F.; Chang, T. Y. P.; Wilt, T.; Iskovitz, I.
1989-01-01
The research work performed during the past year on finite element implementation and computational techniques pertaining to high temperature composites is outlined. In the present research, two main issues are addressed: efficient geometric modeling of composite structures and expedient numerical integration techniques dealing with constitutive rate equations. In the first issue, mixed finite elements for modeling laminated plates and shells were examined in terms of numerical accuracy, locking property and computational efficiency. Element applications include (currently available) linearly elastic analysis and future extension to material nonlinearity for damage predictions and large deformations. On the material level, various integration methods to integrate nonlinear constitutive rate equations for finite element implementation were studied. These include explicit, implicit and automatic subincrementing schemes. In all cases, examples are included to illustrate the numerical characteristics of various methods that were considered.
NASA Astrophysics Data System (ADS)
Shcherbakov, V.; Ahlkrona, J.
2016-12-01
In this work we develop a highly efficient meshfree approach to ice sheet modeling. Traditionally mesh based methods such as finite element methods are employed to simulate glacier and ice sheet dynamics. These methods are mature and well developed. However, despite of numerous advantages these methods suffer from some drawbacks such as necessity to remesh the computational domain every time it changes its shape, which significantly complicates the implementation on moving domains, or a costly assembly procedure for nonlinear problems. We introduce a novel meshfree approach that frees us from all these issues. The approach is built upon a radial basis function (RBF) method that, thanks to its meshfree nature, allows for an efficient handling of moving margins and free ice surface. RBF methods are also accurate and easy to implement. Since the formulation is stated in strong form it allows for a substantial reduction of the computational cost associated with the linear system assembly inside the nonlinear solver. We implement a global RBF method that defines an approximation on the entire computational domain. This method exhibits high accuracy properties. However, it suffers from a disadvantage that the coefficient matrix is dense, and therefore the computational efficiency decreases. In order to overcome this issue we also implement a localized RBF method that rests upon a partition of unity approach to subdivide the domain into several smaller subdomains. The radial basis function partition of unity method (RBF-PUM) inherits high approximation characteristics form the global RBF method while resulting in a sparse system of equations, which essentially increases the computational efficiency. To demonstrate the usefulness of the RBF methods we model the velocity field of ice flow in the Haut Glacier d'Arolla. We assume that the flow is governed by the nonlinear Blatter-Pattyn equations. We test the methods for different basal conditions and for a free moving surface. Both RBF methods are compared with a classical finite element method in terms of accuracy and efficiency. We find that the RBF methods are more efficient than the finite element method and well suited for ice dynamics modeling, especially the partition of unity approach.
Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU
NASA Astrophysics Data System (ADS)
Trędak, Przemysław; Rudnicki, Witold R.; Majewski, Jacek A.
2016-09-01
The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range hydrocarbon materials. It is also extensible to other atom types and interactions. REBO potential assumes complex multi-body interaction model, that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of molecular dynamics algorithm using REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronizations issues that arise in computations of multi-body potential. Techniques developed for this problem may be also used to achieve efficient solutions of different problems. The performance of proposed algorithm is assessed using a range of model systems. It is compared to highly optimized CPU implementation (both single core and OpenMP) available in LAMMPS package. These experiments show up to 6x improvement in forces computation time using single processor of the NVIDIA Tesla K80 compared to high end 16-core Intel Xeon processor.
Efficient computation of hashes
NASA Astrophysics Data System (ADS)
Lopes, Raul H. C.; Franqueira, Virginia N. L.; Hobson, Peter R.
2014-06-01
The sequential computation of hashes at the core of many distributed storage systems and found, for example, in grid services can hinder efficiency in service quality and even pose security challenges that can only be addressed by the use of parallel hash tree modes. The main contributions of this paper are, first, the identification of several efficiency and security challenges posed by the use of sequential hash computation based on the Merkle-Damgard engine. In addition, alternatives for the parallel computation of hash trees are discussed, and a prototype for a new parallel implementation of the Keccak function, the SHA-3 winner, is introduced.
Scientific Discovery through Advanced Computing (SciDAC-3) Partnership Project Annual Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoffman, Forest M.; Bochev, Pavel B.; Cameron-Smith, Philip J..
The Applying Computationally Efficient Schemes for BioGeochemical Cycles ACES4BGC Project is advancing the predictive capabilities of Earth System Models (ESMs) by reducing two of the largest sources of uncertainty, aerosols and biospheric feedbacks, with a highly efficient computational approach. In particular, this project is implementing and optimizing new computationally efficient tracer advection algorithms for large numbers of tracer species; adding important biogeochemical interactions between the atmosphere, land, and ocean models; and applying uncertainty quanti cation (UQ) techniques to constrain process parameters and evaluate uncertainties in feedbacks between biogeochemical cycles and the climate system.
Macías-Díaz, J E; Macías, Siegfried; Medina-Ramírez, I E
2013-12-01
In this manuscript, we present a computational model to approximate the solutions of a partial differential equation which describes the growth dynamics of microbial films. The numerical technique reported in this work is an explicit, nonlinear finite-difference methodology which is computationally implemented using Newton's method. Our scheme is compared numerically against an implicit, linear finite-difference discretization of the same partial differential equation, whose computer coding requires an implementation of the stabilized bi-conjugate gradient method. Our numerical results evince that the nonlinear approach results in a more efficient approximation to the solutions of the biofilm model considered, and demands less computer memory. Moreover, the positivity of initial profiles is preserved in the practice by the nonlinear scheme proposed. Copyright © 2013 Elsevier Ltd. All rights reserved.
Aguilar, I; Misztal, I; Legarra, A; Tsuruta, S
2011-12-01
Genomic evaluations can be calculated using a unified procedure that combines phenotypic, pedigree and genomic information. Implementation of such a procedure requires the inverse of the relationship matrix based on pedigree and genomic relationships. The objective of this study was to investigate efficient computing options to create relationship matrices based on genomic markers and pedigree information as well as their inverses. SNP maker information was simulated for a panel of 40 K SNPs, with the number of genotyped animals up to 30 000. Matrix multiplication in the computation of the genomic relationship was by a simple 'do' loop, by two optimized versions of the loop, and by a specific matrix multiplication subroutine. Inversion was by a generalized inverse algorithm and by a LAPACK subroutine. With the most efficient choices and parallel processing, creation of matrices for 30 000 animals would take a few hours. Matrices required to implement a unified approach can be computed efficiently. Optimizations can be either by modifications of existing code or by the use of efficient automatic optimizations provided by open source or third-party libraries. © 2011 Blackwell Verlag GmbH.
Implementation of Premixed Equilibrium Chemistry Capability in OVERFLOW
NASA Technical Reports Server (NTRS)
Olsen, M. E.; Liu, Y.; Vinokur, M.; Olsen, T.
2003-01-01
An implementation of premixed equilibrium chemistry has been completed for the OVERFLOW code, a chimera capable, complex geometry flow code widely used to predict transonic flowfields. The implementation builds on the computational efficiency and geometric generality of the solver.
Implementation of Premixed Equilibrium Chemistry Capability in OVERFLOW
NASA Technical Reports Server (NTRS)
Olsen, Mike E.; Liu, Yen; Vinokur, M.; Olsen, Tom
2004-01-01
An implementation of premixed equilibrium chemistry has been completed for the OVERFLOW code, a chimera capable, complex geometry flow code widely used to predict transonic flowfields. The implementation builds on the computational efficiency and geometric generality of the solver.
On the impact of approximate computation in an analog DeSTIN architecture.
Young, Steven; Lu, Junjie; Holleman, Jeremy; Arel, Itamar
2014-05-01
Deep machine learning (DML) holds the potential to revolutionize machine learning by automating rich feature extraction, which has become the primary bottleneck of human engineering in pattern recognition systems. However, the heavy computational burden renders DML systems implemented on conventional digital processors impractical for large-scale problems. The highly parallel computations required to implement large-scale deep learning systems are well suited to custom hardware. Analog computation has demonstrated power efficiency advantages of multiple orders of magnitude relative to digital systems while performing nonideal computations. In this paper, we investigate typical error sources introduced by analog computational elements and their impact on system-level performance in DeSTIN--a compositional deep learning architecture. These inaccuracies are evaluated on a pattern classification benchmark, clearly demonstrating the robustness of the underlying algorithm to the errors introduced by analog computational elements. A clear understanding of the impacts of nonideal computations is necessary to fully exploit the efficiency of analog circuits.
NASA Astrophysics Data System (ADS)
Pan, Bing; Wang, Bo
2017-10-01
Digital volume correlation (DVC) is a powerful technique for quantifying interior deformation within solid opaque materials and biological tissues. In the last two decades, great efforts have been made to improve the accuracy and efficiency of the DVC algorithm. However, there is still a lack of a flexible, robust and accurate version that can be efficiently implemented in personal computers with limited RAM. This paper proposes an advanced DVC method that can realize accurate full-field internal deformation measurement applicable to high-resolution volume images with up to billions of voxels. Specifically, a novel layer-wise reliability-guided displacement tracking strategy combined with dynamic data management is presented to guide the DVC computation from slice to slice. The displacements at specified calculation points in each layer are computed using the advanced 3D inverse-compositional Gauss-Newton algorithm with the complete initial guess of the deformation vector accurately predicted from the computed calculation points. Since only limited slices of interest in the reference and deformed volume images rather than the whole volume images are required, the DVC calculation can thus be efficiently implemented on personal computers. The flexibility, accuracy and efficiency of the presented DVC approach are demonstrated by analyzing computer-simulated and experimentally obtained high-resolution volume images.
MPI_XSTAR: MPI-based Parallelization of the XSTAR Photoionization Program
NASA Astrophysics Data System (ADS)
Danehkar, Ashkbiz; Nowak, Michael A.; Lee, Julia C.; Smith, Randall K.
2018-02-01
We describe a program for the parallel implementation of multiple runs of XSTAR, a photoionization code that is used to predict the physical properties of an ionized gas from its emission and/or absorption lines. The parallelization program, called MPI_XSTAR, has been developed and implemented in the C++ language by using the Message Passing Interface (MPI) protocol, a conventional standard of parallel computing. We have benchmarked parallel multiprocessing executions of XSTAR, using MPI_XSTAR, against a serial execution of XSTAR, in terms of the parallelization speedup and the computing resource efficiency. Our experience indicates that the parallel execution runs significantly faster than the serial execution, however, the efficiency in terms of the computing resource usage decreases with increasing the number of processors used in the parallel computing.
Meshfree and efficient modeling of swimming cells
NASA Astrophysics Data System (ADS)
Gallagher, Meurig T.; Smith, David J.
2018-05-01
Locomotion in Stokes flow is an intensively studied problem because it describes important biological phenomena such as the motility of many species' sperm, bacteria, algae, and protozoa. Numerical computations can be challenging, particularly in three dimensions, due to the presence of moving boundaries and complex geometries; methods which combine ease of implementation and computational efficiency are therefore needed. A recently proposed method to discretize the regularized Stokeslet boundary integral equation without the need for a connected mesh is applied to the inertialess locomotion problem in Stokes flow. The mathematical formulation and key aspects of the computational implementation in matlab® or GNU Octave are described, followed by numerical experiments with biflagellate algae and multiple uniflagellate sperm swimming between no-slip surfaces, for which both swimming trajectories and flow fields are calculated. These computational experiments required minutes of time on modest hardware; an extensible implementation is provided in a GitHub repository. The nearest-neighbor discretization dramatically improves convergence and robustness, a key challenge in extending the regularized Stokeslet method to complicated three-dimensional biological fluid problems.
Efficient algorithms and implementations of entropy-based moment closures for rarefied gases
NASA Astrophysics Data System (ADS)
Schaerer, Roman Pascal; Bansal, Pratyuksh; Torrilhon, Manuel
2017-07-01
We present efficient algorithms and implementations of the 35-moment system equipped with the maximum-entropy closure in the context of rarefied gases. While closures based on the principle of entropy maximization have been shown to yield very promising results for moderately rarefied gas flows, the computational cost of these closures is in general much higher than for closure theories with explicit closed-form expressions of the closing fluxes, such as Grad's classical closure. Following a similar approach as Garrett et al. (2015) [13], we investigate efficient implementations of the computationally expensive numerical quadrature method used for the moment evaluations of the maximum-entropy distribution by exploiting its inherent fine-grained parallelism with the parallelism offered by multi-core processors and graphics cards. We show that using a single graphics card as an accelerator allows speed-ups of two orders of magnitude when compared to a serial CPU implementation. To accelerate the time-to-solution for steady-state problems, we propose a new semi-implicit time discretization scheme. The resulting nonlinear system of equations is solved with a Newton type method in the Lagrange multipliers of the dual optimization problem in order to reduce the computational cost. Additionally, fully explicit time-stepping schemes of first and second order accuracy are presented. We investigate the accuracy and efficiency of the numerical schemes for several numerical test cases, including a steady-state shock-structure problem.
An Initial Multi-Domain Modeling of an Actively Cooled Structure
NASA Technical Reports Server (NTRS)
Steinthorsson, Erlendur
1997-01-01
A methodology for the simulation of turbine cooling flows is being developed. The methodology seeks to combine numerical techniques that optimize both accuracy and computational efficiency. Key components of the methodology include the use of multiblock grid systems for modeling complex geometries, and multigrid convergence acceleration for enhancing computational efficiency in highly resolved fluid flow simulations. The use of the methodology has been demonstrated in several turbo machinery flow and heat transfer studies. Ongoing and future work involves implementing additional turbulence models, improving computational efficiency, adding AMR.
Lessons Learned in Designing and Implementing a Computer-Adaptive Test for English
ERIC Educational Resources Information Center
Burston, Jack; Neophytou, Maro
2014-01-01
This paper describes the lessons learned in designing and implementing a computer-adaptive test (CAT) for English. The early identification of students with weak L2 English proficiency is of critical importance in university settings that have compulsory English language course graduation requirements. The most efficient means of diagnosing the L2…
NASA Technical Reports Server (NTRS)
Habiby, Sarry F.
1987-01-01
The design and implementation of a digital (numerical) optical matrix-vector multiplier are presented. The objective is to demonstrate the operation of an optical processor designed to minimize computation time in performing a practical computing application. This is done by using the large array of processing elements in a Hughes liquid crystal light valve, and relying on the residue arithmetic representation, a holographic optical memory, and position coded optical look-up tables. In the design, all operations are performed in effectively one light valve response time regardless of matrix size. The features of the design allowing fast computation include the residue arithmetic representation, the mapping approach to computation, and the holographic memory. In addition, other features of the work include a practical light valve configuration for efficient polarization control, a model for recording multiple exposures in silver halides with equal reconstruction efficiency, and using light from an optical fiber for a reference beam source in constructing the hologram. The design can be extended to implement larger matrix arrays without increasing computation time.
Practical experimental certification of computational quantum gates using a twirling procedure.
Moussa, Osama; da Silva, Marcus P; Ryan, Colm A; Laflamme, Raymond
2012-08-17
Because of the technical difficulty of building large quantum computers, it is important to be able to estimate how faithful a given implementation is to an ideal quantum computer. The common approach of completely characterizing the computation process via quantum process tomography requires an exponential amount of resources, and thus is not practical even for relatively small devices. We solve this problem by demonstrating that twirling experiments previously used to characterize the average fidelity of quantum memories efficiently can be easily adapted to estimate the average fidelity of the experimental implementation of important quantum computation processes, such as unitaries in the Clifford group, in a practical and efficient manner with applicability in current quantum devices. Using this procedure, we demonstrate state-of-the-art coherent control of an ensemble of magnetic moments of nuclear spins in a single crystal solid by implementing the encoding operation for a 3-qubit code with only a 1% degradation in average fidelity discounting preparation and measurement errors. We also highlight one of the advances that was instrumental in achieving such high fidelity control.
A Simple and Resource-efficient Setup for the Computer-aided Drug Design Laboratory.
Moretti, Loris; Sartori, Luca
2016-10-01
Undertaking modelling investigations for Computer-Aided Drug Design (CADD) requires a proper environment. In principle, this could be done on a single computer, but the reality of a drug discovery program requires robustness and high-throughput computing (HTC) to efficiently support the research. Therefore, a more capable alternative is needed but its implementation has no widespread solution. Here, the realization of such a computing facility is discussed, from general layout to technical details all aspects are covered. © 2016 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
PIMS: Memristor-Based Processing-in-Memory-and-Storage.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cook, Jeanine
Continued progress in computing has augmented the quest for higher performance with a new quest for higher energy efficiency. This has led to the re-emergence of Processing-In-Memory (PIM) ar- chitectures that offer higher density and performance with some boost in energy efficiency. Past PIM work either integrated a standard CPU with a conventional DRAM to improve the CPU- memory link, or used a bit-level processor with Single Instruction Multiple Data (SIMD) control, but neither matched the energy consumption of the memory to the computation. We originally proposed to develop a new architecture derived from PIM that more effectively addressed energymore » efficiency for high performance scientific, data analytics, and neuromorphic applications. We also originally planned to implement a von Neumann architecture with arithmetic/logic units (ALUs) that matched the power consumption of an advanced storage array to maximize energy efficiency. Implementing this architecture in storage was our original idea, since by augmenting storage (in- stead of memory), the system could address both in-memory computation and applications that accessed larger data sets directly from storage, hence Processing-in-Memory-and-Storage (PIMS). However, as our research matured, we discovered several things that changed our original direc- tion, the most important being that a PIM that implements a standard von Neumann-type archi- tecture results in significant energy efficiency improvement, but only about a O(10) performance improvement. In addition to this, the emergence of new memory technologies moved us to propos- ing a non-von Neumann architecture, called Superstrider, implemented not in storage, but in a new DRAM technology called High Bandwidth Memory (HBM). HBM is a stacked DRAM tech- nology that includes a logic layer where an architecture such as Superstrider could potentially be implemented.« less
A Bitslice Implementation of Anderson's Attack on A5/1
NASA Astrophysics Data System (ADS)
Bulavintsev, Vadim; Semenov, Alexander; Zaikin, Oleg; Kochemazov, Stepan
2018-03-01
The A5/1 keystream generator is a part of Global System for Mobile Communications (GSM) protocol, employed in cellular networks all over the world. Its cryptographic resistance was extensively analyzed in dozens of papers. However, almost all corresponding methods either employ a specific hardware or require an extensive preprocessing stage and significant amounts of memory. In the present study, a bitslice variant of Anderson's Attack on A5/1 is implemented. It requires very little computer memory and no preprocessing. Moreover, the attack can be made even more efficient by harnessing the computing power of modern Graphics Processing Units (GPUs). As a result, using commonly available GPUs this method can quite efficiently recover the secret key using only 64 bits of keystream. To test the performance of the implementation, a volunteer computing project was launched. 10 instances of A5/1 cryptanalysis have been successfully solved in this project in a single week.
PRESAGE: PRivacy-preserving gEnetic testing via SoftwAre Guard Extension.
Chen, Feng; Wang, Chenghong; Dai, Wenrui; Jiang, Xiaoqian; Mohammed, Noman; Al Aziz, Md Momin; Sadat, Md Nazmus; Sahinalp, Cenk; Lauter, Kristin; Wang, Shuang
2017-07-26
Advances in DNA sequencing technologies have prompted a wide range of genomic applications to improve healthcare and facilitate biomedical research. However, privacy and security concerns have emerged as a challenge for utilizing cloud computing to handle sensitive genomic data. We present one of the first implementations of Software Guard Extension (SGX) based securely outsourced genetic testing framework, which leverages multiple cryptographic protocols and minimal perfect hash scheme to enable efficient and secure data storage and computation outsourcing. We compared the performance of the proposed PRESAGE framework with the state-of-the-art homomorphic encryption scheme, as well as the plaintext implementation. The experimental results demonstrated significant performance over the homomorphic encryption methods and a small computational overhead in comparison to plaintext implementation. The proposed PRESAGE provides an alternative solution for secure and efficient genomic data outsourcing in an untrusted cloud by using a hybrid framework that combines secure hardware and multiple crypto protocols.
On optimal infinite impulse response edge detection filters
NASA Technical Reports Server (NTRS)
Sarkar, Sudeep; Boyer, Kim L.
1991-01-01
The authors outline the design of an optimal, computationally efficient, infinite impulse response edge detection filter. The optimal filter is computed based on Canny's high signal to noise ratio, good localization criteria, and a criterion on the spurious response of the filter to noise. An expression for the width of the filter, which is appropriate for infinite-length filters, is incorporated directly in the expression for spurious responses. The three criteria are maximized using the variational method and nonlinear constrained optimization. The optimal filter parameters are tabulated for various values of the filter performance criteria. A complete methodology for implementing the optimal filter using approximating recursive digital filtering is presented. The approximating recursive digital filter is separable into two linear filters operating in two orthogonal directions. The implementation is very simple and computationally efficient, has a constant time of execution for different sizes of the operator, and is readily amenable to real-time hardware implementation.
Krylov subspace methods on supercomputers
NASA Technical Reports Server (NTRS)
Saad, Youcef
1988-01-01
A short survey of recent research on Krylov subspace methods with emphasis on implementation on vector and parallel computers is presented. Conjugate gradient methods have proven very useful on traditional scalar computers, and their popularity is likely to increase as three-dimensional models gain importance. A conservative approach to derive effective iterative techniques for supercomputers has been to find efficient parallel/vector implementations of the standard algorithms. The main source of difficulty in the incomplete factorization preconditionings is in the solution of the triangular systems at each step. A few approaches consisting of implementing efficient forward and backward triangular solutions are described in detail. Polynomial preconditioning as an alternative to standard incomplete factorization techniques is also discussed. Another efficient approach is to reorder the equations so as to improve the structure of the matrix to achieve better parallelism or vectorization. An overview of these and other ideas and their effectiveness or potential for different types of architectures is given.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Biyikli, Emre; To, Albert C., E-mail: albertto@pitt.edu
Atomistic/continuum coupling methods combine accurate atomistic methods and efficient continuum methods to simulate the behavior of highly ordered crystalline systems. Coupled methods utilize the advantages of both approaches to simulate systems at a lower computational cost, while retaining the accuracy associated with atomistic methods. Many concurrent atomistic/continuum coupling methods have been proposed in the past; however, their true computational efficiency has not been demonstrated. The present work presents an efficient implementation of a concurrent coupling method called the Multiresolution Molecular Mechanics (MMM) for serial, parallel, and adaptive analysis. First, we present the features of the software implemented along with themore » associated technologies. The scalability of the software implementation is demonstrated, and the competing effects of multiscale modeling and parallelization are discussed. Then, the algorithms contributing to the efficiency of the software are presented. These include algorithms for eliminating latent ghost atoms from calculations and measurement-based dynamic balancing of parallel workload. The efficiency improvements made by these algorithms are demonstrated by benchmark tests. The efficiency of the software is found to be on par with LAMMPS, a state-of-the-art Molecular Dynamics (MD) simulation code, when performing full atomistic simulations. Speed-up of the MMM method is shown to be directly proportional to the reduction of the number of the atoms visited in force computation. Finally, an adaptive MMM analysis on a nanoindentation problem, containing over a million atoms, is performed, yielding an improvement of 6.3–8.5 times in efficiency, over the full atomistic MD method. For the first time, the efficiency of a concurrent atomistic/continuum coupling method is comprehensively investigated and demonstrated.« less
Implementation of Finite Rate Chemistry Capability in OVERFLOW
NASA Technical Reports Server (NTRS)
Olsen, M. E.; Venkateswaran, S.; Prabhu, D. K.
2004-01-01
An implementation of both finite rate and equilibrium chemistry have been completed for the OVERFLOW code, a chimera capable, complex geometry flow code widely used to predict transonic flow fields. The implementation builds on the computational efficiency and geometric generality of the solver.
Elastic Cloud Computing Architecture and System for Heterogeneous Spatiotemporal Computing
NASA Astrophysics Data System (ADS)
Shi, X.
2017-10-01
Spatiotemporal computation implements a variety of different algorithms. When big data are involved, desktop computer or standalone application may not be able to complete the computation task due to limited memory and computing power. Now that a variety of hardware accelerators and computing platforms are available to improve the performance of geocomputation, different algorithms may have different behavior on different computing infrastructure and platforms. Some are perfect for implementation on a cluster of graphics processing units (GPUs), while GPUs may not be useful on certain kind of spatiotemporal computation. This is the same situation in utilizing a cluster of Intel's many-integrated-core (MIC) or Xeon Phi, as well as Hadoop or Spark platforms, to handle big spatiotemporal data. Furthermore, considering the energy efficiency requirement in general computation, Field Programmable Gate Array (FPGA) may be a better solution for better energy efficiency when the performance of computation could be similar or better than GPUs and MICs. It is expected that an elastic cloud computing architecture and system that integrates all of GPUs, MICs, and FPGAs could be developed and deployed to support spatiotemporal computing over heterogeneous data types and computational problems.
Memristive Mixed-Signal Neuromorphic Systems: Energy-Efficient Learning at the Circuit-Level
Chakma, Gangotree; Adnan, Md Musabbir; Wyer, Austin R.; ...
2017-11-23
Neuromorphic computing is non-von Neumann computer architecture for the post Moore’s law era of computing. Since a main focus of the post Moore’s law era is energy-efficient computing with fewer resources and less area, neuromorphic computing contributes effectively in this research. Here in this paper, we present a memristive neuromorphic system for improved power and area efficiency. Our particular mixed-signal approach implements neural networks with spiking events in a synchronous way. Moreover, the use of nano-scale memristive devices saves both area and power in the system. We also provide device-level considerations that make the system more energy-efficient. The proposed systemmore » additionally includes synchronous digital long term plasticity, an online learning methodology that helps the system train the neural networks during the operation phase and improves the efficiency in learning considering the power consumption and area overhead.« less
Memristive Mixed-Signal Neuromorphic Systems: Energy-Efficient Learning at the Circuit-Level
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chakma, Gangotree; Adnan, Md Musabbir; Wyer, Austin R.
Neuromorphic computing is non-von Neumann computer architecture for the post Moore’s law era of computing. Since a main focus of the post Moore’s law era is energy-efficient computing with fewer resources and less area, neuromorphic computing contributes effectively in this research. Here in this paper, we present a memristive neuromorphic system for improved power and area efficiency. Our particular mixed-signal approach implements neural networks with spiking events in a synchronous way. Moreover, the use of nano-scale memristive devices saves both area and power in the system. We also provide device-level considerations that make the system more energy-efficient. The proposed systemmore » additionally includes synchronous digital long term plasticity, an online learning methodology that helps the system train the neural networks during the operation phase and improves the efficiency in learning considering the power consumption and area overhead.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, W Michael; Kohlmeyer, Axel; Plimpton, Steven J
The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are now becoming more prevalent due to these advantages. In this paper, we present a continuation of previous work implementing algorithms for using accelerators into the LAMMPS molecular dynamics software for distributed memory parallel hybrid machines. In our previous work, we focused on acceleration for short-range models with anmore » approach intended to harness the processing power of both the accelerator and (multi-core) CPUs. To augment the existing implementations, we present an efficient implementation of long-range electrostatic force calculation for molecular dynamics. Specifically, we present an implementation of the particle-particle particle-mesh method based on the work by Harvey and De Fabritiis. We present benchmark results on the Keeneland InfiniBand GPU cluster. We provide a performance comparison of the same kernels compiled with both CUDA and OpenCL. We discuss limitations to parallel efficiency and future directions for improving performance on hybrid or heterogeneous computers.« less
MUTILS - a set of efficient modeling tools for multi-core CPUs implemented in MEX
NASA Astrophysics Data System (ADS)
Krotkiewski, Marcin; Dabrowski, Marcin
2013-04-01
The need for computational performance is common in scientific applications, and in particular in numerical simulations, where high resolution models require efficient processing of large amounts of data. Especially in the context of geological problems the need to increase the model resolution to resolve physical and geometrical complexities seems to have no limits. Alas, the performance of new generations of CPUs does not improve any longer by simply increasing clock speeds. Current industrial trends are to increase the number of computational cores. As a result, parallel implementations are required in order to fully utilize the potential of new processors, and to study more complex models. We target simulations on small to medium scale shared memory computers: laptops and desktop PCs with ~8 CPU cores and up to tens of GB of memory to high-end servers with ~50 CPU cores and hundereds of GB of memory. In this setting MATLAB is often the environment of choice for scientists that want to implement their own models with little effort. It is a useful general purpose mathematical software package, but due to its versatility some of its functionality is not as efficient as it could be. In particular, the challanges of modern multi-core architectures are not fully addressed. We have developed MILAMIN 2 - an efficient FEM modeling environment written in native MATLAB. Amongst others, MILAMIN provides functions to define model geometry, generate and convert structured and unstructured meshes (also through interfaces to external mesh generators), compute element and system matrices, apply boundary conditions, solve the system of linear equations, address non-linear and transient problems, and perform post-processing. MILAMIN strives to combine the ease of code development and the computational efficiency. Where possible, the code is optimized and/or parallelized within the MATLAB framework. Native MATLAB is augmented with the MUTILS library - a set of MEX functions that implement the computationally intensive, performance critical parts of the code, which we have identified to be bottlenecks. Here, we discuss the functionality and performance of the MUTILS library. Currently, it includes: 1. time and memory efficient assembly of sparse matrices for FEM simulations 2. parallel sparse matrix - vector product with optimizations speficic to symmetric matrices and multiple degrees of freedom per node 3. parallel point in triangle location and point in tetrahedron location for unstructured, adaptive 2D and 3D meshes (useful for 'marker in cell' type of methods) 4. parallel FEM interpolation for 2D and 3D meshes of elements of different types and orders, and for different number of degrees of freedom per node 5. a stand-alone, MEX implementation of the Conjugate Gradients iterative solver 6. interface to METIS graph partitioning and a fast implementation of RCM reordering
Efficient Implementation of Multigrid Solvers on Message-Passing Parrallel Systems
NASA Technical Reports Server (NTRS)
Lou, John
1994-01-01
We discuss our implementation strategies for finite difference multigrid partial differential equation (PDE) solvers on message-passing systems. Our target parallel architecture is Intel parallel computers: the Delta and Paragon system.
NASA Astrophysics Data System (ADS)
Yang, Shuangming; Wei, Xile; Deng, Bin; Liu, Chen; Li, Huiyan; Wang, Jiang
2018-03-01
Balance between biological plausibility of dynamical activities and computational efficiency is one of challenging problems in computational neuroscience and neural system engineering. This paper proposes a set of efficient methods for the hardware realization of the conductance-based neuron model with relevant dynamics, targeting reproducing the biological behaviors with low-cost implementation on digital programmable platform, which can be applied in wide range of conductance-based neuron models. Modified GP neuron models for efficient hardware implementation are presented to reproduce reliable pallidal dynamics, which decode the information of basal ganglia and regulate the movement disorder related voluntary activities. Implementation results on a field-programmable gate array (FPGA) demonstrate that the proposed techniques and models can reduce the resource cost significantly and reproduce the biological dynamics accurately. Besides, the biological behaviors with weak network coupling are explored on the proposed platform, and theoretical analysis is also made for the investigation of biological characteristics of the structured pallidal oscillator and network. The implementation techniques provide an essential step towards the large-scale neural network to explore the dynamical mechanisms in real time. Furthermore, the proposed methodology enables the FPGA-based system a powerful platform for the investigation on neurodegenerative diseases and real-time control of bio-inspired neuro-robotics.
NASA Technical Reports Server (NTRS)
Arnold, Steven M; Bednarcyk, Brett; Aboydi, Jacob
2004-01-01
The High-Fidelity Generalized Method of Cells (HFGMC) micromechanics model has recently been reformulated by Bansal and Pindera (in the context of elastic phases with perfect bonding) to maximize its computational efficiency. This reformulated version of HFGMC has now been extended to include both inelastic phases and imperfect fiber-matrix bonding. The present paper presents an overview of the HFGMC theory in both its original and reformulated forms and a comparison of the results of the two implementations. The objective is to establish the correlation between the two HFGMC formulations and document the improved efficiency offered by the reformulation. The results compare the macro and micro scale predictions of the continuous reinforcement (doubly-periodic) and discontinuous reinforcement (triply-periodic) versions of both formulations into the inelastic regime, and, in the case of the discontinuous reinforcement version, with both perfect and weak interfacial bonding. The results demonstrate that identical predictions are obtained using either the original or reformulated implementations of HFGMC aside from small numerical differences in the inelastic regime due to the different implementation schemes used for the inelastic terms present in the two formulations. Finally, a direct comparison of execution times is presented for the original formulation and reformulation code implementations. It is shown that as the discretization employed in representing the composite repeating unit cell becomes increasingly refined (requiring a larger number of sub-volumes), the reformulated implementation becomes significantly (approximately an order of magnitude at best) more computationally efficient in both the continuous reinforcement (doubly-periodic) and discontinuous reinforcement (triply-periodic) cases.
Calvello, Simone; Piccardo, Matteo; Rao, Shashank Vittal; Soncini, Alessandro
2018-03-05
We have developed and implemented a new ab initio code, Ceres (Computational Emulator of Rare Earth Systems), completely written in C++11, which is dedicated to the efficient calculation of the electronic structure and magnetic properties of the crystal field states arising from the splitting of the ground state spin-orbit multiplet in lanthanide complexes. The new code gains efficiency via an optimized implementation of a direct configurational averaged Hartree-Fock (CAHF) algorithm for the determination of 4f quasi-atomic active orbitals common to all multi-electron spin manifolds contributing to the ground spin-orbit multiplet of the lanthanide ion. The new CAHF implementation is based on quasi-Newton convergence acceleration techniques coupled to an efficient library for the direct evaluation of molecular integrals, and problem-specific density matrix guess strategies. After describing the main features of the new code, we compare its efficiency with the current state-of-the-art ab initio strategy to determine crystal field levels and properties, and show that our methodology, as implemented in Ceres, represents a more time-efficient computational strategy for the evaluation of the magnetic properties of lanthanide complexes, also allowing a full representation of non-perturbative spin-orbit coupling effects. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures
Manolakos, Elias S.
2015-01-01
Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.
Sharma, Anuj; Manolakos, Elias S
2015-01-01
Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
Box schemes and their implementation on the iPSC/860
NASA Technical Reports Server (NTRS)
Chattot, J. J.; Merriam, M. L.
1991-01-01
Research on algoriths for efficiently solving fluid flow problems on massively parallel computers is continued in the present paper. Attention is given to the implementation of a box scheme on the iPSC/860, a massively parallel computer with a peak speed of 10 Gflops and a memory of 128 Mwords. A domain decomposition approach to parallelism is used.
Efficient algorithms and implementations of entropy-based moment closures for rarefied gases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schaerer, Roman Pascal, E-mail: schaerer@mathcces.rwth-aachen.de; Bansal, Pratyuksh; Torrilhon, Manuel
We present efficient algorithms and implementations of the 35-moment system equipped with the maximum-entropy closure in the context of rarefied gases. While closures based on the principle of entropy maximization have been shown to yield very promising results for moderately rarefied gas flows, the computational cost of these closures is in general much higher than for closure theories with explicit closed-form expressions of the closing fluxes, such as Grad's classical closure. Following a similar approach as Garrett et al. (2015) , we investigate efficient implementations of the computationally expensive numerical quadrature method used for the moment evaluations of the maximum-entropymore » distribution by exploiting its inherent fine-grained parallelism with the parallelism offered by multi-core processors and graphics cards. We show that using a single graphics card as an accelerator allows speed-ups of two orders of magnitude when compared to a serial CPU implementation. To accelerate the time-to-solution for steady-state problems, we propose a new semi-implicit time discretization scheme. The resulting nonlinear system of equations is solved with a Newton type method in the Lagrange multipliers of the dual optimization problem in order to reduce the computational cost. Additionally, fully explicit time-stepping schemes of first and second order accuracy are presented. We investigate the accuracy and efficiency of the numerical schemes for several numerical test cases, including a steady-state shock-structure problem.« less
CORDIC-based digital signal processing (DSP) element for adaptive signal processing
NASA Astrophysics Data System (ADS)
Bolstad, Gregory D.; Neeld, Kenneth B.
1995-04-01
The High Performance Adaptive Weight Computation (HAWC) processing element is a CORDIC based application specific DSP element that, when connected in a linear array, can perform extremely high throughput (100s of GFLOPS) matrix arithmetic operations on linear systems of equations in real time. In particular, it very efficiently performs the numerically intense computation of optimal least squares solutions for large, over-determined linear systems. Most techniques for computing solutions to these types of problems have used either a hard-wired, non-programmable systolic array approach, or more commonly, programmable DSP or microprocessor approaches. The custom logic methods can be efficient, but are generally inflexible. Approaches using multiple programmable generic DSP devices are very flexible, but suffer from poor efficiency and high computation latencies, primarily due to the large number of DSP devices that must be utilized to achieve the necessary arithmetic throughput. The HAWC processor is implemented as a highly optimized systolic array, yet retains some of the flexibility of a programmable data-flow system, allowing efficient implementation of algorithm variations. This provides flexible matrix processing capabilities that are one to three orders of magnitude less expensive and more dense than the current state of the art, and more importantly, allows a realizable solution to matrix processing problems that were previously considered impractical to physically implement. HAWC has direct applications in RADAR, SONAR, communications, and image processing, as well as in many other types of systems.
Wei, Jianing; Bouman, Charles A; Allebach, Jan P
2014-05-01
Many imaging applications require the implementation of space-varying convolution for accurate restoration and reconstruction of images. Here, we use the term space-varying convolution to refer to linear operators whose impulse response has slow spatial variation. In addition, these space-varying convolution operators are often dense, so direct implementation of the convolution operator is typically computationally impractical. One such example is the problem of stray light reduction in digital cameras, which requires the implementation of a dense space-varying deconvolution operator. However, other inverse problems, such as iterative tomographic reconstruction, can also depend on the implementation of dense space-varying convolution. While space-invariant convolution can be efficiently implemented with the fast Fourier transform, this approach does not work for space-varying operators. So direct convolution is often the only option for implementing space-varying convolution. In this paper, we develop a general approach to the efficient implementation of space-varying convolution, and demonstrate its use in the application of stray light reduction. Our approach, which we call matrix source coding, is based on lossy source coding of the dense space-varying convolution matrix. Importantly, by coding the transformation matrix, we not only reduce the memory required to store it; we also dramatically reduce the computation required to implement matrix-vector products. Our algorithm is able to reduce computation by approximately factoring the dense space-varying convolution operator into a product of sparse transforms. Experimental results show that our method can dramatically reduce the computation required for stray light reduction while maintaining high accuracy.
Diamond, Alan; Nowotny, Thomas; Schmuker, Michael
2016-01-01
Neuromorphic computing employs models of neuronal circuits to solve computing problems. Neuromorphic hardware systems are now becoming more widely available and “neuromorphic algorithms” are being developed. As they are maturing toward deployment in general research environments, it becomes important to assess and compare them in the context of the applications they are meant to solve. This should encompass not just task performance, but also ease of implementation, speed of processing, scalability, and power efficiency. Here, we report our practical experience of implementing a bio-inspired, spiking network for multivariate classification on three different platforms: the hybrid digital/analog Spikey system, the digital spike-based SpiNNaker system, and GeNN, a meta-compiler for parallel GPU hardware. We assess performance using a standard hand-written digit classification task. We found that whilst a different implementation approach was required for each platform, classification performances remained in line. This suggests that all three implementations were able to exercise the model's ability to solve the task rather than exposing inherent platform limits, although differences emerged when capacity was approached. With respect to execution speed and power consumption, we found that for each platform a large fraction of the computing time was spent outside of the neuromorphic device, on the host machine. Time was spent in a range of combinations of preparing the model, encoding suitable input spiking data, shifting data, and decoding spike-encoded results. This is also where a large proportion of the total power was consumed, most markedly for the SpiNNaker and Spikey systems. We conclude that the simulation efficiency advantage of the assessed specialized hardware systems is easily lost in excessive host-device communication, or non-neuronal parts of the computation. These results emphasize the need to optimize the host-device communication architecture for scalability, maximum throughput, and minimum latency. Moreover, our results indicate that special attention should be paid to minimize host-device communication when designing and implementing networks for efficient neuromorphic computing. PMID:26778950
An Efficient Solution Method for Multibody Systems with Loops Using Multiple Processors
NASA Technical Reports Server (NTRS)
Ghosh, Tushar K.; Nguyen, Luong A.; Quiocho, Leslie J.
2015-01-01
This paper describes a multibody dynamics algorithm formulated for parallel implementation on multiprocessor computing platforms using the divide-and-conquer approach. The system of interest is a general topology of rigid and elastic articulated bodies with or without loops. The algorithm divides the multibody system into a number of smaller sets of bodies in chain or tree structures, called "branches" at convenient joints called "connection points", and uses an Order-N (O (N)) approach to formulate the dynamics of each branch in terms of the unknown spatial connection forces. The equations of motion for the branches, leaving the connection forces as unknowns, are implemented in separate processors in parallel for computational efficiency, and the equations for all the unknown connection forces are synthesized and solved in one or several processors. The performances of two implementations of this divide-and-conquer algorithm in multiple processors are compared with an existing method implemented on a single processor.
Synthetic analog computation in living cells.
Daniel, Ramiz; Rubens, Jacob R; Sarpeshkar, Rahul; Lu, Timothy K
2013-05-30
A central goal of synthetic biology is to achieve multi-signal integration and processing in living cells for diagnostic, therapeutic and biotechnology applications. Digital logic has been used to build small-scale circuits, but other frameworks may be needed for efficient computation in the resource-limited environments of cells. Here we demonstrate that synthetic analog gene circuits can be engineered to execute sophisticated computational functions in living cells using just three transcription factors. Such synthetic analog gene circuits exploit feedback to implement logarithmically linear sensing, addition, ratiometric and power-law computations. The circuits exhibit Weber's law behaviour as in natural biological systems, operate over a wide dynamic range of up to four orders of magnitude and can be designed to have tunable transfer functions. Our circuits can be composed to implement higher-order functions that are well described by both intricate biochemical models and simple mathematical functions. By exploiting analog building-block functions that are already naturally present in cells, this approach efficiently implements arithmetic operations and complex functions in the logarithmic domain. Such circuits may lead to new applications for synthetic biology and biotechnology that require complex computations with limited parts, need wide-dynamic-range biosensing or would benefit from the fine control of gene expression.
NASA Astrophysics Data System (ADS)
Leamy, Michael J.; Springer, Adam C.
In this research we report parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straight-forward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to computer clustered computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Trędak, Przemysław, E-mail: przemyslaw.tredak@fuw.edu.pl; Rudnicki, Witold R.; Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5a, 02-106 Warsaw
The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range hydrocarbon materials. It is also extensible to other atom types and interactions. REBO potential assumes complex multi-body interaction model, that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of molecular dynamics algorithm using REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPUmore » to solve difficult synchronizations issues that arise in computations of multi-body potential. Techniques developed for this problem may be also used to achieve efficient solutions of different problems. The performance of proposed algorithm is assessed using a range of model systems. It is compared to highly optimized CPU implementation (both single core and OpenMP) available in LAMMPS package. These experiments show up to 6x improvement in forces computation time using single processor of the NVIDIA Tesla K80 compared to high end 16-core Intel Xeon processor.« less
Near Theoretical Gigabit Link Efficiency for Distributed Data Acquisition Systems
Abu-Nimeh, Faisal T.; Choong, Woon-Seng
2017-01-01
Link efficiency, data integrity, and continuity for high-throughput and real-time systems is crucial. Most of these applications require specialized hardware and operating systems as well as extensive tuning in order to achieve high efficiency. Here, we present an implementation of gigabit Ethernet data streaming which can achieve 99.26% link efficiency while maintaining no packet losses. The design and implementation are built on OpenPET, an opensource data acquisition platform for nuclear medical imaging, where (a) a crate hosting multiple OpenPET detector boards uses a User Datagram Protocol over Internet Protocol (UDP/IP) Ethernet soft-core, that is capable of understanding PAUSE frames, to stream data out to a computer workstation; (b) the receiving computer uses Netmap to allow the processing software (i.e., user space), which is written in Python, to directly receive and manage the network card’s ring buffers, bypassing the operating system kernel’s networking stack; and (c) a multi-threaded application using synchronized queues is implemented in the processing software (Python) to free up the ring buffers as quickly as possible while preserving data integrity and flow continuity. PMID:28630948
Near Theoretical Gigabit Link Efficiency for Distributed Data Acquisition Systems.
Abu-Nimeh, Faisal T; Choong, Woon-Seng
2017-03-01
Link efficiency, data integrity, and continuity for high-throughput and real-time systems is crucial. Most of these applications require specialized hardware and operating systems as well as extensive tuning in order to achieve high efficiency. Here, we present an implementation of gigabit Ethernet data streaming which can achieve 99.26% link efficiency while maintaining no packet losses. The design and implementation are built on OpenPET, an opensource data acquisition platform for nuclear medical imaging, where (a) a crate hosting multiple OpenPET detector boards uses a User Datagram Protocol over Internet Protocol (UDP/IP) Ethernet soft-core, that is capable of understanding PAUSE frames, to stream data out to a computer workstation; (b) the receiving computer uses Netmap to allow the processing software (i.e., user space), which is written in Python, to directly receive and manage the network card's ring buffers, bypassing the operating system kernel's networking stack; and (c) a multi-threaded application using synchronized queues is implemented in the processing software (Python) to free up the ring buffers as quickly as possible while preserving data integrity and flow continuity.
Implementation and evaluation of ILLIAC 4 algorithms for multispectral image processing
NASA Technical Reports Server (NTRS)
Swain, P. H.
1974-01-01
Data concerning a multidisciplinary and multi-organizational effort to implement multispectral data analysis algorithms on a revolutionary computer, the Illiac 4, are reported. The effectiveness and efficiency of implementing the digital multispectral data analysis techniques for producing useful land use classifications from satellite collected data were demonstrated.
Efficient quantum circuits for one-way quantum computing.
Tanamoto, Tetsufumi; Liu, Yu-Xi; Hu, Xuedong; Nori, Franco
2009-03-13
While Ising-type interactions are ideal for implementing controlled phase flip gates in one-way quantum computing, natural interactions between solid-state qubits are most often described by either the XY or the Heisenberg models. We show an efficient way of generating cluster states directly using either the imaginary SWAP (iSWAP) gate for the XY model, or the sqrt[SWAP] gate for the Heisenberg model. Our approach thus makes one-way quantum computing more feasible for solid-state devices.
An emulator for minimizing finite element analysis implementation resources
NASA Technical Reports Server (NTRS)
Melosh, R. J.; Utku, S.; Salama, M.; Islam, M.
1982-01-01
A finite element analysis emulator providing a basis for efficiently establishing an optimum computer implementation strategy when many calculations are involved is described. The SCOPE emulator determines computer resources required as a function of the structural model, structural load-deflection equation characteristics, the storage allocation plan, and computer hardware capabilities. Thereby, it provides data for trading analysis implementation options to arrive at a best strategy. The models contained in SCOPE lead to micro-operation computer counts of each finite element operation as well as overall computer resource cost estimates. Application of SCOPE to the Memphis-Arkansas bridge analysis provides measures of the accuracy of resource assessments. Data indicate that predictions are within 17.3 percent for calculation times and within 3.2 percent for peripheral storage resources for the ELAS code.
Efficient architecture for spike sorting in reconfigurable hardware.
Hwang, Wen-Jyi; Lee, Wei-Hao; Lin, Shiow-Jyu; Lai, Sheng-Ying
2013-11-01
This paper presents a novel hardware architecture for fast spike sorting. The architecture is able to perform both the feature extraction and clustering in hardware. The generalized Hebbian algorithm (GHA) and fuzzy C-means (FCM) algorithm are used for feature extraction and clustering, respectively. The employment of GHA allows efficient computation of principal components for subsequent clustering operations. The FCM is able to achieve near optimal clustering for spike sorting. Its performance is insensitive to the selection of initial cluster centers. The hardware implementations of GHA and FCM feature low area costs and high throughput. In the GHA architecture, the computation of different weight vectors share the same circuit for lowering the area costs. Moreover, in the FCM hardware implementation, the usual iterative operations for updating the membership matrix and cluster centroid are merged into one single updating process to evade the large storage requirement. To show the effectiveness of the circuit, the proposed architecture is physically implemented by field programmable gate array (FPGA). It is embedded in a System-on-Chip (SOC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient spike sorting design for attaining high classification correct rate and high speed computation.
Li, Chuan; Li, Lin; Zhang, Jie; Alexov, Emil
2012-01-01
The Gauss-Seidel method is a standard iterative numerical method widely used to solve a system of equations and, in general, is more efficient comparing to other iterative methods, such as the Jacobi method. However, standard implementation of the Gauss-Seidel method restricts its utilization in parallel computing due to its requirement of using updated neighboring values (i.e., in current iteration) as soon as they are available. Here we report an efficient and exact (not requiring assumptions) method to parallelize iterations and to reduce the computational time as a linear/nearly linear function of the number of CPUs. In contrast to other existing solutions, our method does not require any assumptions and is equally applicable for solving linear and nonlinear equations. This approach is implemented in the DelPhi program, which is a finite difference Poisson-Boltzmann equation solver to model electrostatics in molecular biology. This development makes the iterative procedure on obtaining the electrostatic potential distribution in the parallelized DelPhi several folds faster than that in the serial code. Further we demonstrate the advantages of the new parallelized DelPhi by computing the electrostatic potential and the corresponding energies of large supramolecular structures. PMID:22674480
Rodríguez, Manuel; Magdaleno, Eduardo; Pérez, Fernando; García, Cristhian
2017-03-28
Non-equispaced Fast Fourier transform (NFFT) is a very important algorithm in several technological and scientific areas such as synthetic aperture radar, computational photography, medical imaging, telecommunications, seismic analysis and so on. However, its computation complexity is high. In this paper, we describe an efficient NFFT implementation with a hardware coprocessor using an All-Programmable System-on-Chip (APSoC). This is a hybrid device that employs an Advanced RISC Machine (ARM) as Processing System with Programmable Logic for high-performance digital signal processing through parallelism and pipeline techniques. The algorithm has been coded in C language with pragma directives to optimize the architecture of the system. We have used the very novel Software Develop System-on-Chip (SDSoC) evelopment tool that simplifies the interface and partitioning between hardware and software. This provides shorter development cycles and iterative improvements by exploring several architectures of the global system. The computational results shows that hardware acceleration significantly outperformed the software based implementation.
Rodríguez, Manuel; Magdaleno, Eduardo; Pérez, Fernando; García, Cristhian
2017-01-01
Non-equispaced Fast Fourier transform (NFFT) is a very important algorithm in several technological and scientific areas such as synthetic aperture radar, computational photography, medical imaging, telecommunications, seismic analysis and so on. However, its computation complexity is high. In this paper, we describe an efficient NFFT implementation with a hardware coprocessor using an All-Programmable System-on-Chip (APSoC). This is a hybrid device that employs an Advanced RISC Machine (ARM) as Processing System with Programmable Logic for high-performance digital signal processing through parallelism and pipeline techniques. The algorithm has been coded in C language with pragma directives to optimize the architecture of the system. We have used the very novel Software Develop System-on-Chip (SDSoC) evelopment tool that simplifies the interface and partitioning between hardware and software. This provides shorter development cycles and iterative improvements by exploring several architectures of the global system. The computational results shows that hardware acceleration significantly outperformed the software based implementation. PMID:28350358
Hrdá, Marcela; Kulich, Tomáš; Repiský, Michal; Noga, Jozef; Malkina, Olga L; Malkin, Vladimir G
2014-09-05
A recently developed Thouless-expansion-based diagonalization-free approach for improving the efficiency of self-consistent field (SCF) methods (Noga and Šimunek, J. Chem. Theory Comput. 2010, 6, 2706) has been adapted to the four-component relativistic scheme and implemented within the program package ReSpect. In addition to the implementation, the method has been thoroughly analyzed, particularly with respect to cases for which it is difficult or computationally expensive to find a good initial guess. Based on this analysis, several modifications of the original algorithm, refining its stability and efficiency, are proposed. To demonstrate the robustness and efficiency of the improved algorithm, we present the results of four-component diagonalization-free SCF calculations on several heavy-metal complexes, the largest of which contains more than 80 atoms (about 6000 4-spinor basis functions). The diagonalization-free procedure is about twice as fast as the corresponding diagonalization. Copyright © 2014 Wiley Periodicals, Inc.
NASA Astrophysics Data System (ADS)
Liu, Jie; Thiel, Walter
2018-04-01
We present an efficient implementation of configuration interaction with single excitations (CIS) for semiempirical orthogonalization-corrected OMx methods and standard modified neglect of diatomic overlap (MNDO)-type methods for the computation of vertical excitation energies as well as analytical gradients and nonadiabatic couplings. This CIS implementation is combined with Tully's fewest switches algorithm to enable surface hopping simulations of excited-state nonadiabatic dynamics. We introduce an accurate and efficient expression for the semiempirical evaluation of nonadiabatic couplings, which offers a significant speedup for medium-size molecules and is suitable for use in long nonadiabatic dynamics runs. As a pilot application, the semiempirical CIS implementation is employed to investigate ultrafast energy transfer processes in a phenylene ethynylene dendrimer model.
Liu, Jie; Thiel, Walter
2018-04-21
We present an efficient implementation of configuration interaction with single excitations (CIS) for semiempirical orthogonalization-corrected OMx methods and standard modified neglect of diatomic overlap (MNDO)-type methods for the computation of vertical excitation energies as well as analytical gradients and nonadiabatic couplings. This CIS implementation is combined with Tully's fewest switches algorithm to enable surface hopping simulations of excited-state nonadiabatic dynamics. We introduce an accurate and efficient expression for the semiempirical evaluation of nonadiabatic couplings, which offers a significant speedup for medium-size molecules and is suitable for use in long nonadiabatic dynamics runs. As a pilot application, the semiempirical CIS implementation is employed to investigate ultrafast energy transfer processes in a phenylene ethynylene dendrimer model.
SIMD Optimization of Linear Expressions for Programmable Graphics Hardware
Bajaj, Chandrajit; Ihm, Insung; Min, Jungki; Oh, Jinsang
2009-01-01
The increased programmability of graphics hardware allows efficient graphical processing unit (GPU) implementations of a wide range of general computations on commodity PCs. An important factor in such implementations is how to fully exploit the SIMD computing capacities offered by modern graphics processors. Linear expressions in the form of ȳ = Ax̄ + b̄, where A is a matrix, and x̄, ȳ and b̄ are vectors, constitute one of the most basic operations in many scientific computations. In this paper, we propose a SIMD code optimization technique that enables efficient shader codes to be generated for evaluating linear expressions. It is shown that performance can be improved considerably by efficiently packing arithmetic operations into four-wide SIMD instructions through reordering of the operations in linear expressions. We demonstrate that the presented technique can be used effectively for programming both vertex and pixel shaders for a variety of mathematical applications, including integrating differential equations and solving a sparse linear system of equations using iterative methods. PMID:19946569
An Efficient Scheme for Updating Sparse Cholesky Factors
NASA Technical Reports Server (NTRS)
Raghavan, Padma
2002-01-01
Raghavan had earlier developed the software package DCSPACK which can be used for solving sparse linear systems where the coefficient matrix is symmetric and positive definite (this project was not funded by NASA but by agencies such as NSF). DSCPACK-S is the serial code and DSCPACK-P is a parallel implementation suitable for multiprocessors or networks-of-workstations with message passing using MCI. The main algorithm used is the Cholesky factorization of a sparse symmetric positive positive definite matrix A = LL(T). The code can also compute the factorization A = LDL(T). The complexity of the software arises from several factors relating to the sparsity of the matrix A. A sparse N x N matrix A has typically less that cN nonzeroes where c is a small constant. If the matrix were dense, it would have O(N2) nonzeroes. The most complicated part of such sparse Cholesky factorization relates to fill-in, i.e., zeroes in the original matrix that become nonzeroes in the factor L. An efficient implementation depends to a large extent on complex data structures and on techniques from graph theory to reduce, identify, and manage fill. DSCPACK is based on an efficient multifrontal implementation with fill-managing algorithms and implementation arising from earlier research by Raghavan and others. Sparse Cholesky factorization is typically a four step process: (1) ordering to compute a fill-reducing numbering, (2) symbolic factorization to determine the nonzero structure of L, (3) numeric factorization to compute L, and, (4) triangular solution to solve L(T)x = y and Ly = b. The first two steps are symbolic and are performed using the graph of the matrix. The numeric factorization step is of dominant cost and there are several schemes for improving performance by exploiting the nested and dense structure of groups of columns in the factor. The latter are aimed at better utilization of the cache-memory hierarchy on modem processors to prevent cache-misses and provide execution rates (operations/second) that are close to the peak rates for dense matrix computations. Currently, EPISCOPACY is being used in an application at NASA directed by J. Newman and M. James. We propose the implementation of efficient schemes for updating the LL(T) or LDL(T) factors computed in DSCPACK-S to meet the computational requirements of their project. A brief description is provided in the next section.
Lattice Boltzmann Method for 3-D Flows with Curved Boundary
NASA Technical Reports Server (NTRS)
Mei, Renwei; Shyy, Wei; Yu, Dazhi; Luo, Li-Shi
2002-01-01
In this work, we investigate two issues that are important to computational efficiency and reliability in fluid dynamics applications of the lattice, Boltzmann equation (LBE): (1) Computational stability and accuracy of different lattice Boltzmann models and (2) the treatment of the boundary conditions on curved solid boundaries and their 3-D implementations. Three athermal 3-D LBE models (D3QI5, D3Ql9, and D3Q27) are studied and compared in terms of efficiency, accuracy, and robustness. The boundary treatment recently developed by Filippova and Hanel and Met et al. in 2-D is extended to and implemented for 3-D. The convergence, stability, and computational efficiency of the 3-D LBE models with the boundary treatment for curved boundaries were tested in simulations of four 3-D flows: (1) Fully developed flows in a square duct, (2) flow in a 3-D lid-driven cavity, (3) fully developed flows in a circular pipe, and (4) a uniform flow over a sphere. We found that while the fifteen-velocity 3-D (D3Ql5) model is more prone to numerical instability and the D3Q27 is more computationally intensive, the 63Q19 model provides a balance between computational reliability and efficiency. Through numerical simulations, we demonstrated that the boundary treatment for 3-D arbitrary curved geometry has second-order accuracy and possesses satisfactory stability characteristics.
Development of the Tensoral Computer Language
NASA Technical Reports Server (NTRS)
Ferziger, Joel; Dresselhaus, Eliot
1996-01-01
The research scientist or engineer wishing to perform large scale simulations or to extract useful information from existing databases is required to have expertise in the details of the particular database, the numerical methods and the computer architecture to be used. This poses a significant practical barrier to the use of simulation data. The goal of this research was to develop a high-level computer language called Tensoral, designed to remove this barrier. The Tensoral language provides a framework in which efficient generic data manipulations can be easily coded and implemented. First of all, Tensoral is general. The fundamental objects in Tensoral represent tensor fields and the operators that act on them. The numerical implementation of these tensors and operators is completely and flexibly programmable. New mathematical constructs and operators can be easily added to the Tensoral system. Tensoral is compatible with existing languages. Tensoral tensor operations co-exist in a natural way with a host language, which may be any sufficiently powerful computer language such as Fortran, C, or Vectoral. Tensoral is very-high-level. Tensor operations in Tensoral typically act on entire databases (i.e., arrays) at one time and may, therefore, correspond to many lines of code in a conventional language. Tensoral is efficient. Tensoral is a compiled language. Database manipulations are simplified optimized and scheduled by the compiler eventually resulting in efficient machine code to implement them.
Facilitating NASA Earth Science Data Processing Using Nebula Cloud Computing
NASA Technical Reports Server (NTRS)
Pham, Long; Chen, Aijun; Kempler, Steven; Lynnes, Christopher; Theobald, Michael; Asghar, Esfandiari; Campino, Jane; Vollmer, Bruce
2011-01-01
Cloud Computing has been implemented in several commercial arenas. The NASA Nebula Cloud Computing platform is an Infrastructure as a Service (IaaS) built in 2008 at NASA Ames Research Center and 2010 at GSFC. Nebula is an open source Cloud platform intended to: a) Make NASA realize significant cost savings through efficient resource utilization, reduced energy consumption, and reduced labor costs. b) Provide an easier way for NASA scientists and researchers to efficiently explore and share large and complex data sets. c) Allow customers to provision, manage, and decommission computing capabilities on an as-needed bases
Efficient convolutional sparse coding
Wohlberg, Brendt
2017-06-20
Computationally efficient algorithms may be applied for fast dictionary learning solving the convolutional sparse coding problem in the Fourier domain. More specifically, efficient convolutional sparse coding may be derived within an alternating direction method of multipliers (ADMM) framework that utilizes fast Fourier transforms (FFT) to solve the main linear system in the frequency domain. Such algorithms may enable a significant reduction in computational cost over conventional approaches by implementing a linear solver for the most critical and computationally expensive component of the conventional iterative algorithm. The theoretical computational cost of the algorithm may be reduced from O(M.sup.3N) to O(MN log N), where N is the dimensionality of the data and M is the number of elements in the dictionary. This significant improvement in efficiency may greatly increase the range of problems that can practically be addressed via convolutional sparse representations.
A hybrid architecture for the implementation of the Athena neural net model
NASA Technical Reports Server (NTRS)
Koutsougeras, C.; Papachristou, C.
1989-01-01
The implementation of an earlier introduced neural net model for pattern classification is considered. Data flow principles are employed in the development of a machine that efficiently implements the model and can be useful for real time classification tasks. Further enhancement with optical computing structures is also considered.
DiClemente, Ralph J; Bradley, Erin; Davis, Teaniese L; Brown, Jennifer L; Ukuku, Mary; Sales, Jessica M; Rose, Eve S; Wingood, Gina M
2013-06-01
Although group-delivered HIV/sexually transmitted disease (STD) risk-reduction interventions for African American adolescent females have proven efficacious, they require significant financial and staffing resources to implement and may not be feasible in personnel- and resource-constrained public health clinics. We conducted a study assessing adoption and implementation of an evidence-based HIV/STD risk-reduction intervention that was translated from a group-delivered modality to a computer-delivered modality to facilitate use in county public health departments. Usage of the computer-delivered intervention was low across 8 participating public health clinics. Further investigation is needed to optimize implementation by identifying, understanding, and surmounting barriers that hamper timely and efficient implementation of technology-delivered HIV/STD risk-reduction interventions in county public health clinics.
DiClemente, Ralph J.; Bradley, Erin; Davis, Teaniese L.; Brown, Jennifer L.; Ukuku, Mary; Sales, Jessica M.; Rose, Eve S.; Wingood, Gina M.
2013-01-01
Although group-delivered HIV/STD risk-reduction interventions for African American adolescent females have proven efficacious, they require significant financial and staffing resources to implement and may not be feasible in personnel- and resource-constrained public health clinics. We conducted a study assessing adoption and implementation of an evidence-based HIV/STD risk-reduction intervention that was translated from a group-delivered modality to a computer-delivered modality to facilitate use in county public health departments. Usage of the computer-delivered intervention was low across eight participating public health clinics. Further investigation is needed to optimize implementation by identifying, understanding and surmounting barriers that hamper timely and efficient implementation of technology-delivered HIV/STD risk-reduction interventions in county public health clinics. PMID:23673891
Computational efficiency for the surface renewal method
NASA Astrophysics Data System (ADS)
Kelley, Jason; Higgins, Chad
2018-04-01
Measuring surface fluxes using the surface renewal (SR) method requires programmatic algorithms for tabulation, algebraic calculation, and data quality control. A number of different methods have been published describing automated calibration of SR parameters. Because the SR method utilizes high-frequency (10 Hz+) measurements, some steps in the flux calculation are computationally expensive, especially when automating SR to perform many iterations of these calculations. Several new algorithms were written that perform the required calculations more efficiently and rapidly, and that tested for sensitivity to length of flux averaging period, ability to measure over a large range of lag timescales, and overall computational efficiency. These algorithms utilize signal processing techniques and algebraic simplifications that demonstrate simple modifications that dramatically improve computational efficiency. The results here complement efforts by other authors to standardize a robust and accurate computational SR method. Increased speed of computation time grants flexibility to implementing the SR method, opening new avenues for SR to be used in research, for applied monitoring, and in novel field deployments.
All-optical reservoir computer based on saturation of absorption.
Dejonckheere, Antoine; Duport, François; Smerieri, Anteo; Fang, Li; Oudar, Jean-Louis; Haelterman, Marc; Massar, Serge
2014-05-05
Reservoir computing is a new bio-inspired computation paradigm. It exploits a dynamical system driven by a time-dependent input to carry out computation. For efficient information processing, only a few parameters of the reservoir needs to be tuned, which makes it a promising framework for hardware implementation. Recently, electronic, opto-electronic and all-optical experimental reservoir computers were reported. In those implementations, the nonlinear response of the reservoir is provided by active devices such as optoelectronic modulators or optical amplifiers. By contrast, we propose here the first reservoir computer based on a fully passive nonlinearity, namely the saturable absorption of a semiconductor mirror. Our experimental setup constitutes an important step towards the development of ultrafast low-consumption analog computers.
An Application Development Platform for Neuromorphic Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dean, Mark; Chan, Jason; Daffron, Christopher
2016-01-01
Dynamic Adaptive Neural Network Arrays (DANNAs) are neuromorphic computing systems developed as a hardware based approach to the implementation of neural networks. They feature highly adaptive and programmable structural elements, which model arti cial neural networks with spiking behavior. We design them to solve problems using evolutionary optimization. In this paper, we highlight the current hardware and software implementations of DANNA, including their features, functionalities and performance. We then describe the development of an Application Development Platform (ADP) to support efficient application implementation and testing of DANNA based solutions. We conclude with future directions.
NASA Astrophysics Data System (ADS)
Larger, Laurent; Baylón-Fuentes, Antonio; Martinenghi, Romain; Udaltsov, Vladimir S.; Chembo, Yanne K.; Jacquot, Maxime
2017-01-01
Reservoir computing, originally referred to as an echo state network or a liquid state machine, is a brain-inspired paradigm for processing temporal information. It involves learning a "read-out" interpretation for nonlinear transients developed by high-dimensional dynamics when the latter is excited by the information signal to be processed. This novel computational paradigm is derived from recurrent neural network and machine learning techniques. It has recently been implemented in photonic hardware for a dynamical system, which opens the path to ultrafast brain-inspired computing. We report on a novel implementation involving an electro-optic phase-delay dynamics designed with off-the-shelf optoelectronic telecom devices, thus providing the targeted wide bandwidth. Computational efficiency is demonstrated experimentally with speech-recognition tasks. State-of-the-art speed performances reach one million words per second, with very low word error rate. Additionally, to record speed processing, our investigations have revealed computing-efficiency improvements through yet-unexplored temporal-information-processing techniques, such as simultaneous multisample injection and pitched sampling at the read-out compared to information "write-in".
NASA Astrophysics Data System (ADS)
Papasotiriou, P. J.; Geroyannis, V. S.
We implement Hartle's perturbation method to the computation of relativistic rigidly rotating neutron star models. The program has been written in SCILAB (© INRIA ENPC), a matrix-oriented high-level programming language. The numerical method is described in very detail and is applied to many models in slow or fast rotation. We show that, although the method is perturbative, it gives accurate results for all practical purposes and it should prove an efficient tool for computing rapidly rotating pulsars.
Design and implementation of the one-step MSD adder of optical computer.
Song, Kai; Yan, Liping
2012-03-01
On the basis of the symmetric encoding algorithm for the modified signed-digit (MSD), a 7*7 truth table that can be realized with optical methods was developed. And based on the truth table, the optical path structures and circuit implementations of the one-step MSD adder of ternary optical computer (TOC) were designed. Experiments show that the scheme is correct, feasible, and efficient. © 2012 Optical Society of America
Electrooptical adaptive switching network for the hypercube computer
NASA Technical Reports Server (NTRS)
Chow, E.; Peterson, J.
1988-01-01
An all-optical network design for the hyperswitch network using regular free-space interconnects between electronic processor nodes is presented. The adaptive routing model used is described, and an adaptive routing control example is presented. The design demonstrates that existing electrooptical techniques are sufficient for implementing efficient parallel architectures without the need for more complex means of implementing arbitrary interconnection schemes. The electrooptical hyperswitch network significantly improves the communication performance of the hypercube computer.
Fault Tolerant Parallel Implementations of Iterative Algorithms for Optimal Control Problems
1988-01-21
p/.V)] steps, but did not discuss any specific parallel implementation. Gajski [51 improved upon this result by performing the SIMD computation in...N = p2. our approach reduces to that of [51, except that Gajski presents the coefficient computation and partial solution phases as a single...8217>. the SIMD algo- rithm presented by Gajski [5] can be most efficiently mapped to a unidirec- tional ring network with broadcasting capability. Based
NASA software specification and evaluation system: Software verification/validation techniques
NASA Technical Reports Server (NTRS)
1977-01-01
NASA software requirement specifications were used in the development of a system for validating and verifying computer programs. The software specification and evaluation system (SSES) provides for the effective and efficient specification, implementation, and testing of computer software programs. The system as implemented will produce structured FORTRAN or ANSI FORTRAN programs, but the principles upon which SSES is designed allow it to be easily adapted to other high order languages.
Nguyen, Tuan-Anh; Nakib, Amir; Nguyen, Huy-Nam
2016-06-01
The Non-local means denoising filter has been established as gold standard for image denoising problem in general and particularly in medical imaging due to its efficiency. However, its computation time limited its applications in real world application, especially in medical imaging. In this paper, a distributed version on parallel hybrid architecture is proposed to solve the computation time problem and a new method to compute the filters' coefficients is also proposed, where we focused on the implementation and the enhancement of filters' parameters via taking the neighborhood of the current voxel more accurately into account. In terms of implementation, our key contribution consists in reducing the number of shared memory accesses. The different tests of the proposed method were performed on the brain-web database for different levels of noise. Performances and the sensitivity were quantified in terms of speedup, peak signal to noise ratio, execution time, the number of floating point operations. The obtained results demonstrate the efficiency of the proposed method. Moreover, the implementation is compared to that of other techniques, recently published in the literature. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Sybil--efficient constraint-based modelling in R.
Gelius-Dietrich, Gabriel; Desouki, Abdelmoneim Amer; Fritzemeier, Claus Jonathan; Lercher, Martin J
2013-11-13
Constraint-based analyses of metabolic networks are widely used to simulate the properties of genome-scale metabolic networks. Publicly available implementations tend to be slow, impeding large scale analyses such as the genome-wide computation of pairwise gene knock-outs, or the automated search for model improvements. Furthermore, available implementations cannot easily be extended or adapted by users. Here, we present sybil, an open source software library for constraint-based analyses in R; R is a free, platform-independent environment for statistical computing and graphics that is widely used in bioinformatics. Among other functions, sybil currently provides efficient methods for flux-balance analysis (FBA), MOMA, and ROOM that are about ten times faster than previous implementations when calculating the effect of whole-genome single gene deletions in silico on a complete E. coli metabolic model. Due to the object-oriented architecture of sybil, users can easily build analysis pipelines in R or even implement their own constraint-based algorithms. Based on its highly efficient communication with different mathematical optimisation programs, sybil facilitates the exploration of high-dimensional optimisation problems on small time scales. Sybil and all its dependencies are open source. Sybil and its documentation are available for download from the comprehensive R archive network (CRAN).
Oryspayev, Dossay; Aktulga, Hasan Metin; Sosonkina, Masha; ...
2015-07-14
In this article, sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi-core architectures and we use messaging passing interface + open multiprocessing hybrid programming model for parallelism. We analyze the performance of recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We also study important featuresmore » of this implementation and compare with previously reported implementations that do not exploit underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the "CPU core hours" metric, which is the main measure of resource usage on supercomputers. We analyze the effects of topology-aware mapping heuristic using simplified network load model. Furthermore, we have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topology. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the "CPU core hours" metric and significantly reduces data movement overheads.« less
A Programmable Five Qubit Quantum Computer Using Trapped Atomic Ions
NASA Astrophysics Data System (ADS)
Debnath, Shantanu
Quantum computers can solve certain problems more efficiently compared to conventional classical methods. In the endeavor to build a quantum computer, several competing platforms have emerged that can implement certain quantum algorithms using a few qubits. However, the demonstrations so far have been done usually by tailoring the hardware to meet the requirements of a particular algorithm implemented for a limited number of instances. Although such proof of principal implementations are important to verify the working of algorithms on a physical system, they further need to have the potential to serve as a general purpose quantum computer allowing the flexibility required for running multiple algorithms and be scaled up to host more qubits. Here we demonstrate a small programmable quantum computer based on five trapped atomic ions each of which serves as a qubit. By optically resolving each ion we can individually address them in order to perform a complete set of single-qubit and fully connected two-qubit quantum gates and alsoperform efficient individual qubit measurements. We implement a computation architecture that accepts an algorithm from a user interface in the form of a standard logic gate sequence and decomposes it into fundamental quantum operations that are native to the hardware using a set of compilation instructions that are defined within the software. These operations are then effected through a pattern of laser pulses that perform coherent rotations on targeted qubits in the chain. The architecture implemented in the experiment therefore gives us unprecedented flexibility in the programming of any quantum algorithm while staying blind to the underlying hardware. As a demonstration we implement the Deutsch-Jozsa and Bernstein-Vazirani algorithms on the five-qubit processor and achieve average success rates of 95 and 90 percent, respectively. We also implement a five-qubit coherent quantum Fourier transform and examine its performance in the period finding and phase estimation protocol. We find fidelities of 84 and 62 percent, respectively. While maintaining the same computation architecture the system can be scaled to more ions using resources that scale favorably (O(N. 2)) with the numberof qubits N.
A hardware implementation of the discrete Pascal transform for image processing
NASA Astrophysics Data System (ADS)
Goodman, Thomas J.; Aburdene, Maurice F.
2006-02-01
The discrete Pascal transform is a polynomial transform with applications in pattern recognition, digital filtering, and digital image processing. It already has been shown that the Pascal transform matrix can be decomposed into a product of binary matrices. Such a factorization leads to a fast and efficient hardware implementation without the use of multipliers, which consume large amounts of hardware. We recently developed a field-programmable gate array (FPGA) implementation to compute the Pascal transform. Our goal was to demonstrate the computational efficiency of the transform while keeping hardware requirements at a minimum. Images are uploaded into memory from a remote computer prior to processing, and the transform coefficients can be offloaded from the FPGA board for analysis. Design techniques like as-soon-as-possible scheduling and adder sharing allowed us to develop a fast and efficient system. An eight-point, one-dimensional transform completes in 13 clock cycles and requires only four adders. An 8x8 two-dimensional transform completes in 240 cycles and requires only a top-level controller in addition to the one-dimensional transform hardware. Finally, through minor modifications to the controller, the transform operations can be pipelined to achieve 100% utilization of the four adders, allowing one eight-point transform to complete every seven clock cycles.
NASA Technical Reports Server (NTRS)
Nguyen, D. T.; Watson, Willie R. (Technical Monitor)
2005-01-01
The overall objectives of this research work are to formulate and validate efficient parallel algorithms, and to efficiently design/implement computer software for solving large-scale acoustic problems, arised from the unified frameworks of the finite element procedures. The adopted parallel Finite Element (FE) Domain Decomposition (DD) procedures should fully take advantages of multiple processing capabilities offered by most modern high performance computing platforms for efficient parallel computation. To achieve this objective. the formulation needs to integrate efficient sparse (and dense) assembly techniques, hybrid (or mixed) direct and iterative equation solvers, proper pre-conditioned strategies, unrolling strategies, and effective processors' communicating schemes. Finally, the numerical performance of the developed parallel finite element procedures will be evaluated by solving series of structural, and acoustic (symmetrical and un-symmetrical) problems (in different computing platforms). Comparisons with existing "commercialized" and/or "public domain" software are also included, whenever possible.
NASA Astrophysics Data System (ADS)
Carraro, F.; Valiani, A.; Caleffi, V.
2018-03-01
Within the framework of the de Saint Venant equations coupled with the Exner equation for morphodynamic evolution, this work presents a new efficient implementation of the Dumbser-Osher-Toro (DOT) scheme for non-conservative problems. The DOT path-conservative scheme is a robust upwind method based on a complete Riemann solver, but it has the drawback of requiring expensive numerical computations. Indeed, to compute the non-linear time evolution in each time step, the DOT scheme requires numerical computation of the flux matrix eigenstructure (the totality of eigenvalues and eigenvectors) several times at each cell edge. In this work, an analytical and compact formulation of the eigenstructure for the de Saint Venant-Exner (dSVE) model is introduced and tested in terms of numerical efficiency and stability. Using the original DOT and PRICE-C (a very efficient FORCE-type method) as reference methods, we present a convergence analysis (error against CPU time) to study the performance of the DOT method with our new analytical implementation of eigenstructure calculations (A-DOT). In particular, the numerical performance of the three methods is tested in three test cases: a movable bed Riemann problem with analytical solution; a problem with smooth analytical solution; a test in which the water flow is characterised by subcritical and supercritical regions. For a given target error, the A-DOT method is always the most efficient choice. Finally, two experimental data sets and different transport formulae are considered to test the A-DOT model in more practical case studies.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Suryanarayana, Phanish, E-mail: phanish.suryanarayana@ce.gatech.edu; Phanish, Deepa
We present an Augmented Lagrangian formulation and its real-space implementation for non-periodic Orbital-Free Density Functional Theory (OF-DFT) calculations. In particular, we rewrite the constrained minimization problem of OF-DFT as a sequence of minimization problems without any constraint, thereby making it amenable to powerful unconstrained optimization algorithms. Further, we develop a parallel implementation of this approach for the Thomas–Fermi–von Weizsacker (TFW) kinetic energy functional in the framework of higher-order finite-differences and the conjugate gradient method. With this implementation, we establish that the Augmented Lagrangian approach is highly competitive compared to the penalty and Lagrange multiplier methods. Additionally, we show that higher-ordermore » finite-differences represent a computationally efficient discretization for performing OF-DFT simulations. Overall, we demonstrate that the proposed formulation and implementation are both efficient and robust by studying selected examples, including systems consisting of thousands of atoms. We validate the accuracy of the computed energies and forces by comparing them with those obtained by existing plane-wave methods.« less
The design and implementation of a parallel unstructured Euler solver using software primitives
NASA Technical Reports Server (NTRS)
Das, R.; Mavriplis, D. J.; Saltz, J.; Gupta, S.; Ponnusamy, R.
1992-01-01
This paper is concerned with the implementation of a three-dimensional unstructured grid Euler-solver on massively parallel distributed-memory computer architectures. The goal is to minimize solution time by achieving high computational rates with a numerically efficient algorithm. An unstructured multigrid algorithm with an edge-based data structure has been adopted, and a number of optimizations have been devised and implemented in order to accelerate the parallel communication rates. The implementation is carried out by creating a set of software tools, which provide an interface between the parallelization issues and the sequential code, while providing a basis for future automatic run-time compilation support. Large practical unstructured grid problems are solved on the Intel iPSC/860 hypercube and Intel Touchstone Delta machine. The quantitative effect of the various optimizations are demonstrated, and we show that the combined effect of these optimizations leads to roughly a factor of three performance improvement. The overall solution efficiency is compared with that obtained on the CRAY-YMP vector supercomputer.
Newman, D M; Hawley, R W; Goeckel, D L; Crawford, R D; Abraham, S; Gallagher, N C
1993-05-10
An efficient storage format was developed for computer-generated holograms for use in electron-beam lithography. This method employs run-length encoding and Lempel-Ziv-Welch compression and succeeds in exposing holograms that were previously infeasible owing to the hologram's tremendous pattern-data file size. These holograms also require significant computation; thus the algorithm was implemented on a parallel computer, which improved performance by 2 orders of magnitude. The decompression algorithm was integrated into the Cambridge electron-beam machine's front-end processor.Although this provides much-needed ability, some hardware enhancements will be required in the future to overcome inadequacies in the current front-end processor that result in a lengthy exposure time.
The multifacet graphically contracted function method. I. Formulation and implementation
NASA Astrophysics Data System (ADS)
Shepard, Ron; Gidofalvi, Gergely; Brozell, Scott R.
2014-08-01
The basic formulation for the multifacet generalization of the graphically contracted function (MFGCF) electronic structure method is presented. The analysis includes the discussion of linear dependency and redundancy of the arc factor parameters, the computation of reduced density matrices, Hamiltonian matrix construction, spin-density matrix construction, the computation of optimization gradients for single-state and state-averaged calculations, graphical wave function analysis, and the efficient computation of configuration state function and Slater determinant expansion coefficients. Timings are given for Hamiltonian matrix element and analytic optimization gradient computations for a range of model problems for full-CI Shavitt graphs, and it is observed that both the energy and the gradient computation scale as O(N2n4) for N electrons and n orbitals. The important arithmetic operations are within dense matrix-matrix product computational kernels, resulting in a computationally efficient procedure. An initial implementation of the method is used to present applications to several challenging chemical systems, including N2 dissociation, cubic H8 dissociation, the symmetric dissociation of H2O, and the insertion of Be into H2. The results are compared to the exact full-CI values and also to those of the previous single-facet GCF expansion form.
The multifacet graphically contracted function method. I. Formulation and implementation.
Shepard, Ron; Gidofalvi, Gergely; Brozell, Scott R
2014-08-14
The basic formulation for the multifacet generalization of the graphically contracted function (MFGCF) electronic structure method is presented. The analysis includes the discussion of linear dependency and redundancy of the arc factor parameters, the computation of reduced density matrices, Hamiltonian matrix construction, spin-density matrix construction, the computation of optimization gradients for single-state and state-averaged calculations, graphical wave function analysis, and the efficient computation of configuration state function and Slater determinant expansion coefficients. Timings are given for Hamiltonian matrix element and analytic optimization gradient computations for a range of model problems for full-CI Shavitt graphs, and it is observed that both the energy and the gradient computation scale as O(N(2)n(4)) for N electrons and n orbitals. The important arithmetic operations are within dense matrix-matrix product computational kernels, resulting in a computationally efficient procedure. An initial implementation of the method is used to present applications to several challenging chemical systems, including N2 dissociation, cubic H8 dissociation, the symmetric dissociation of H2O, and the insertion of Be into H2. The results are compared to the exact full-CI values and also to those of the previous single-facet GCF expansion form.
NASA Astrophysics Data System (ADS)
Hiremath, Varun; Pope, Stephen B.
2013-04-01
The Rate-Controlled Constrained-Equilibrium (RCCE) method is a thermodynamic based dimension reduction method which enables representation of chemistry involving n s species in terms of fewer n r constraints. Here we focus on the application of the RCCE method to Lagrangian particle probability density function based computations. In these computations, at every reaction fractional step, given the initial particle composition (represented using RCCE), we need to compute the reaction mapping, i.e. the particle composition at the end of the time step. In this work we study three different implementations of RCCE for computing this reaction mapping, and compare their relative accuracy and efficiency. These implementations include: (1) RCCE/TIFS (Trajectory In Full Space): this involves solving a system of n s rate-equations for all the species in the full composition space to obtain the reaction mapping. The other two implementations obtain the reaction mapping by solving a reduced system of n r rate-equations obtained by projecting the n s rate-equations for species evaluated in the full space onto the constrained subspace. These implementations include (2) RCCE: this is the classical implementation of RCCE which uses a direct projection of the rate-equations for species onto the constrained subspace; and (3) RCCE/RAMP (Reaction-mixing Attracting Manifold Projector): this is a new implementation introduced here which uses an alternative projector obtained using the RAMP approach. We test these three implementations of RCCE for methane/air premixed combustion in the partially-stirred reactor with chemistry represented using the n s=31 species GRI-Mech 1.2 mechanism with n r=13 to 19 constraints. We show that: (a) the classical RCCE implementation involves an inaccurate projector which yields large errors (over 50%) in the reaction mapping; (b) both RCCE/RAMP and RCCE/TIFS approaches yield significantly lower errors (less than 2%); and (c) overall the RCCE/TIFS approach is the most accurate, efficient (by orders of magnitude) and robust implementation.
Lab4CE: A Remote Laboratory for Computer Education
ERIC Educational Resources Information Center
Broisin, Julien; Venant, Rémi; Vidal, Philippe
2017-01-01
Remote practical activities have been demonstrated to be efficient when learners come to acquire inquiry skills. In computer science education, virtualization technologies are gaining popularity as this technological advance enables instructors to implement realistic practical learning activities, and learners to engage in authentic and…
Programming Digital Stories and How-to Animations
ERIC Educational Resources Information Center
Hansen, Alexandria Killian; Iveland, Ashley; Harlow, Danielle Boyd; Dwyer, Hilary; Franklin, Diana
2015-01-01
As science teachers continue preparing for implementation of the "Next Generation Science Standards," one recommendation is to use computer programming as a promising context to efficiently integrate science and engineering. In this article, a interdisciplinary team of educational researchers and computer scientists describe how to use…
Workflow computing. Improving management and efficiency of pathology diagnostic services.
Buffone, G J; Moreau, D; Beck, J R
1996-04-01
Traditionally, information technology in health care has helped practitioners to collect, store, and present information and also to add a degree of automation to simple tasks (instrument interfaces supporting result entry, for example). Thus commercially available information systems do little to support the need to model, execute, monitor, coordinate, and revise the various complex clinical processes required to support health-care delivery. Workflow computing, which is already implemented and improving the efficiency of operations in several nonmedical industries, can address the need to manage complex clinical processes. Workflow computing not only provides a means to define and manage the events, roles, and information integral to health-care delivery but also supports the explicit implementation of policy or rules appropriate to the process. This article explains how workflow computing may be applied to health-care and the inherent advantages of the technology, and it defines workflow system requirements for use in health-care delivery with special reference to diagnostic pathology.
Potential implementation of reservoir computing models based on magnetic skyrmions
NASA Astrophysics Data System (ADS)
Bourianoff, George; Pinna, Daniele; Sitte, Matthias; Everschor-Sitte, Karin
2018-05-01
Reservoir Computing is a type of recursive neural network commonly used for recognizing and predicting spatio-temporal events relying on a complex hierarchy of nested feedback loops to generate a memory functionality. The Reservoir Computing paradigm does not require any knowledge of the reservoir topology or node weights for training purposes and can therefore utilize naturally existing networks formed by a wide variety of physical processes. Most efforts to implement reservoir computing prior to this have focused on utilizing memristor techniques to implement recursive neural networks. This paper examines the potential of magnetic skyrmion fabrics and the complex current patterns which form in them as an attractive physical instantiation for Reservoir Computing. We argue that their nonlinear dynamical interplay resulting from anisotropic magnetoresistance and spin-torque effects allows for an effective and energy efficient nonlinear processing of spatial temporal events with the aim of event recognition and prediction.
NASA Astrophysics Data System (ADS)
Zhang, H.-m.; Chen, X.-f.; Chang, S.
- It is difficult to compute synthetic seismograms for a layered half-space with sources and receivers at close to or the same depths using the generalized R/T coefficient method (Kennett, 1983; Luco and Apsel, 1983; Yao and Harkrider, 1983; Chen, 1993), because the wavenumber integration converges very slowly. A semi-analytic method for accelerating the convergence, in which part of the integration is implemented analytically, was adopted by some authors (Apsel and Luco, 1983; Hisada, 1994, 1995). In this study, based on the principle of the Repeated Averaging Method (Dahlquist and Björck, 1974; Chang, 1988), we propose an alternative, efficient, numerical method, the peak-trough averaging method (PTAM), to overcome the difficulty mentioned above. Compared with the semi-analytic method, PTAM is not only much simpler mathematically and easier to implement in practice, but also more efficient. Using numerical examples, we illustrate the validity, accuracy and efficiency of the new method.
NASA Astrophysics Data System (ADS)
Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng
2018-02-01
De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significant reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
Efficient Hardware Implementation of the Lightweight Block Encryption Algorithm LEA
Lee, Donggeon; Kim, Dong-Chan; Kwon, Daesung; Kim, Howon
2014-01-01
Recently, due to the advent of resource-constrained trends, such as smartphones and smart devices, the computing environment is changing. Because our daily life is deeply intertwined with ubiquitous networks, the importance of security is growing. A lightweight encryption algorithm is essential for secure communication between these kinds of resource-constrained devices, and many researchers have been investigating this field. Recently, a lightweight block cipher called LEA was proposed. LEA was originally targeted for efficient implementation on microprocessors, as it is fast when implemented in software and furthermore, it has a small memory footprint. To reflect on recent technology, all required calculations utilize 32-bit wide operations. In addition, the algorithm is comprised of not complex S-Box-like structures but simple Addition, Rotation, and XOR operations. To the best of our knowledge, this paper is the first report on a comprehensive hardware implementation of LEA. We present various hardware structures and their implementation results according to key sizes. Even though LEA was originally targeted at software efficiency, it also shows high efficiency when implemented as hardware. PMID:24406859
NASA Astrophysics Data System (ADS)
Ren, Feixiang; Huang, Jinsheng; Terauchi, Mutsuhiro; Jiang, Ruyi; Klette, Reinhard
A robust and efficient lane detection system is an essential component of Lane Departure Warning Systems, which are commonly used in many vision-based Driver Assistance Systems (DAS) in intelligent transportation. Various computation platforms have been proposed in the past few years for the implementation of driver assistance systems (e.g., PC, laptop, integrated chips, PlayStation, and so on). In this paper, we propose a new platform for the implementation of lane detection, which is based on a mobile phone (the iPhone). Due to physical limitations of the iPhone w.r.t. memory and computing power, a simple and efficient lane detection algorithm using a Hough transform is developed and implemented on the iPhone, as existing algorithms developed based on the PC platform are not suitable for mobile phone devices (currently). Experiments of the lane detection algorithm are made both on PC and on iPhone.
NASA Technical Reports Server (NTRS)
Lyster, P. M.; Liewer, P. C.; Decyk, V. K.; Ferraro, R. D.
1995-01-01
A three-dimensional electrostatic particle-in-cell (PIC) plasma simulation code has been developed on coarse-grain distributed-memory massively parallel computers with message passing communications. Our implementation is the generalization to three-dimensions of the general concurrent particle-in-cell (GCPIC) algorithm. In the GCPIC algorithm, the particle computation is divided among the processors using a domain decomposition of the simulation domain. In a three-dimensional simulation, the domain can be partitioned into one-, two-, or three-dimensional subdomains ("slabs," "rods," or "cubes") and we investigate the efficiency of the parallel implementation of the push for all three choices. The present implementation runs on the Intel Touchstone Delta machine at Caltech; a multiple-instruction-multiple-data (MIMD) parallel computer with 512 nodes. We find that the parallel efficiency of the push is very high, with the ratio of communication to computation time in the range 0.3%-10.0%. The highest efficiency (> 99%) occurs for a large, scaled problem with 64(sup 3) particles per processing node (approximately 134 million particles of 512 nodes) which has a push time of about 250 ns per particle per time step. We have also developed expressions for the timing of the code which are a function of both code parameters (number of grid points, particles, etc.) and machine-dependent parameters (effective FLOP rate, and the effective interprocessor bandwidths for the communication of particles and grid points). These expressions can be used to estimate the performance of scaled problems--including those with inhomogeneous plasmas--to other parallel machines once the machine-dependent parameters are known.
Complexity of the Quantum Adiabatic Algorithm
NASA Astrophysics Data System (ADS)
Hen, Itay
2013-03-01
The Quantum Adiabatic Algorithm (QAA) has been proposed as a mechanism for efficiently solving optimization problems on a quantum computer. Since adiabatic computation is analog in nature and does not require the design and use of quantum gates, it can be thought of as a simpler and perhaps more profound method for performing quantum computations that might also be easier to implement experimentally. While these features have generated substantial research in QAA, to date there is still a lack of solid evidence that the algorithm can outperform classical optimization algorihms. Here, we discuss several aspects of the quantum adiabatic algorithm: We analyze the efficiency of the algorithm on several ``hard'' (NP) computational problems. Studying the size dependence of the typical minimum energy gap of the Hamiltonians of these problems using quantum Monte Carlo methods, we find that while for most problems the minimum gap decreases exponentially with the size of the problem, indicating that the QAA is not more efficient than existing classical search algorithms, for other problems there is evidence to suggest that the gap may be polynomial near the phase transition. We also discuss applications of the QAA to ``real life'' problems and how they can be implemented on currently available (albeit prototypical) quantum hardware such as ``D-Wave One'', that impose serious restrictions as to which type of problems may be tested. Finally, we discuss different approaches to find improved implementations of the algorithm such as local adiabatic evolution, adaptive methods, local search in Hamiltonian space and others.
Performance of parallel computation using CUDA for solving the one-dimensional elasticity equations
NASA Astrophysics Data System (ADS)
Darmawan, J. B. B.; Mungkasi, S.
2017-01-01
In this paper, we investigate the performance of parallel computation in solving the one-dimensional elasticity equations. Elasticity equations are usually implemented in engineering science. Solving these equations fast and efficiently is desired. Therefore, we propose the use of parallel computation. Our parallel computation uses CUDA of the NVIDIA. Our research results show that parallel computation using CUDA has a great advantage and is powerful when the computation is of large scale.
Biyikli, Emre; To, Albert C.
2015-01-01
A new topology optimization method called the Proportional Topology Optimization (PTO) is presented. As a non-sensitivity method, PTO is simple to understand, easy to implement, and is also efficient and accurate at the same time. It is implemented into two MATLAB programs to solve the stress constrained and minimum compliance problems. Descriptions of the algorithm and computer programs are provided in detail. The method is applied to solve three numerical examples for both types of problems. The method shows comparable efficiency and accuracy with an existing optimality criteria method which computes sensitivities. Also, the PTO stress constrained algorithm and minimum compliance algorithm are compared by feeding output from one algorithm to the other in an alternative manner, where the former yields lower maximum stress and volume fraction but higher compliance compared to the latter. Advantages and disadvantages of the proposed method and future works are discussed. The computer programs are self-contained and publicly shared in the website www.ptomethod.org. PMID:26678849
Efficient Implementation of the Pairing on Mobilephones Using BREW
NASA Astrophysics Data System (ADS)
Yoshitomi, Motoi; Takagi, Tsuyoshi; Kiyomoto, Shinsaku; Tanaka, Toshiaki
Pairing based cryptosystems can accomplish novel security applications such as ID-based cryptosystems, which have not been constructed efficiently without the pairing. The processing speed of the pairing based cryptosystems is relatively slow compared with the other conventional public key cryptosystems. However, several efficient algorithms for computing the pairing have been proposed, namely Duursma-Lee algorithm and its variant ηT pairing. In this paper, we present an efficient implementation of the pairing over some mobilephones. Moreover, we compare the processing speed of the pairing with that of the other standard public key cryptosystems, i. e. RSA cryptosystem and elliptic curve cryptosystem. Indeed the processing speed of our implementation in ARM9 processors on BREW achieves under 100 milliseconds using the supersingular curve over F397. In addition, the pairing is more efficient than the other public key cryptosystems, and the pairing can be achieved enough also on BREW mobilephones. It has become efficient enough to implement security applications, such as short signature, ID-based cryptosystems or broadcast encryption, using the pairing on BREW mobilephones.
Parallel 3D Mortar Element Method for Adaptive Nonconforming Meshes
NASA Technical Reports Server (NTRS)
Feng, Huiyu; Mavriplis, Catherine; VanderWijngaart, Rob; Biswas, Rupak
2004-01-01
High order methods are frequently used in computational simulation for their high accuracy. An efficient way to avoid unnecessary computation in smooth regions of the solution is to use adaptive meshes which employ fine grids only in areas where they are needed. Nonconforming spectral elements allow the grid to be flexibly adjusted to satisfy the computational accuracy requirements. The method is suitable for computational simulations of unsteady problems with very disparate length scales or unsteady moving features, such as heat transfer, fluid dynamics or flame combustion. In this work, we select the Mark Element Method (MEM) to handle the non-conforming interfaces between elements. A new technique is introduced to efficiently implement MEM in 3-D nonconforming meshes. By introducing an "intermediate mortar", the proposed method decomposes the projection between 3-D elements and mortars into two steps. In each step, projection matrices derived in 2-D are used. The two-step method avoids explicitly forming/deriving large projection matrices for 3-D meshes, and also helps to simplify the implementation. This new technique can be used for both h- and p-type adaptation. This method is applied to an unsteady 3-D moving heat source problem. With our new MEM implementation, mesh adaptation is able to efficiently refine the grid near the heat source and coarsen the grid once the heat source passes. The savings in computational work resulting from the dynamic mesh adaptation is demonstrated by the reduction of the the number of elements used and CPU time spent. MEM and mesh adaptation, respectively, bring irregularity and dynamics to the computer memory access pattern. Hence, they provide a good way to gauge the performance of computer systems when running scientific applications whose memory access patterns are irregular and unpredictable. We select a 3-D moving heat source problem as the Unstructured Adaptive (UA) grid benchmark, a new component of the NAS Parallel Benchmarks (NPB). In this paper, we present some interesting performance results of ow OpenMP parallel implementation on different architectures such as the SGI Origin2000, SGI Altix, and Cray MTA-2.
Cilfone, Nicholas A.; Kirschner, Denise E.; Linderman, Jennifer J.
2015-01-01
Biologically related processes operate across multiple spatiotemporal scales. For computational modeling methodologies to mimic this biological complexity, individual scale models must be linked in ways that allow for dynamic exchange of information across scales. A powerful methodology is to combine a discrete modeling approach, agent-based models (ABMs), with continuum models to form hybrid models. Hybrid multi-scale ABMs have been used to simulate emergent responses of biological systems. Here, we review two aspects of hybrid multi-scale ABMs: linking individual scale models and efficiently solving the resulting model. We discuss the computational choices associated with aspects of linking individual scale models while simultaneously maintaining model tractability. We demonstrate implementations of existing numerical methods in the context of hybrid multi-scale ABMs. Using an example model describing Mycobacterium tuberculosis infection, we show relative computational speeds of various combinations of numerical methods. Efficient linking and solution of hybrid multi-scale ABMs is key to model portability, modularity, and their use in understanding biological phenomena at a systems level. PMID:26366228
Two-dimensional nonsteady viscous flow simulation on the Navier-Stokes computer miniNode
NASA Technical Reports Server (NTRS)
Nosenchuck, Daniel M.; Littman, Michael G.; Flannery, William
1986-01-01
The needs of large-scale scientific computation are outpacing the growth in performance of mainframe supercomputers. In particular, problems in fluid mechanics involving complex flow simulations require far more speed and capacity than that provided by current and proposed Class VI supercomputers. To address this concern, the Navier-Stokes Computer (NSC) was developed. The NSC is a parallel-processing machine, comprised of individual Nodes, each comparable in performance to current supercomputers. The global architecture is that of a hypercube, and a 128-Node NSC has been designed. New architectural features, such as a reconfigurable many-function ALU pipeline and a multifunction memory-ALU switch, have provided the capability to efficiently implement a wide range of algorithms. Efficient algorithms typically involve numerically intensive tasks, which often include conditional operations. These operations may be efficiently implemented on the NSC without, in general, sacrificing vector-processing speed. To illustrate the architecture, programming, and several of the capabilities of the NSC, the simulation of two-dimensional, nonsteady viscous flows on a prototype Node, called the miniNode, is presented.
Paraskevopoulou, Sivylla E; Barsakcioglu, Deren Y; Saberi, Mohammed R; Eftekhar, Amir; Constandinou, Timothy G
2013-04-30
Next generation neural interfaces aspire to achieve real-time multi-channel systems by integrating spike sorting on chip to overcome limitations in communication channel capacity. The feasibility of this approach relies on developing highly efficient algorithms for feature extraction and clustering with the potential of low-power hardware implementation. We are proposing a feature extraction method, not requiring any calibration, based on first and second derivative features of the spike waveform. The accuracy and computational complexity of the proposed method are quantified and compared against commonly used feature extraction methods, through simulation across four datasets (with different single units) at multiple noise levels (ranging from 5 to 20% of the signal amplitude). The average classification error is shown to be below 7% with a computational complexity of 2N-3, where N is the number of sample points of each spike. Overall, this method presents a good trade-off between accuracy and computational complexity and is thus particularly well-suited for hardware-efficient implementation. Copyright © 2013 Elsevier B.V. All rights reserved.
Computer Assisted Communication within the Classroom: Interactive Lecturing.
ERIC Educational Resources Information Center
Herr, Richard B.
At the University of Delaware student-teacher communication within the classroom was enhanced through the implementation of a versatile, yet cost efficient, application of computer technology. A single microcomputer at a teacher's station controls a network of student keypad/display stations to provide individual channels of continuous…
Implementation of Steiner point of fuzzy set.
Liang, Jiuzhen; Wang, Dejiang
2014-01-01
This paper deals with the implementation of Steiner point of fuzzy set. Some definitions and properties of Steiner point are investigated and extended to fuzzy set. This paper focuses on establishing efficient methods to compute Steiner point of fuzzy set. Two strategies of computing Steiner point of fuzzy set are proposed. One is called linear combination of Steiner points computed by a series of crisp α-cut sets of the fuzzy set. The other is an approximate method, which is trying to find the optimal α-cut set approaching the fuzzy set. Stability analysis of Steiner point of fuzzy set is also studied. Some experiments on image processing are given, in which the two methods are applied for implementing Steiner point of fuzzy image, and both strategies show their own advantages in computing Steiner point of fuzzy set.
Wan, Shixiang; Zou, Quan
2017-01-01
Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. Extreme increase in next-generation sequencing results in shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses. Based on HAlign and Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. The experiments in the DNA and protein large scale data sets, which are more than 1GB files, showed that HAlign II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resource. THAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign.
Parallel Calculation of Sensitivity Derivatives for Aircraft Design using Automatic Differentiation
NASA Technical Reports Server (NTRS)
Bischof, c. H.; Green, L. L.; Haigler, K. J.; Knauff, T. L., Jr.
1994-01-01
Sensitivity derivative (SD) calculation via automatic differentiation (AD) typical of that required for the aerodynamic design of a transport-type aircraft is considered. Two ways of computing SD via code generated by the ADIFOR automatic differentiation tool are compared for efficiency and applicability to problems involving large numbers of design variables. A vector implementation on a Cray Y-MP computer is compared with a coarse-grained parallel implementation on an IBM SP1 computer, employing a Fortran M wrapper. The SD are computed for a swept transport wing in turbulent, transonic flow; the number of geometric design variables varies from 1 to 60 with coupling between a wing grid generation program and a state-of-the-art, 3-D computational fluid dynamics program, both augmented for derivative computation via AD. For a small number of design variables, the Cray Y-MP implementation is much faster. As the number of design variables grows, however, the IBM SP1 becomes an attractive alternative in terms of compute speed, job turnaround time, and total memory available for solutions with large numbers of design variables. The coarse-grained parallel implementation also can be moved easily to a network of workstations.
Demonstration of a small programmable quantum computer with atomic qubits.
Debnath, S; Linke, N M; Figgatt, C; Landsman, K A; Wright, K; Monroe, C
2016-08-04
Quantum computers can solve certain problems more efficiently than any possible conventional computer. Small quantum algorithms have been demonstrated on multiple quantum computing platforms, many specifically tailored in hardware to implement a particular algorithm or execute a limited number of computational paths. Here we demonstrate a five-qubit trapped-ion quantum computer that can be programmed in software to implement arbitrary quantum algorithms by executing any sequence of universal quantum logic gates. We compile algorithms into a fully connected set of gate operations that are native to the hardware and have a mean fidelity of 98 per cent. Reconfiguring these gate sequences provides the flexibility to implement a variety of algorithms without altering the hardware. As examples, we implement the Deutsch-Jozsa and Bernstein-Vazirani algorithms with average success rates of 95 and 90 per cent, respectively. We also perform a coherent quantum Fourier transform on five trapped-ion qubits for phase estimation and period finding with average fidelities of 62 and 84 per cent, respectively. This small quantum computer can be scaled to larger numbers of qubits within a single register, and can be further expanded by connecting several such modules through ion shuttling or photonic quantum channels.
Demonstration of a small programmable quantum computer with atomic qubits
NASA Astrophysics Data System (ADS)
Debnath, S.; Linke, N. M.; Figgatt, C.; Landsman, K. A.; Wright, K.; Monroe, C.
2016-08-01
Quantum computers can solve certain problems more efficiently than any possible conventional computer. Small quantum algorithms have been demonstrated on multiple quantum computing platforms, many specifically tailored in hardware to implement a particular algorithm or execute a limited number of computational paths. Here we demonstrate a five-qubit trapped-ion quantum computer that can be programmed in software to implement arbitrary quantum algorithms by executing any sequence of universal quantum logic gates. We compile algorithms into a fully connected set of gate operations that are native to the hardware and have a mean fidelity of 98 per cent. Reconfiguring these gate sequences provides the flexibility to implement a variety of algorithms without altering the hardware. As examples, we implement the Deutsch-Jozsa and Bernstein-Vazirani algorithms with average success rates of 95 and 90 per cent, respectively. We also perform a coherent quantum Fourier transform on five trapped-ion qubits for phase estimation and period finding with average fidelities of 62 and 84 per cent, respectively. This small quantum computer can be scaled to larger numbers of qubits within a single register, and can be further expanded by connecting several such modules through ion shuttling or photonic quantum channels.
Alfredsson, Jayne; Plichart, Patrick; Zary, Nabil
2012-01-01
Research on computer supported scoring of assessments in health care education has mainly focused on automated scoring. Little attention has been given to how informatics can support the currently predominant human-based grading approach. This paper reports steps taken to develop a model for a computer supported scoring process that focuses on optimizing a task that was previously undertaken without computer support. The model was also implemented in the open source assessment platform TAO in order to study its benefits. Ability to score test takers anonymously, analytics on the graders reliability and a more time efficient process are example of observed benefits. A computer supported scoring will increase the quality of the assessment results.
Innovative architectures for dense multi-microprocessor computers
NASA Technical Reports Server (NTRS)
Donaldson, Thomas; Doty, Karl; Engle, Steven W.; Larson, Robert E.; O'Reilly, John G.
1988-01-01
The results of a Phase I Small Business Innovative Research (SBIR) project performed for the NASA Langley Computational Structural Mechanics Group are described. The project resulted in the identification of a family of chordal-ring interconnection architectures with excellent potential to serve as the basis for new multimicroprocessor (MMP) computers. The paper presents examples of how computational algorithms from structural mechanics can be efficiently implemented on the chordal-ring architecture.
NASA Astrophysics Data System (ADS)
Petit, C.; Le Louarn, M.; Fusco, T.; Madec, P.-Y.
2011-09-01
Various tomographic control solutions have been proposed during the last decades to ensure efficient or even optimal closed-loop correction to tomographic Adaptive Optics (AO) concepts such as Laser Tomographic AO (LTAO), Multi-Conjugate AO (MCAO). The optimal solution, based on Linear Quadratic Gaussian (LQG) approach, as well as suboptimal but efficient solutions such as Pseudo-Open Loop Control (POLC) require multiple Matrix Vector Multiplications (MVM). Disregarding their respective performance, these efficient control solutions thus exhibit strong increase of on-line complexity and their implementation may become difficult in demanding cases. Among them, two cases are of particular interest. First, the system Real-Time Computer architecture and implementation is derived from past or present solutions and does not support multiple MVM. This is the case of the AO Facility which RTC architecture is derived from the SPARTA platform and inherits its simple MVM architecture, which does not fit with LTAO control solutions for instance. Second, considering future systems such as Extremely Large Telescopes, the number of degrees of freedom is twenty to one hundred times bigger than present systems. In these conditions, tomographic control solutions can hardly be used in their standard form and optimized implementation shall be considered. Single MVM tomographic control solutions represent a potential solution, and straightforward solutions such as Virtual Deformable Mirrors have been already proposed for LTAO but with tuning issues. We investigate in this paper the possibility to derive from tomographic control solutions, such as POLC or LQG, simplified control solutions ensuring simple MVM architecture and that could be thus implemented on nowadays systems or future complex systems. We theoretically derive various solutions and analyze their respective performance on various systems thanks to numerical simulation. We discuss the optimization of their performance and stability issues with respect to classic control solutions. We finally discuss off-line computation and implementation constraints.
CBES--An Efficient Implementation of the Coursewriter Language.
ERIC Educational Resources Information Center
Franks, Edward W.
An extensive computer based education system (CBES) built around the IBM Coursewriter III program product at Ohio State University is described. In this system, numerous extensions have been added to the Coursewriter III language to provide capabilities needed to implement sophisticated instructional strategies. CBES design goals include lower CPU…
Polymorphous Computing Architectures
2007-12-12
provide a multiprocessor implementation. In this work, we introduce the Atomos transactional programming language, which is the first to include...implicit transactions, strong atomicity, and a scalable multiprocessor implementation [47]. Atomos is derived from Java, but replaces its synchronization...and conditional waiting constructs with transactional alternatives. The Atomos conditional waiting proposal is tailored to allow efficient
Visser, Marco D.; McMahon, Sean M.; Merow, Cory; Dixon, Philip M.; Record, Sydne; Jongejans, Eelke
2015-01-01
Computation has become a critical component of research in biology. A risk has emerged that computational and programming challenges may limit research scope, depth, and quality. We review various solutions to common computational efficiency problems in ecological and evolutionary research. Our review pulls together material that is currently scattered across many sources and emphasizes those techniques that are especially effective for typical ecological and environmental problems. We demonstrate how straightforward it can be to write efficient code and implement techniques such as profiling or parallel computing. We supply a newly developed R package (aprof) that helps to identify computational bottlenecks in R code and determine whether optimization can be effective. Our review is complemented by a practical set of examples and detailed Supporting Information material (S1–S3 Texts) that demonstrate large improvements in computational speed (ranging from 10.5 times to 14,000 times faster). By improving computational efficiency, biologists can feasibly solve more complex tasks, ask more ambitious questions, and include more sophisticated analyses in their research. PMID:25811842
Implementation of a block Lanczos algorithm for Eigenproblem solution of gyroscopic systems
NASA Technical Reports Server (NTRS)
Gupta, Kajal K.; Lawson, Charles L.
1987-01-01
The details of implementation of a general numerical procedure developed for the accurate and economical computation of natural frequencies and associated modes of any elastic structure rotating along an arbitrary axis are described. A block version of the Lanczos algorithm is derived for the solution that fully exploits associated matrix sparsity and employs only real numbers in all relevant computations. It is also capable of determining multiple roots and proves to be most efficient when compared to other, similar, exisiting techniques.
Entrepreneurial systems. Do it yourself for profit and pleasure.
Manuel, G; Young, K
1991-11-28
You don't have to have a degree in computing or a 1 m pounds budget to create a system that improves efficiency or saves money. Gren Manuel introduces a special report which celebrates the small-scale initiatives devised and implemented by health staff armed only with a personal computer.
Efficient option valuation of single and double barrier options
NASA Astrophysics Data System (ADS)
Kabaivanov, Stanimir; Milev, Mariyan; Koleva-Petkova, Dessislava; Vladev, Veselin
2017-12-01
In this paper we present an implementation of pricing algorithm for single and double barrier options using Mellin transformation with Maximum Entropy Inversion and its suitability for real-world applications. A detailed analysis of the applied algorithm is accompanied by implementation in C++ that is then compared to existing solutions in terms of efficiency and computational power. We then compare the applied method with existing closed-form solutions and well known methods of pricing barrier options that are based on finite differences.
NASA Astrophysics Data System (ADS)
Tomaro, Robert F.
1998-07-01
The present research is aimed at developing a higher-order, spatially accurate scheme for both steady and unsteady flow simulations using unstructured meshes. The resulting scheme must work on a variety of general problems to ensure the creation of a flexible, reliable and accurate aerodynamic analysis tool. To calculate the flow around complex configurations, unstructured grids and the associated flow solvers have been developed. Efficient simulations require the minimum use of computer memory and computational times. Unstructured flow solvers typically require more computer memory than a structured flow solver due to the indirect addressing of the cells. The approach taken in the present research was to modify an existing three-dimensional unstructured flow solver to first decrease the computational time required for a solution and then to increase the spatial accuracy. The terms required to simulate flow involving non-stationary grids were also implemented. First, an implicit solution algorithm was implemented to replace the existing explicit procedure. Several test cases, including internal and external, inviscid and viscous, two-dimensional, three-dimensional and axi-symmetric problems, were simulated for comparison between the explicit and implicit solution procedures. The increased efficiency and robustness of modified code due to the implicit algorithm was demonstrated. Two unsteady test cases, a plunging airfoil and a wing undergoing bending and torsion, were simulated using the implicit algorithm modified to include the terms required for a moving and/or deforming grid. Secondly, a higher than second-order spatially accurate scheme was developed and implemented into the baseline code. Third- and fourth-order spatially accurate schemes were implemented and tested. The original dissipation was modified to include higher-order terms and modified near shock waves to limit pre- and post-shock oscillations. The unsteady cases were repeated using the higher-order spatially accurate code. The new solutions were compared with those obtained using the second-order spatially accurate scheme. Finally, the increased efficiency of using an implicit solution algorithm in a production Computational Fluid Dynamics flow solver was demonstrated for steady and unsteady flows. A third- and fourth-order spatially accurate scheme has been implemented creating a basis for a state-of-the-art aerodynamic analysis tool.
Programming the Navier-Stokes computer: An abstract machine model and a visual editor
NASA Technical Reports Server (NTRS)
Middleton, David; Crockett, Tom; Tomboulian, Sherry
1988-01-01
The Navier-Stokes computer is a parallel computer designed to solve Computational Fluid Dynamics problems. Each processor contains several floating point units which can be configured under program control to implement a vector pipeline with several inputs and outputs. Since the development of an effective compiler for this computer appears to be very difficult, machine level programming seems necessary and support tools for this process have been studied. These support tools are organized into a graphical program editor. A programming process is described by which appropriate computations may be efficiently implemented on the Navier-Stokes computer. The graphical editor would support this programming process, verifying various programmer choices for correctness and deducing values such as pipeline delays and network configurations. Step by step details are provided and demonstrated with two example programs.
Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
2016-07-19
Computing alignments between two or more sequences are common operations frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
International Conference on Stiff Computation Held at Park City, Utah on April 12, 13 and 14, 1982.
1983-05-31
algorithm should be designed which can analyse a system description and find out for the user ~to which class of problems his system belongs... Dove...processors designed to implement aspecific solution process. yrne: IEE floating point chip design " used by INE and others is an example (Xahan)...the...hardware speciaList has designed his computer such that the paraL#L features can be addressed convenientLy and !! ’) efficientLy, and 4;) the software
Implementation of real-time digital signal processing systems
NASA Technical Reports Server (NTRS)
Narasimha, M.; Peterson, A.; Narayan, S.
1978-01-01
Special purpose hardware implementation of DFT Computers and digital filters is considered in the light of newly introduced algorithms and IC devices. Recent work by Winograd on high-speed convolution techniques for computing short length DFT's, has motivated the development of more efficient algorithms, compared to the FFT, for evaluating the transform of longer sequences. Among these, prime factor algorithms appear suitable for special purpose hardware implementations. Architectural considerations in designing DFT computers based on these algorithms are discussed. With the availability of monolithic multiplier-accumulators, a direct implementation of IIR and FIR filters, using random access memories in place of shift registers, appears attractive. The memory addressing scheme involved in such implementations is discussed. A simple counter set-up to address the data memory in the realization of FIR filters is also described. The combination of a set of simple filters (weighting network) and a DFT computer is shown to realize a bank of uniform bandpass filters. The usefulness of this concept in arriving at a modular design for a million channel spectrum analyzer, based on microprocessors, is discussed.
Yahoo! Compute Coop (YCC). A Next-Generation Passive Cooling Design for Data Centers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Robison, AD; Page, Christina; Lytle, Bob
The purpose of the Yahoo! Compute Coop (YCC) project is to research, design, build and implement a greenfield "efficient data factory" and to specifically demonstrate that the YCC concept is feasible for large facilities housing tens of thousands of heat-producing computing servers. The project scope for the Yahoo! Compute Coop technology includes: - Analyzing and implementing ways in which to drastically decrease energy consumption and waste output. - Analyzing the laws of thermodynamics and implementing naturally occurring environmental effects in order to maximize the "free-cooling" for large data center facilities. "Free cooling" is the direct usage of outside air tomore » cool the servers vs. traditional "mechanical cooling" which is supplied by chillers or other Dx units. - Redesigning and simplifying building materials and methods. - Shortening and simplifying build-to-operate schedules while at the same time reducing initial build and operating costs. Selected for its favorable climate, the greenfield project site is located in Lockport, NY. Construction on the 9.0 MW critical load data center facility began in May 2009, with the fully operational facility deployed in September 2010. The relatively low initial build cost, compatibility with current server and network models, and the efficient use of power and water are all key features that make it a highly compatible and globally implementable design innovation for the data center industry. Yahoo! Compute Coop technology is designed to achieve 99.98% uptime availability. This integrated building design allows for free cooling 99% of the year via the building's unique shape and orientation, as well as server physical configuration.« less
NASA Technical Reports Server (NTRS)
Janetzke, David C.; Murthy, Durbha V.
1991-01-01
Aeroelastic analysis is multi-disciplinary and computationally expensive. Hence, it can greatly benefit from parallel processing. As part of an effort to develop an aeroelastic capability on a distributed memory transputer network, a parallel algorithm for the computation of aerodynamic influence coefficients is implemented on a network of 32 transputers. The aerodynamic influence coefficients are calculated using a 3-D unsteady aerodynamic model and a parallel discretization. Efficiencies up to 85 percent were demonstrated using 32 processors. The effect of subtask ordering, problem size, and network topology are presented. A comparison to results on a shared memory computer indicates that higher speedup is achieved on the distributed memory system.
Computational Discovery of Materials Using the Firefly Algorithm
NASA Astrophysics Data System (ADS)
Avendaño-Franco, Guillermo; Romero, Aldo
Our current ability to model physical phenomena accurately, the increase computational power and better algorithms are the driving forces behind the computational discovery and design of novel materials, allowing for virtual characterization before their realization in the laboratory. We present the implementation of a novel firefly algorithm, a population-based algorithm for global optimization for searching the structure/composition space. This novel computation-intensive approach naturally take advantage of concurrency, targeted exploration and still keeping enough diversity. We apply the new method in both periodic and non-periodic structures and we present the implementation challenges and solutions to improve efficiency. The implementation makes use of computational materials databases and network analysis to optimize the search and get insights about the geometric structure of local minima on the energy landscape. The method has been implemented in our software PyChemia, an open-source package for materials discovery. We acknowledge the support of DMREF-NSF 1434897 and the Donors of the American Chemical Society Petroleum Research Fund for partial support of this research under Contract 54075-ND10.
Fractional Steps methods for transient problems on commodity computer architectures
NASA Astrophysics Data System (ADS)
Krotkiewski, M.; Dabrowski, M.; Podladchikov, Y. Y.
2008-12-01
Fractional Steps methods are suitable for modeling transient processes that are central to many geological applications. Low memory requirements and modest computational complexity facilitates calculations on high-resolution three-dimensional models. An efficient implementation of Alternating Direction Implicit/Locally One-Dimensional schemes for an Opteron-based shared memory system is presented. The memory bandwidth usage, the main bottleneck on modern computer architectures, is specially addressed. High efficiency of above 2 GFlops per CPU is sustained for problems of 1 billion degrees of freedom. The optimized sequential implementation of all 1D sweeps is comparable in execution time to copying the used data in the memory. Scalability of the parallel implementation on up to 8 CPUs is close to perfect. Performing one timestep of the Locally One-Dimensional scheme on a system of 1000 3 unknowns on 8 CPUs takes only 11 s. We validate the LOD scheme using a computational model of an isolated inclusion subject to a constant far field flux. Next, we study numerically the evolution of a diffusion front and the effective thermal conductivity of composites consisting of multiple inclusions and compare the results with predictions based on the differential effective medium approach. Finally, application of the developed parabolic solver is suggested for a real-world problem of fluid transport and reactions inside a reservoir.
NASA Technical Reports Server (NTRS)
Weed, Richard Allen; Sankar, L. N.
1994-01-01
An increasing amount of research activity in computational fluid dynamics has been devoted to the development of efficient algorithms for parallel computing systems. The increasing performance to price ratio of engineering workstations has led to research to development procedures for implementing a parallel computing system composed of distributed workstations. This thesis proposal outlines an ongoing research program to develop efficient strategies for performing three-dimensional flow analysis on distributed computing systems. The PVM parallel programming interface was used to modify an existing three-dimensional flow solver, the TEAM code developed by Lockheed for the Air Force, to function as a parallel flow solver on clusters of workstations. Steady flow solutions were generated for three different wing and body geometries to validate the code and evaluate code performance. The proposed research will extend the parallel code development to determine the most efficient strategies for unsteady flow simulations.
Yiu, Sean; Tom, Brian Dm
2017-01-01
Several researchers have described two-part models with patient-specific stochastic processes for analysing longitudinal semicontinuous data. In theory, such models can offer greater flexibility than the standard two-part model with patient-specific random effects. However, in practice, the high dimensional integrations involved in the marginal likelihood (i.e. integrated over the stochastic processes) significantly complicates model fitting. Thus, non-standard computationally intensive procedures based on simulating the marginal likelihood have so far only been proposed. In this paper, we describe an efficient method of implementation by demonstrating how the high dimensional integrations involved in the marginal likelihood can be computed efficiently. Specifically, by using a property of the multivariate normal distribution and the standard marginal cumulative distribution function identity, we transform the marginal likelihood so that the high dimensional integrations are contained in the cumulative distribution function of a multivariate normal distribution, which can then be efficiently evaluated. Hence, maximum likelihood estimation can be used to obtain parameter estimates and asymptotic standard errors (from the observed information matrix) of model parameters. We describe our proposed efficient implementation procedure for the standard two-part model parameterisation and when it is of interest to directly model the overall marginal mean. The methodology is applied on a psoriatic arthritis data set concerning functional disability.
A Very High Order, Adaptable MESA Implementation for Aeroacoustic Computations
NASA Technical Reports Server (NTRS)
Dydson, Roger W.; Goodrich, John W.
2000-01-01
Since computational efficiency and wave resolution scale with accuracy, the ideal would be infinitely high accuracy for problems with widely varying wavelength scales. Currently, many of the computational aeroacoustics methods are limited to 4th order accurate Runge-Kutta methods in time which limits their resolution and efficiency. However, a new procedure for implementing the Modified Expansion Solution Approximation (MESA) schemes, based upon Hermitian divided differences, is presented which extends the effective accuracy of the MESA schemes to 57th order in space and time when using 128 bit floating point precision. This new approach has the advantages of reducing round-off error, being easy to program. and is more computationally efficient when compared to previous approaches. Its accuracy is limited only by the floating point hardware. The advantages of this new approach are demonstrated by solving the linearized Euler equations in an open bi-periodic domain. A 500th order MESA scheme can now be created in seconds, making these schemes ideally suited for the next generation of high performance 256-bit (double quadruple) or higher precision computers. This ease of creation makes it possible to adapt the algorithm to the mesh in time instead of its converse: this is ideal for resolving varying wavelength scales which occur in noise generation simulations. And finally, the sources of round-off error which effect the very high order methods are examined and remedies provided that effectively increase the accuracy of the MESA schemes while using current computer technology.
NASA Astrophysics Data System (ADS)
Ercan, İlke; Suyabatmaz, Enes
2018-06-01
The saturation in the efficiency and performance scaling of conventional electronic technologies brings about the development of novel computational paradigms. Brownian circuits are among the promising alternatives that can exploit fluctuations to increase the efficiency of information processing in nanocomputing. A Brownian cellular automaton, where signals propagate randomly and are driven by local transition rules, can be made computationally universal by embedding arbitrary asynchronous circuits on it. One of the potential realizations of such circuits is via single electron tunneling (SET) devices since SET technology enable simulation of noise and fluctuations in a fashion similar to Brownian search. In this paper, we perform a physical-information-theoretic analysis on the efficiency limitations in a Brownian NAND and half-adder circuits implemented using SET technology. The method we employed here establishes a solid ground that enables studying computational and physical features of this emerging technology on an equal footing, and yield fundamental lower bounds that provide valuable insights into how far its efficiency can be improved in principle. In order to provide a basis for comparison, we also analyze a NAND gate and half-adder circuit implemented in complementary metal oxide semiconductor technology to show how the fundamental bound of the Brownian circuit compares against a conventional paradigm.
The multifacet graphically contracted function method. I. Formulation and implementation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shepard, Ron; Brozell, Scott R.; Gidofalvi, Gergely
2014-08-14
The basic formulation for the multifacet generalization of the graphically contracted function (MFGCF) electronic structure method is presented. The analysis includes the discussion of linear dependency and redundancy of the arc factor parameters, the computation of reduced density matrices, Hamiltonian matrix construction, spin-density matrix construction, the computation of optimization gradients for single-state and state-averaged calculations, graphical wave function analysis, and the efficient computation of configuration state function and Slater determinant expansion coefficients. Timings are given for Hamiltonian matrix element and analytic optimization gradient computations for a range of model problems for full-CI Shavitt graphs, and it is observed that bothmore » the energy and the gradient computation scale as O(N{sup 2}n{sup 4}) for N electrons and n orbitals. The important arithmetic operations are within dense matrix-matrix product computational kernels, resulting in a computationally efficient procedure. An initial implementation of the method is used to present applications to several challenging chemical systems, including N{sub 2} dissociation, cubic H{sub 8} dissociation, the symmetric dissociation of H{sub 2}O, and the insertion of Be into H{sub 2}. The results are compared to the exact full-CI values and also to those of the previous single-facet GCF expansion form.« less
Fast Acceleration of 2D Wave Propagation Simulations Using Modern Computational Accelerators
Wang, Wei; Xu, Lifan; Cavazos, John; Huang, Howie H.; Kay, Matthew
2014-01-01
Recent developments in modern computational accelerators like Graphics Processing Units (GPUs) and coprocessors provide great opportunities for making scientific applications run faster than ever before. However, efficient parallelization of scientific code using new programming tools like CUDA requires a high level of expertise that is not available to many scientists. This, plus the fact that parallelized code is usually not portable to different architectures, creates major challenges for exploiting the full capabilities of modern computational accelerators. In this work, we sought to overcome these challenges by studying how to achieve both automated parallelization using OpenACC and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as Intel Many Integrated Core (MIC) coprocessor to reduce the run time of wave propagation simulations. We used a well-established 2D cardiac action potential model as a specific case-study. To the best of our knowledge, we are the first to study auto-parallelization of 2D cardiac wave propagation simulations using OpenACC. Our results identify several approaches that provide substantial speedups. The OpenACC-generated GPU code achieved more than speedup above the sequential implementation and required the addition of only a few OpenACC pragmas to the code. An OpenCL implementation provided speedups on GPUs of at least faster than the sequential implementation and faster than a parallelized OpenMP implementation. An implementation of OpenMP on Intel MIC coprocessor provided speedups of with only a few code changes to the sequential implementation. We highlight that OpenACC provides an automatic, efficient, and portable approach to achieve parallelization of 2D cardiac wave simulations on GPUs. Our approach of using OpenACC, OpenCL, and OpenMP to parallelize this particular model on modern computational accelerators should be applicable to other computational models of wave propagation in multi-dimensional media. PMID:24497950
The thermodynamic efficiency of computations made in cells across the range of life
NASA Astrophysics Data System (ADS)
Kempes, Christopher P.; Wolpert, David; Cohen, Zachary; Pérez-Mercader, Juan
2017-11-01
Biological organisms must perform computation as they grow, reproduce and evolve. Moreover, ever since Landauer's bound was proposed, it has been known that all computation has some thermodynamic cost-and that the same computation can be achieved with greater or smaller thermodynamic cost depending on how it is implemented. Accordingly an important issue concerning the evolution of life is assessing the thermodynamic efficiency of the computations performed by organisms. This issue is interesting both from the perspective of how close life has come to maximally efficient computation (presumably under the pressure of natural selection), and from the practical perspective of what efficiencies we might hope that engineered biological computers might achieve, especially in comparison with current computational systems. Here we show that the computational efficiency of translation, defined as free energy expended per amino acid operation, outperforms the best supercomputers by several orders of magnitude, and is only about an order of magnitude worse than the Landauer bound. However, this efficiency depends strongly on the size and architecture of the cell in question. In particular, we show that the useful efficiency of an amino acid operation, defined as the bulk energy per amino acid polymerization, decreases for increasing bacterial size and converges to the polymerization cost of the ribosome. This cost of the largest bacteria does not change in cells as we progress through the major evolutionary shifts to both single- and multicellular eukaryotes. However, the rates of total computation per unit mass are non-monotonic in bacteria with increasing cell size, and also change across different biological architectures, including the shift from unicellular to multicellular eukaryotes. This article is part of the themed issue 'Reconceptualizing the origins of life'.
A method for real-time implementation of HOG feature extraction
NASA Astrophysics Data System (ADS)
Luo, Hai-bo; Yu, Xin-rong; Liu, Hong-mei; Ding, Qing-hai
2011-08-01
Histogram of oriented gradient (HOG) is an efficient feature extraction scheme, and HOG descriptors are feature descriptors which is widely used in computer vision and image processing for the purpose of biometrics, target tracking, automatic target detection(ATD) and automatic target recognition(ATR) etc. However, computation of HOG feature extraction is unsuitable for hardware implementation since it includes complicated operations. In this paper, the optimal design method and theory frame for real-time HOG feature extraction based on FPGA were proposed. The main principle is as follows: firstly, the parallel gradient computing unit circuit based on parallel pipeline structure was designed. Secondly, the calculation of arctangent and square root operation was simplified. Finally, a histogram generator based on parallel pipeline structure was designed to calculate the histogram of each sub-region. Experimental results showed that the HOG extraction can be implemented in a pixel period by these computing units.
Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long
2016-04-01
Faster and more accurate methods for registration of images are important for research involved in conducting population-based studies that utilize medical imaging, as well as improvements for use in clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on GPU. The framework consists of a hierarchy of B-Spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD consists of the calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The use of the DMTC method enabled reduced computation and memory storage of variables with minimal communication between GPU and Central Processing Unit (CPU) due to ability to pre-compute values. The method was assessed on six healthy human subjects. Resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup are compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. Best performance speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration using a Nvidia Tesla K20X GPU. The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration time reduced runtime to 2.9min on the GPU version, compared to 12.8min on twelve-threaded CPU version and 112.5min on a single-threaded CPU. Furthermore, the GPU implementation discussed in this work can be adapted for use of other cost functions that require calculation of the first derivatives. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
3-D modeling of ductile tearing using finite elements: Computational aspects and techniques
NASA Astrophysics Data System (ADS)
Gullerud, Arne Stewart
This research focuses on the development and application of computational tools to perform large-scale, 3-D modeling of ductile tearing in engineering components under quasi-static to mild loading rates. Two standard models for ductile tearing---the computational cell methodology and crack growth controlled by the crack tip opening angle (CTOA)---are described and their 3-D implementations are explored. For the computational cell methodology, quantification of the effects of several numerical issues---computational load step size, procedures for force release after cell deletion, and the porosity for cell deletion---enables construction of computational algorithms to remove the dependence of predicted crack growth on these issues. This work also describes two extensions of the CTOA approach into 3-D: a general 3-D method and a constant front technique. Analyses compare the characteristics of the extensions, and a validation study explores the ability of the constant front extension to predict crack growth in thin aluminum test specimens over a range of specimen geometries, absolutes sizes, and levels of out-of-plane constraint. To provide a computational framework suitable for the solution of these problems, this work also describes the parallel implementation of a nonlinear, implicit finite element code. The implementation employs an explicit message-passing approach using the MPI standard to maintain portability, a domain decomposition of element data to provide parallel execution, and a master-worker organization of the computational processes to enhance future extensibility. A linear preconditioned conjugate gradient (LPCG) solver serves as the core of the solution process. The parallel LPCG solver utilizes an element-by-element (EBE) structure of the computations to permit a dual-level decomposition of the element data: domain decomposition of the mesh provides efficient coarse-grain parallel execution, while decomposition of the domains into blocks of similar elements (same type, constitutive model, etc.) provides fine-grain parallel computation on each processor. A major focus of the LPCG solver is a new implementation of the Hughes-Winget element-by-element (HW) preconditioner. The implementation employs a weighted dependency graph combined with a new coloring algorithm to provide load-balanced scheduling for the preconditioner and overlapped communication/computation. This approach enables efficient parallel application of the HW preconditioner for arbitrary unstructured meshes.
Dietz, Dennis C.
2014-01-01
A cogent method is presented for computing the expected cost of an appointment schedule where customers are statistically identical, the service time distribution has known mean and variance, and customer no-shows occur with time-dependent probability. The approach is computationally efficient and can be easily implemented to evaluate candidate schedules within a schedule optimization algorithm. PMID:24605070
Minho Won; Albalawi, Hassan; Xin Li; Thomas, Donald E
2014-01-01
This paper describes a low-power hardware implementation for movement decoding of brain computer interface. Our proposed hardware design is facilitated by two novel ideas: (i) an efficient feature extraction method based on reduced-resolution discrete cosine transform (DCT), and (ii) a new hardware architecture of dual look-up table to perform discrete cosine transform without explicit multiplication. The proposed hardware implementation has been validated for movement decoding of electrocorticography (ECoG) signal by using a Xilinx FPGA Zynq-7000 board. It achieves more than 56× energy reduction over a reference design using band-pass filters for feature extraction.
Convolutional networks for fast, energy-efficient neuromorphic computing
Esser, Steven K.; Merolla, Paul A.; Arthur, John V.; Cassidy, Andrew S.; Appuswamy, Rathinakumar; Andreopoulos, Alexander; Berg, David J.; McKinstry, Jeffrey L.; Melano, Timothy; Barch, Davis R.; di Nolfo, Carmelo; Datta, Pallab; Amir, Arnon; Taba, Brian; Flickner, Myron D.; Modha, Dharmendra S.
2016-01-01
Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that (i) approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, (ii) perform inference while preserving the hardware’s underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per Watt), and (iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer. PMID:27651489
Convolutional networks for fast, energy-efficient neuromorphic computing.
Esser, Steven K; Merolla, Paul A; Arthur, John V; Cassidy, Andrew S; Appuswamy, Rathinakumar; Andreopoulos, Alexander; Berg, David J; McKinstry, Jeffrey L; Melano, Timothy; Barch, Davis R; di Nolfo, Carmelo; Datta, Pallab; Amir, Arnon; Taba, Brian; Flickner, Myron D; Modha, Dharmendra S
2016-10-11
Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that (i) approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, (ii) perform inference while preserving the hardware's underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per Watt), and (iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.
Experimental magic state distillation for fault-tolerant quantum computing.
Souza, Alexandre M; Zhang, Jingfu; Ryan, Colm A; Laflamme, Raymond
2011-01-25
Any physical quantum device for quantum information processing (QIP) is subject to errors in implementation. In order to be reliable and efficient, quantum computers will need error-correcting or error-avoiding methods. Fault-tolerance achieved through quantum error correction will be an integral part of quantum computers. Of the many methods that have been discovered to implement it, a highly successful approach has been to use transversal gates and specific initial states. A critical element for its implementation is the availability of high-fidelity initial states, such as |0〉 and the 'magic state'. Here, we report an experiment, performed in a nuclear magnetic resonance (NMR) quantum processor, showing sufficient quantum control to improve the fidelity of imperfect initial magic states by distilling five of them into one with higher fidelity.
Komorkiewicz, Mateusz; Kryjak, Tomasz; Gorgon, Marek
2014-01-01
This article presents an efficient hardware implementation of the Horn-Schunck algorithm that can be used in an embedded optical flow sensor. An architecture is proposed, that realises the iterative Horn-Schunck algorithm in a pipelined manner. This modification allows to achieve data throughput of 175 MPixels/s and makes processing of Full HD video stream (1, 920 × 1, 080 @ 60 fps) possible. The structure of the optical flow module as well as pre- and post-filtering blocks and a flow reliability computation unit is described in details. Three versions of optical flow modules, with different numerical precision, working frequency and obtained results accuracy are proposed. The errors caused by switching from floating- to fixed-point computations are also evaluated. The described architecture was tested on popular sequences from an optical flow dataset of the Middlebury University. It achieves state-of-the-art results among hardware implementations of single scale methods. The designed fixed-point architecture achieves performance of 418 GOPS with power efficiency of 34 GOPS/W. The proposed floating-point module achieves 103 GFLOPS, with power efficiency of 24 GFLOPS/W. Moreover, a 100 times speedup compared to a modern CPU with SIMD support is reported. A complete, working vision system realized on Xilinx VC707 evaluation board is also presented. It is able to compute optical flow for Full HD video stream received from an HDMI camera in real-time. The obtained results prove that FPGA devices are an ideal platform for embedded vision systems. PMID:24526303
Komorkiewicz, Mateusz; Kryjak, Tomasz; Gorgon, Marek
2014-02-12
This article presents an efficient hardware implementation of the Horn-Schunck algorithm that can be used in an embedded optical flow sensor. An architecture is proposed, that realises the iterative Horn-Schunck algorithm in a pipelined manner. This modification allows to achieve data throughput of 175 MPixels/s and makes processing of Full HD video stream (1; 920 × 1; 080 @ 60 fps) possible. The structure of the optical flow module as well as pre- and post-filtering blocks and a flow reliability computation unit is described in details. Three versions of optical flow modules, with different numerical precision, working frequency and obtained results accuracy are proposed. The errors caused by switching from floating- to fixed-point computations are also evaluated. The described architecture was tested on popular sequences from an optical flow dataset of the Middlebury University. It achieves state-of-the-art results among hardware implementations of single scale methods. The designed fixed-point architecture achieves performance of 418 GOPS with power efficiency of 34 GOPS/W. The proposed floating-point module achieves 103 GFLOPS, with power efficiency of 24 GFLOPS/W. Moreover, a 100 times speedup compared to a modern CPU with SIMD support is reported. A complete, working vision system realized on Xilinx VC707 evaluation board is also presented. It is able to compute optical flow for Full HD video stream received from an HDMI camera in real-time. The obtained results prove that FPGA devices are an ideal platform for embedded vision systems.
Computational Approaches to Simulation and Optimization of Global Aircraft Trajectories
NASA Technical Reports Server (NTRS)
Ng, Hok Kwan; Sridhar, Banavar
2016-01-01
This study examines three possible approaches to improving the speed in generating wind-optimal routes for air traffic at the national or global level. They are: (a) using the resources of a supercomputer, (b) running the computations on multiple commercially available computers and (c) implementing those same algorithms into NASAs Future ATM Concepts Evaluation Tool (FACET) and compares those to a standard implementation run on a single CPU. Wind-optimal aircraft trajectories are computed using global air traffic schedules. The run time and wait time on the supercomputer for trajectory optimization using various numbers of CPUs ranging from 80 to 10,240 units are compared with the total computational time for running the same computation on a single desktop computer and on multiple commercially available computers for potential computational enhancement through parallel processing on the computer clusters. This study also re-implements the trajectory optimization algorithm for further reduction of computational time through algorithm modifications and integrates that with FACET to facilitate the use of the new features which calculate time-optimal routes between worldwide airport pairs in a wind field for use with existing FACET applications. The implementations of trajectory optimization algorithms use MATLAB, Python, and Java programming languages. The performance evaluations are done by comparing their computational efficiencies and based on the potential application of optimized trajectories. The paper shows that in the absence of special privileges on a supercomputer, a cluster of commercially available computers provides a feasible approach for national and global air traffic system studies.
NASA Astrophysics Data System (ADS)
Gaikwad, Akshay; Rehal, Diksha; Singh, Amandeep; Arvind, Dorai, Kavita
2018-02-01
We present the NMR implementation of a scheme for selective and efficient quantum process tomography without ancilla. We generalize this scheme such that it can be implemented efficiently using only a set of measurements involving product operators. The method allows us to estimate any element of the quantum process matrix to a desired precision, provided a set of quantum states can be prepared efficiently. Our modified technique requires fewer experimental resources as compared to the standard implementation of selective and efficient quantum process tomography, as it exploits the special nature of NMR measurements to allow us to compute specific elements of the process matrix by a restrictive set of subsystem measurements. To demonstrate the efficacy of our scheme, we experimentally tomograph the processes corresponding to "no operation," a controlled-NOT (CNOT), and a controlled-Hadamard gate on a two-qubit NMR quantum information processor, with high fidelities.
Variable cross-section windings for efficiency improvement of electric machines
NASA Astrophysics Data System (ADS)
Grachev, P. Yu; Bazarov, A. A.; Tabachinskiy, A. S.
2018-02-01
Implementation of energy-saving technologies in industry is impossible without efficiency improvement of electric machines. The article considers the ways of efficiency improvement and mass and dimensions reduction of electric machines with electronic control. Features of compact winding design for stators and armatures are described. Influence of compact winding on thermal and electrical process is given. Finite element method was used in computer simulation.
Compiler-assisted multiple instruction rollback recovery using a read buffer
NASA Technical Reports Server (NTRS)
Alewine, N. J.; Chen, S.-K.; Fuchs, W. K.; Hwu, W.-M.
1993-01-01
Multiple instruction rollback (MIR) is a technique that has been implemented in mainframe computers to provide rapid recovery from transient processor failures. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs have also been developed which remove rollback data hazards directly with data-flow transformations. This paper focuses on compiler-assisted techniques to achieve multiple instruction rollback recovery. We observe that some data hazards resulting from instruction rollback can be resolved efficiently by providing an operand read buffer while others are resolved more efficiently with compiler transformations. A compiler-assisted multiple instruction rollback scheme is developed which combines hardware-implemented data redundancy with compiler-driven hazard removal transformations. Experimental performance evaluations indicate improved efficiency over previous hardware-based and compiler-based schemes.
ERIC Educational Resources Information Center
Gezgin, Deniz Mertkan; Adnan, Muge; Acar Guvendir, Meltem
2018-01-01
Mobile learning has started to perform an increasingly significant role in improving learning outcomes in education. Successful and efficient implementation of m-learning in higher education, as with all educational levels, depends on users' acceptance of this technology. This study focuses on investigating the attitudes of undergraduate students…
A Database Training Module for Nassau Community College Staff and Faculty.
ERIC Educational Resources Information Center
Goodman, Harriett Ziskin
A training module developed following the Instructional System Design model was implemented at Nassau Community College (NCC) to teach its administration, faculty, and staff members computer skills that would enable them to use the available computer equipment more efficiently. Using this module, each trainee designed a file to be used for the…
Implementation of Computer Based Management Information Systems: A Behavioral Perspective.
ERIC Educational Resources Information Center
Lilly, Edward R.
In the past decade significant advances have taken place in the development of management information systems (MIS) to support managerial decision making. Recent literature has shown, however, that educators have yet to make full and efficient use of these computer-based systems. A number of authors have discussed factors that may affect…
Vecharynski, Eugene; Yang, Chao; Pask, John E.
2015-02-25
Here, we present an iterative algorithm for computing an invariant subspace associated with the algebraically smallest eigenvalues of a large sparse or structured Hermitian matrix A. We are interested in the case in which the dimension of the invariant subspace is large (e.g., over several hundreds or thousands) even though it may still be small relative to the dimension of A. These problems arise from, for example, density functional theory (DFT) based electronic structure calculations for complex materials. The key feature of our algorithm is that it performs fewer Rayleigh–Ritz calculations compared to existing algorithms such as the locally optimalmore » block preconditioned conjugate gradient or the Davidson algorithm. It is a block algorithm, and hence can take advantage of efficient BLAS3 operations and be implemented with multiple levels of concurrency. We discuss a number of practical issues that must be addressed in order to implement the algorithm efficiently on a high performance computer.« less
Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Liang, Ke; Hong, Yang
2017-10-01
The shuffled complex evolution optimization developed at the University of Arizona (SCE-UA) has been successfully applied in various kinds of scientific and engineering optimization applications, such as hydrological model parameter calibration, for many years. The algorithm possesses good global optimality, convergence stability and robustness. However, benchmark and real-world applications reveal the poor computational efficiency of the SCE-UA. This research aims at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA is implemented on Intel Xeon multi-core CPU (by using OpenMP and OpenCL) and NVIDIA Tesla many-core GPU (by using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA were tested based on the Griewank benchmark function. Comparison results indicate the parallel SCE-UA significantly improves computational efficiency compared to the original serial version. The OpenCL implementation obtains the best overall acceleration results however, with the most complex source code. The parallel SCE-UA has bright prospects to be applied in real-world applications.
Image detection and compression for memory efficient system analysis
NASA Astrophysics Data System (ADS)
Bayraktar, Mustafa
2015-02-01
The advances in digital signal processing have been progressing towards efficient use of memory and processing. Both of these factors can be utilized efficiently by using feasible techniques of image storage by computing the minimum information of image which will enhance computation in later processes. Scale Invariant Feature Transform (SIFT) can be utilized to estimate and retrieve of an image. In computer vision, SIFT can be implemented to recognize the image by comparing its key features from SIFT saved key point descriptors. The main advantage of SIFT is that it doesn't only remove the redundant information from an image but also reduces the key points by matching their orientation and adding them together in different windows of image [1]. Another key property of this approach is that it works on highly contrasted images more efficiently because it`s design is based on collecting key points from the contrast shades of image.
Efficient dynamic simulation for multiple chain robotic mechanisms
NASA Technical Reports Server (NTRS)
Lilly, Kathryn W.; Orin, David E.
1989-01-01
An efficient O(mN) algorithm for dynamic simulation of simple closed-chain robotic mechanisms is presented, where m is the number of chains, and N is the number of degrees of freedom for each chain. It is based on computation of the operational space inertia matrix (6 x 6) for each chain as seen by the body, load, or object. Also, computation of the chain dynamics, when opened at one end, is required, and the most efficient algorithm is used for this purpose. Parallel implementation of the dynamics for each chain results in an O(N) + O(log sub 2 m+1) algorithm.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
Evaluation of Proteus as a Tool for the Rapid Development of Models of Hydrologic Systems
NASA Astrophysics Data System (ADS)
Weigand, T. M.; Farthing, M. W.; Kees, C. E.; Miller, C. T.
2013-12-01
Models of modern hydrologic systems can be complex and involve a variety of operators with varying character. The goal is to implement approximations of such models that are both efficient for the developer and computationally efficient, which is a set of naturally competing objectives. Proteus is a Python-based toolbox that supports prototyping of model formulations as well as a wide variety of modern numerical methods and parallel computing. We used Proteus to develop numerical approximations for three models: Richards' equation, a brine flow model derived using the Thermodynamically Constrained Averaging Theory (TCAT), and a multiphase TCAT-based tumor growth model. For Richards' equation, we investigated discontinuous Galerkin solutions with higher order time integration based on the backward difference formulas. The TCAT brine flow model was implemented using Proteus and a variety of numerical methods were compared to hand coded solutions. Finally, an existing tumor growth model was implemented in Proteus to introduce more advanced numerics and allow the code to be run in parallel. From these three example models, Proteus was found to be an attractive open-source option for rapidly developing high quality code for solving existing and evolving computational science models.
Dynamic modeling of parallel robots for computed-torque control implementation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Codourey, A.
1998-12-01
In recent years, increased interest in parallel robots has been observed. Their control with modern theory, such as the computed-torque method, has, however, been restrained, essentially due to the difficulty in establishing a simple dynamic model that can be calculated in real time. In this paper, a simple method based on the virtual work principle is proposed for modeling parallel robots. The mass matrix of the robot, needed for decoupling control strategies, does not explicitly appear in the formulation; however, it can be computed separately, based on kinetic energy considerations. The method is applied to the DELTA parallel robot, leadingmore » to a very efficient model that has been implemented in a real-time computed-torque control algorithm.« less
BCM: toolkit for Bayesian analysis of Computational Models using samplers.
Thijssen, Bram; Dijkstra, Tjeerd M H; Heskes, Tom; Wessels, Lodewyk F A
2016-10-21
Computational models in biology are characterized by a large degree of uncertainty. This uncertainty can be analyzed with Bayesian statistics, however, the sampling algorithms that are frequently used for calculating Bayesian statistical estimates are computationally demanding, and each algorithm has unique advantages and disadvantages. It is typically unclear, before starting an analysis, which algorithm will perform well on a given computational model. We present BCM, a toolkit for the Bayesian analysis of Computational Models using samplers. It provides efficient, multithreaded implementations of eleven algorithms for sampling from posterior probability distributions and for calculating marginal likelihoods. BCM includes tools to simplify the process of model specification and scripts for visualizing the results. The flexible architecture allows it to be used on diverse types of biological computational models. In an example inference task using a model of the cell cycle based on ordinary differential equations, BCM is significantly more efficient than existing software packages, allowing more challenging inference problems to be solved. BCM represents an efficient one-stop-shop for computational modelers wishing to use sampler-based Bayesian statistics.
NASA Astrophysics Data System (ADS)
Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide
2015-09-01
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm.
Tian, Zhen; Peng, Fei; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B
2015-06-01
Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU's relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors' group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors' method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H&N) cancer case is then used to validate the authors' method. The authors also compare their multi-GPU implementation with three different single GPU implementation strategies, i.e., truncating DDC matrix (S1), repeatedly transferring DDC matrix between CPU and GPU (S2), and porting computations involving DDC matrix to CPU (S3), in terms of both plan quality and computational efficiency. Two more H&N patient cases and three prostate cases are used to demonstrate the advantages of the authors' method. The authors' multi-GPU implementation can finish the optimization process within ∼ 1 min for the H&N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23-46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be in an order of several minutes. The results demonstrate that the multi-GPU implementation of the authors' column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors' study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.
An efficient solver for large structured eigenvalue problems in relativistic quantum chemistry
NASA Astrophysics Data System (ADS)
Shiozaki, Toru
2017-01-01
We report an efficient program for computing the eigenvalues and symmetry-adapted eigenvectors of very large quaternionic (or Hermitian skew-Hamiltonian) matrices, using which structure-preserving diagonalisation of matrices of dimension N > 10, 000 is now routine on a single computer node. Such matrices appear frequently in relativistic quantum chemistry owing to the time-reversal symmetry. The implementation is based on a blocked version of the Paige-Van Loan algorithm, which allows us to use the Level 3 BLAS subroutines for most of the computations. Taking advantage of the symmetry, the program is faster by up to a factor of 2 than state-of-the-art implementations of complex Hermitian diagonalisation; diagonalising a 12, 800 × 12, 800 matrix took 42.8 (9.5) and 85.6 (12.6) minutes with 1 CPU core (16 CPU cores) using our symmetry-adapted solver and Intel Math Kernel Library's ZHEEV that is not structure-preserving, respectively. The source code is publicly available under the FreeBSD licence.
Łazarski, Roman; Burow, Asbjörn Manfred; Grajciar, Lukáš; Sierka, Marek
2016-10-30
A full implementation of analytical energy gradients for molecular and periodic systems is reported in the TURBOMOLE program package within the framework of Kohn-Sham density functional theory using Gaussian-type orbitals as basis functions. Its key component is a combination of density fitting (DF) approximation and continuous fast multipole method (CFMM) that allows for an efficient calculation of the Coulomb energy gradient. For exchange-correlation part the hierarchical numerical integration scheme (Burow and Sierka, Journal of Chemical Theory and Computation 2011, 7, 3097) is extended to energy gradients. Computational efficiency and asymptotic O(N) scaling behavior of the implementation is demonstrated for various molecular and periodic model systems, with the largest unit cell of hematite containing 640 atoms and 19,072 basis functions. The overall computational effort of energy gradient is comparable to that of the Kohn-Sham matrix formation. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
Discrete square root filtering - A survey of current techniques.
NASA Technical Reports Server (NTRS)
Kaminskii, P. G.; Bryson, A. E., Jr.; Schmidt, S. F.
1971-01-01
Current techniques in square root filtering are surveyed and related by applying a duality association. Four efficient square root implementations are suggested, and compared with three common conventional implementations in terms of computational complexity and precision. It is shown that the square root computational burden should not exceed the conventional by more than 50% in most practical problems. An examination of numerical conditioning predicts that the square root approach can yield twice the effective precision of the conventional filter in ill-conditioned problems. This prediction is verified in two examples.
NASA Astrophysics Data System (ADS)
Balac, Stéphane; Fernandez, Arnaud
2016-02-01
The computer program SPIP is aimed at solving the Generalized Non-Linear Schrödinger equation (GNLSE), involved in optics e.g. in the modelling of light-wave propagation in an optical fibre, by the Interaction Picture method, a new efficient alternative method to the Symmetric Split-Step method. In the SPIP program a dedicated costless adaptive step-size control based on the use of a 4th order embedded Runge-Kutta method is implemented in order to speed up the resolution.
An efficient and accurate 3D displacements tracking strategy for digital volume correlation
NASA Astrophysics Data System (ADS)
Pan, Bing; Wang, Bo; Wu, Dafang; Lubineau, Gilles
2014-07-01
Owing to its inherent computational complexity, practical implementation of digital volume correlation (DVC) for internal displacement and strain mapping faces important challenges in improving its computational efficiency. In this work, an efficient and accurate 3D displacement tracking strategy is proposed for fast DVC calculation. The efficiency advantage is achieved by using three improvements. First, to eliminate the need of updating Hessian matrix in each iteration, an efficient 3D inverse compositional Gauss-Newton (3D IC-GN) algorithm is introduced to replace existing forward additive algorithms for accurate sub-voxel displacement registration. Second, to ensure the 3D IC-GN algorithm that converges accurately and rapidly and avoid time-consuming integer-voxel displacement searching, a generalized reliability-guided displacement tracking strategy is designed to transfer accurate and complete initial guess of deformation for each calculation point from its computed neighbors. Third, to avoid the repeated computation of sub-voxel intensity interpolation coefficients, an interpolation coefficient lookup table is established for tricubic interpolation. The computational complexity of the proposed fast DVC and the existing typical DVC algorithms are first analyzed quantitatively according to necessary arithmetic operations. Then, numerical tests are performed to verify the performance of the fast DVC algorithm in terms of measurement accuracy and computational efficiency. The experimental results indicate that, compared with the existing DVC algorithm, the presented fast DVC algorithm produces similar precision and slightly higher accuracy at a substantially reduced computational cost.
Provable classically intractable sampling with measurement-based computation in constant time
NASA Astrophysics Data System (ADS)
Sanders, Stephen; Miller, Jacob; Miyake, Akimasa
We present a constant-time measurement-based quantum computation (MQC) protocol to perform a classically intractable sampling problem. We sample from the output probability distribution of a subclass of the instantaneous quantum polynomial time circuits introduced by Bremner, Montanaro and Shepherd. In contrast with the usual circuit model, our MQC implementation includes additional randomness due to byproduct operators associated with the computation. Despite this additional randomness we show that our sampling task cannot be efficiently simulated by a classical computer. We extend previous results to verify the quantum supremacy of our sampling protocol efficiently using only single-qubit Pauli measurements. Center for Quantum Information and Control, Department of Physics and Astronomy, University of New Mexico, Albuquerque, NM 87131, USA.
Parallel Computer System for 3D Visualization Stereo on GPU
NASA Astrophysics Data System (ADS)
Al-Oraiqat, Anas M.; Zori, Sergii A.
2018-03-01
This paper proposes the organization of a parallel computer system based on Graphic Processors Unit (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of tracing rays intersections with scene objects. The system allows significant increase in the productivity for the 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Archi-tecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.
Suplatov, Dmitry; Popova, Nina; Zhumatiy, Sergey; Voevodin, Vladimir; Švedas, Vytas
2016-04-01
Rapid expansion of online resources providing access to genomic, structural, and functional information associated with biological macromolecules opens an opportunity to gain a deeper understanding of the mechanisms of biological processes due to systematic analysis of large datasets. This, however, requires novel strategies to optimally utilize computer processing power. Some methods in bioinformatics and molecular modeling require extensive computational resources. Other algorithms have fast implementations which take at most several hours to analyze a common input on a modern desktop station, however, due to multiple invocations for a large number of subtasks the full task requires a significant computing power. Therefore, an efficient computational solution to large-scale biological problems requires both a wise parallel implementation of resource-hungry methods as well as a smart workflow to manage multiple invocations of relatively fast algorithms. In this work, a new computer software mpiWrapper has been developed to accommodate non-parallel implementations of scientific algorithms within the parallel supercomputing environment. The Message Passing Interface has been implemented to exchange information between nodes. Two specialized threads - one for task management and communication, and another for subtask execution - are invoked on each processing unit to avoid deadlock while using blocking calls to MPI. The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure. We show that this approach can be used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer. The C++ source code and documentation are available from http://biokinet.belozersky.msu.ru/mpiWrapper .
A cellular automata based FPGA realization of a new metaheuristic bat-inspired algorithm
NASA Astrophysics Data System (ADS)
Progias, Pavlos; Amanatiadis, Angelos A.; Spataro, William; Trunfio, Giuseppe A.; Sirakoulis, Georgios Ch.
2016-10-01
Optimization algorithms are often inspired by processes occuring in nature, such as animal behavioral patterns. The main concern with implementing such algorithms in software is the large amounts of processing power they require. In contrast to software code, that can only perform calculations in a serial manner, an implementation in hardware, exploiting the inherent parallelism of single-purpose processors, can prove to be much more efficient both in speed and energy consumption. Furthermore, the use of Cellular Automata (CA) in such an implementation would be efficient both as a model for natural processes, as well as a computational paradigm implemented well on hardware. In this paper, we propose a VHDL implementation of a metaheuristic algorithm inspired by the echolocation behavior of bats. More specifically, the CA model is inspired by the metaheuristic algorithm proposed earlier in the literature, which could be considered at least as efficient than other existing optimization algorithms. The function of the FPGA implementation of our algorithm is explained in full detail and results of our simulations are also demonstrated.
Manyscale Computing for Sensor Processing in Support of Space Situational Awareness
NASA Astrophysics Data System (ADS)
Schmalz, M.; Chapman, W.; Hayden, E.; Sahni, S.; Ranka, S.
2014-09-01
Increasing image and signal data burden associated with sensor data processing in support of space situational awareness implies continuing computational throughput growth beyond the petascale regime. In addition to growing applications data burden and diversity, the breadth, diversity and scalability of high performance computing architectures and their various organizations challenge the development of a single, unifying, practicable model of parallel computation. Therefore, models for scalable parallel processing have exploited architectural and structural idiosyncrasies, yielding potential misapplications when legacy programs are ported among such architectures. In response to this challenge, we have developed a concise, efficient computational paradigm and software called Manyscale Computing to facilitate efficient mapping of annotated application codes to heterogeneous parallel architectures. Our theory, algorithms, software, and experimental results support partitioning and scheduling of application codes for envisioned parallel architectures, in terms of work atoms that are mapped (for example) to threads or thread blocks on computational hardware. Because of the rigor, completeness, conciseness, and layered design of our manyscale approach, application-to-architecture mapping is feasible and scalable for architectures at petascales, exascales, and above. Further, our methodology is simple, relying primarily on a small set of primitive mapping operations and support routines that are readily implemented on modern parallel processors such as graphics processing units (GPUs) and hybrid multi-processors (HMPs). In this paper, we overview the opportunities and challenges of manyscale computing for image and signal processing in support of space situational awareness applications. We discuss applications in terms of a layered hardware architecture (laboratory > supercomputer > rack > processor > component hierarchy). Demonstration applications include performance analysis and results in terms of execution time as well as storage, power, and energy consumption for bus-connected and/or networked architectures. The feasibility of the manyscale paradigm is demonstrated by addressing four principal challenges: (1) architectural/structural diversity, parallelism, and locality, (2) masking of I/O and memory latencies, (3) scalability of design as well as implementation, and (4) efficient representation/expression of parallel applications. Examples will demonstrate how manyscale computing helps solve these challenges efficiently on real-world computing systems.
NASA Astrophysics Data System (ADS)
Wu, Yuanfeng; Gao, Lianru; Zhang, Bing; Zhao, Haina; Li, Jun
2014-01-01
We present a parallel implementation of the optimized maximum noise fraction (G-OMNF) transform algorithm for feature extraction of hyperspectral images on commodity graphics processing units (GPUs). The proposed approach explored the algorithm data-level concurrency and optimized the computing flow. We first defined a three-dimensional grid, in which each thread calculates a sub-block data to easily facilitate the spatial and spectral neighborhood data searches in noise estimation, which is one of the most important steps involved in OMNF. Then, we optimized the processing flow and computed the noise covariance matrix before computing the image covariance matrix to reduce the original hyperspectral image data transmission. These optimization strategies can greatly improve the computing efficiency and can be applied to other feature extraction algorithms. The proposed parallel feature extraction algorithm was implemented on an Nvidia Tesla GPU using the compute unified device architecture and basic linear algebra subroutines library. Through the experiments on several real hyperspectral images, our GPU parallel implementation provides a significant speedup of the algorithm compared with the CPU implementation, especially for highly data parallelizable and arithmetically intensive algorithm parts, such as noise estimation. In order to further evaluate the effectiveness of G-OMNF, we used two different applications: spectral unmixing and classification for evaluation. Considering the sensor scanning rate and the data acquisition time, the proposed parallel implementation met the on-board real-time feature extraction.
Sundareshan, Malur K; Bhattacharjee, Supratik; Inampudi, Radhika; Pang, Ho-Yuen
2002-12-10
Computational complexity is a major impediment to the real-time implementation of image restoration and superresolution algorithms in many applications. Although powerful restoration algorithms have been developed within the past few years utilizing sophisticated mathematical machinery (based on statistical optimization and convex set theory), these algorithms are typically iterative in nature and require a sufficient number of iterations to be executed to achieve the desired resolution improvement that may be needed to meaningfully perform postprocessing image exploitation tasks in practice. Additionally, recent technological breakthroughs have facilitated novel sensor designs (focal plane arrays, for instance) that make it possible to capture megapixel imagery data at video frame rates. A major challenge in the processing of these large-format images is to complete the execution of the image processing steps within the frame capture times and to keep up with the output rate of the sensor so that all data captured by the sensor can be efficiently utilized. Consequently, development of novel methods that facilitate real-time implementation of image restoration and superresolution algorithms is of significant practical interest and is the primary focus of this study. The key to designing computationally efficient processing schemes lies in strategically introducing appropriate preprocessing steps together with the superresolution iterations to tailor optimized overall processing sequences for imagery data of specific formats. For substantiating this assertion, three distinct methods for tailoring a preprocessing filter and integrating it with the superresolution processing steps are outlined. These methods consist of a region-of-interest extraction scheme, a background-detail separation procedure, and a scene-derived information extraction step for implementing a set-theoretic restoration of the image that is less demanding in computation compared with the superresolution iterations. A quantitative evaluation of the performance of these algorithms for restoring and superresolving various imagery data captured by diffraction-limited sensing operations are also presented.
Computation of Reacting Flows in Combustion Processes
NASA Technical Reports Server (NTRS)
Keith, Theo G., Jr.; Chen, K.-H.
2001-01-01
The objective of this research is to develop an efficient numerical algorithm with unstructured grids for the computation of three-dimensional chemical reacting flows that are known to occur in combustion components of propulsion systems. During the grant period (1996 to 1999), two companion codes have been developed and various numerical and physical models were implemented into the two codes.
NASA Technical Reports Server (NTRS)
Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.
1992-01-01
An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.
Automatic finite element generators
NASA Technical Reports Server (NTRS)
Wang, P. S.
1984-01-01
The design and implementation of a software system for generating finite elements and related computations are described. Exact symbolic computational techniques are employed to derive strain-displacement matrices and element stiffness matrices. Methods for dealing with the excessive growth of symbolic expressions are discussed. Automatic FORTRAN code generation is described with emphasis on improving the efficiency of the resultant code.
Martinez-Espronceda, Miguel; Martinez, Ignacio; Serrano, Luis; Led, Santiago; Trigo, Jesús Daniel; Marzo, Asier; Escayola, Javier; Garcia, José
2011-05-01
Traditionally, e-Health solutions were located at the point of care (PoC), while the new ubiquitous user-centered paradigm draws on standard-based personal health devices (PHDs). Such devices place strict constraints on computation and battery efficiency that encouraged the International Organization for Standardization/IEEE11073 (X73) standard for medical devices to evolve from X73PoC to X73PHD. In this context, low-voltage low-power (LV-LP) technologies meet the restrictions of X73PHD-compliant devices. Since X73PHD does not approach the software architecture, the accomplishment of an efficient design falls directly on the software developer. Therefore, computational and battery performance of such LV-LP-constrained devices can even be outperformed through an efficient X73PHD implementation design. In this context, this paper proposes a new methodology to implement X73PHD into microcontroller-based platforms with LV-LP constraints. Such implementation methodology has been developed through a patterns-based approach and applied to a number of X73PHD-compliant agents (including weighing scale, blood pressure monitor, and thermometer specializations) and microprocessor architectures (8, 16, and 32 bits) as a proof of concept. As a reference, the results obtained in the weighing scale guarantee all features of X73PHD running over a microcontroller architecture based on ARM7TDMI requiring only 168 B of RAM and 2546 B of flash memory.
Sun, Yuwen; Cheng, Allen C
2012-07-01
Artificial neural networks (ANNs) are a promising machine learning technique in classifying non-linear electrocardiogram (ECG) signals and recognizing abnormal patterns suggesting risks of cardiovascular diseases (CVDs). In this paper, we propose a new reusable neuron architecture (RNA) enabling a performance-efficient and cost-effective silicon implementation for ANN. The RNA architecture consists of a single layer of physical RNA neurons, each of which is designed to use minimal hardware resource (e.g., a single 2-input multiplier-accumulator is used to compute the dot product of two vectors). By carefully applying the principal of time sharing, RNA can multiplexs this single layer of physical neurons to efficiently execute both feed-forward and back-propagation computations of an ANN while conserving the area and reducing the power dissipation of the silicon. A three-layer 51-30-12 ANN is implemented in RNA to perform the ECG classification for CVD detection. This RNA hardware also allows on-chip automatic training update. A quantitative design space exploration in area, power dissipation, and execution speed between RNA and three other implementations representative of different reusable hardware strategies is presented and discussed. Compared with an equivalent software implementation in C executed on an embedded microprocessor, the RNA ASIC achieves three orders of magnitude improvements in both the execution speed and the energy efficiency. Copyright © 2012 Elsevier Ltd. All rights reserved.
Fault tolerant features and experiments of ANTS distributed real-time system
NASA Astrophysics Data System (ADS)
Dominic-Savio, Patrick; Lo, Jien-Chung; Tufts, Donald W.
1995-01-01
The ANTS project at the University of Rhode Island introduces the concept of Active Nodal Task Seeking (ANTS) as a way to efficiently design and implement dependable, high-performance, distributed computing. This paper presents the fault tolerant design features that have been incorporated in the ANTS experimental system implementation. The results of performance evaluations and fault injection experiments are reported. The fault-tolerant version of ANTS categorizes all computing nodes into three groups. They are: the up-and-running green group, the self-diagnosing yellow group and the failed red group. Each available computing node will be placed in the yellow group periodically for a routine diagnosis. In addition, for long-life missions, ANTS uses a monitoring scheme to identify faulty computing nodes. In this monitoring scheme, the communication pattern of each computing node is monitored by two other nodes.
Gate sequence for continuous variable one-way quantum computation
Su, Xiaolong; Hao, Shuhong; Deng, Xiaowei; Ma, Lingyu; Wang, Meihong; Jia, Xiaojun; Xie, Changde; Peng, Kunchi
2013-01-01
Measurement-based one-way quantum computation using cluster states as resources provides an efficient model to perform computation and information processing of quantum codes. Arbitrary Gaussian quantum computation can be implemented sufficiently by long single-mode and two-mode gate sequences. However, continuous variable gate sequences have not been realized so far due to an absence of cluster states larger than four submodes. Here we present the first continuous variable gate sequence consisting of a single-mode squeezing gate and a two-mode controlled-phase gate based on a six-mode cluster state. The quantum property of this gate sequence is confirmed by the fidelities and the quantum entanglement of two output modes, which depend on both the squeezing and controlled-phase gates. The experiment demonstrates the feasibility of implementing Gaussian quantum computation by means of accessible gate sequences.
Aprà, E; Kowalski, K
2016-03-08
In this paper we discuss the implementation of multireference coupled-cluster formalism with singles, doubles, and noniterative triples (MRCCSD(T)), which is capable of taking advantage of the processing power of the Intel Xeon Phi coprocessor. We discuss the integration of two levels of parallelism underlying the MRCCSD(T) implementation with computational kernels designed to offload the computationally intensive parts of the MRCCSD(T) formalism to Intel Xeon Phi coprocessors. Special attention is given to the enhancement of the parallel performance by task reordering that has improved load balancing in the noniterative part of the MRCCSD(T) calculations. We also discuss aspects regarding efficient optimization and vectorization strategies.
A C++11 implementation of arbitrary-rank tensors for high-performance computing
NASA Astrophysics Data System (ADS)
Aragón, Alejandro M.
2014-06-01
This article discusses an efficient implementation of tensors of arbitrary rank by using some of the idioms introduced by the recently published C++ ISO Standard (C++11). With the aims at providing a basic building block for high-performance computing, a single Array class template is carefully crafted, from which vectors, matrices, and even higher-order tensors can be created. An expression template facility is also built around the array class template to provide convenient mathematical syntax. As a result, by using templates, an extra high-level layer is added to the C++ language when dealing with algebraic objects and their operations, without compromising performance. The implementation is tested running on both CPU and GPU.
A C++11 implementation of arbitrary-rank tensors for high-performance computing
NASA Astrophysics Data System (ADS)
Aragón, Alejandro M.
2014-11-01
This article discusses an efficient implementation of tensors of arbitrary rank by using some of the idioms introduced by the recently published C++ ISO Standard (C++11). With the aims at providing a basic building block for high-performance computing, a single Array class template is carefully crafted, from which vectors, matrices, and even higher-order tensors can be created. An expression template facility is also built around the array class template to provide convenient mathematical syntax. As a result, by using templates, an extra high-level layer is added to the C++ language when dealing with algebraic objects and their operations, without compromising performance. The implementation is tested running on both CPU and GPU.
Tempest - Efficient Computation of Atmospheric Flows Using High-Order Local Discretization Methods
NASA Astrophysics Data System (ADS)
Ullrich, P. A.; Guerra, J. E.
2014-12-01
The Tempest Framework composes several compact numerical methods to easily facilitate intercomparison of atmospheric flow calculations on the sphere and in rectangular domains. This framework includes the implementations of Spectral Elements, Discontinuous Galerkin, Flux Reconstruction, and Hybrid Finite Element methods with the goal of achieving optimal accuracy in the solution of atmospheric problems. Several advantages of this approach are discussed such as: improved pressure gradient calculation, numerical stability by vertical/horizontal splitting, arbitrary order of accuracy, etc. The local numerical discretization allows for high performance parallel computation and efficient inclusion of parameterizations. These techniques are used in conjunction with a non-conformal, locally refined, cubed-sphere grid for global simulations and standard Cartesian grids for simulations at the mesoscale. A complete implementation of the methods described is demonstrated in a non-hydrostatic setting.
Sensor network based vehicle classification and license plate identification system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Frigo, Janette Rose; Brennan, Sean M; Rosten, Edward J
Typically, for energy efficiency and scalability purposes, sensor networks have been used in the context of environmental and traffic monitoring applications in which operations at the sensor level are not computationally intensive. But increasingly, sensor network applications require data and compute intensive sensors such video cameras and microphones. In this paper, we describe the design and implementation of two such systems: a vehicle classifier based on acoustic signals and a license plate identification system using a camera. The systems are implemented in an energy-efficient manner to the extent possible using commercially available hardware, the Mica motes and the Stargate platform.more » Our experience in designing these systems leads us to consider an alternate more flexible, modular, low-power mote architecture that uses a combination of FPGAs, specialized embedded processing units and sensor data acquisition systems.« less
Performance advantages of CPML over UPML absorbing boundary conditions in FDTD algorithm
NASA Astrophysics Data System (ADS)
Gvozdic, Branko D.; Djurdjevic, Dusan Z.
2017-01-01
Implementation of absorbing boundary condition (ABC) has a very important role in simulation performance and accuracy in finite difference time domain (FDTD) method. The perfectly matched layer (PML) is the most efficient type of ABC. The aim of this paper is to give detailed insight in and discussion of boundary conditions and hence to simplify the choice of PML used for termination of computational domain in FDTD method. In particular, we demonstrate that using the convolutional PML (CPML) has significant advantages in terms of implementation in FDTD method and reducing computer resources than using uniaxial PML (UPML). An extensive number of numerical experiments has been performed and results have shown that CPML is more efficient in electromagnetic waves absorption. Numerical code is prepared, several problems are analyzed and relative error is calculated and presented.
NASA Astrophysics Data System (ADS)
Kelly, Jamie S.; Bowman, Hiroshi C.; Rao, Vittal S.; Pottinger, Hardy J.
1997-06-01
Implementation issues represent an unfamiliar challenge to most control engineers, and many techniques for controller design ignore these issues outright. Consequently, the design of controllers for smart structural systems usually proceeds without regard for their eventual implementation, thus resulting either in serious performance degradation or in hardware requirements that squander power, complicate integration, and drive up cost. The level of integration assumed by the Smart Patch further exacerbates these difficulties, and any design inefficiency may render the realization of a single-package sensor-controller-actuator system infeasible. The goal of this research is to automate the controller implementation process and to relieve the design engineer of implementation concerns like quantization, computational efficiency, and device selection. We specifically target Field Programmable Gate Arrays (FPGAs) as our hardware platform because these devices are highly flexible, power efficient, and reprogrammable. The current study develops an automated implementation sequence that minimizes hardware requirements while maintaining controller performance. Beginning with a state space representation of the controller, the sequence automatically generates a configuration bitstream for a suitable FPGA implementation. MATLAB functions optimize and simulate the control algorithm before translating it into the VHSIC hardware description language. These functions improve power efficiency and simplify integration in the final implementation by performing a linear transformation that renders the controller computationally friendly. The transformation favors sparse matrices in order to reduce multiply operations and the hardware necessary to support them; simultaneously, the remaining matrix elements take on values that minimize limit cycles and parameter sensitivity. The proposed controller design methodology is implemented on a simple cantilever beam test structure using FPGA hardware. The experimental closed loop response is compared with that of an automated FPGA controller implementation. Finally, we explore the integration of FPGA based controllers into a multi-chip module, which we believe represents the next step towards the realization of the Smart Patch.
Convolution of large 3D images on GPU and its decomposition
NASA Astrophysics Data System (ADS)
Karas, Pavel; Svoboda, David
2011-12-01
In this article, we propose a method for computing convolution of large 3D images. The convolution is performed in a frequency domain using a convolution theorem. The algorithm is accelerated on a graphic card by means of the CUDA parallel computing model. Convolution is decomposed in a frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption and also in terms of memory transfers between CPU and GPU which have a significant inuence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.
NASA Technical Reports Server (NTRS)
Smith, Terence R.; Menon, Sudhakar; Star, Jeffrey L.; Estes, John E.
1987-01-01
This paper provides a brief survey of the history, structure and functions of 'traditional' geographic information systems (GIS), and then suggests a set of requirements that large-scale GIS should satisfy, together with a set of principles for their satisfaction. These principles, which include the systematic application of techniques from several subfields of computer science to the design and implementation of GIS and the integration of techniques from computer vision and image processing into standard GIS technology, are discussed in some detail. In particular, the paper provides a detailed discussion of questions relating to appropriate data models, data structures and computational procedures for the efficient storage, retrieval and analysis of spatially-indexed data.
Masuda, Y; Misztal, I; Legarra, A; Tsuruta, S; Lourenco, D A L; Fragomeni, B O; Aguilar, I
2017-01-01
This paper evaluates an efficient implementation to multiply the inverse of a numerator relationship matrix for genotyped animals () by a vector (). The computation is required for solving mixed model equations in single-step genomic BLUP (ssGBLUP) with the preconditioned conjugate gradient (PCG). The inverse can be decomposed into sparse matrices that are blocks of the sparse inverse of a numerator relationship matrix () including genotyped animals and their ancestors. The elements of were rapidly calculated with the Henderson's rule and stored as sparse matrices in memory. Implementation of was by a series of sparse matrix-vector multiplications. Diagonal elements of , which were required as preconditioners in PCG, were approximated with a Monte Carlo method using 1,000 samples. The efficient implementation of was compared with explicit inversion of with 3 data sets including about 15,000, 81,000, and 570,000 genotyped animals selected from populations with 213,000, 8.2 million, and 10.7 million pedigree animals, respectively. The explicit inversion required 1.8 GB, 49 GB, and 2,415 GB (estimated) of memory, respectively, and 42 s, 56 min, and 13.5 d (estimated), respectively, for the computations. The efficient implementation required <1 MB, 2.9 GB, and 2.3 GB of memory, respectively, and <1 sec, 3 min, and 5 min, respectively, for setting up. Only <1 sec was required for the multiplication in each PCG iteration for any data sets. When the equations in ssGBLUP are solved with the PCG algorithm, is no longer a limiting factor in the computations.
NASA Astrophysics Data System (ADS)
Schoups, G.; Vrugt, J. A.; Fenicia, F.; van de Giesen, N. C.
2010-10-01
Conceptual rainfall-runoff models have traditionally been applied without paying much attention to numerical errors induced by temporal integration of water balance dynamics. Reliance on first-order, explicit, fixed-step integration methods leads to computationally cheap simulation models that are easy to implement. Computational speed is especially desirable for estimating parameter and predictive uncertainty using Markov chain Monte Carlo (MCMC) methods. Confirming earlier work of Kavetski et al. (2003), we show here that the computational speed of first-order, explicit, fixed-step integration methods comes at a cost: for a case study with a spatially lumped conceptual rainfall-runoff model, it introduces artificial bimodality in the marginal posterior parameter distributions, which is not present in numerically accurate implementations of the same model. The resulting effects on MCMC simulation include (1) inconsistent estimates of posterior parameter and predictive distributions, (2) poor performance and slow convergence of the MCMC algorithm, and (3) unreliable convergence diagnosis using the Gelman-Rubin statistic. We studied several alternative numerical implementations to remedy these problems, including various adaptive-step finite difference schemes and an operator splitting method. Our results show that adaptive-step, second-order methods, based on either explicit finite differencing or operator splitting with analytical integration, provide the best alternative for accurate and efficient MCMC simulation. Fixed-step or adaptive-step implicit methods may also be used for increased accuracy, but they cannot match the efficiency of adaptive-step explicit finite differencing or operator splitting. Of the latter two, explicit finite differencing is more generally applicable and is preferred if the individual hydrologic flux laws cannot be integrated analytically, as the splitting method then loses its advantage.
Sarpeshkar, R
2014-03-28
We analyse the pros and cons of analog versus digital computation in living cells. Our analysis is based on fundamental laws of noise in gene and protein expression, which set limits on the energy, time, space, molecular count and part-count resources needed to compute at a given level of precision. We conclude that analog computation is significantly more efficient in its use of resources than deterministic digital computation even at relatively high levels of precision in the cell. Based on this analysis, we conclude that synthetic biology must use analog, collective analog, probabilistic and hybrid analog-digital computational approaches; otherwise, even relatively simple synthetic computations in cells such as addition will exceed energy and molecular-count budgets. We present schematics for efficiently representing analog DNA-protein computation in cells. Analog electronic flow in subthreshold transistors and analog molecular flux in chemical reactions obey Boltzmann exponential laws of thermodynamics and are described by astoundingly similar logarithmic electrochemical potentials. Therefore, cytomorphic circuits can help to map circuit designs between electronic and biochemical domains. We review recent work that uses positive-feedback linearization circuits to architect wide-dynamic-range logarithmic analog computation in Escherichia coli using three transcription factors, nearly two orders of magnitude more efficient in parts than prior digital implementations.
Sarpeshkar, R.
2014-01-01
We analyse the pros and cons of analog versus digital computation in living cells. Our analysis is based on fundamental laws of noise in gene and protein expression, which set limits on the energy, time, space, molecular count and part-count resources needed to compute at a given level of precision. We conclude that analog computation is significantly more efficient in its use of resources than deterministic digital computation even at relatively high levels of precision in the cell. Based on this analysis, we conclude that synthetic biology must use analog, collective analog, probabilistic and hybrid analog–digital computational approaches; otherwise, even relatively simple synthetic computations in cells such as addition will exceed energy and molecular-count budgets. We present schematics for efficiently representing analog DNA–protein computation in cells. Analog electronic flow in subthreshold transistors and analog molecular flux in chemical reactions obey Boltzmann exponential laws of thermodynamics and are described by astoundingly similar logarithmic electrochemical potentials. Therefore, cytomorphic circuits can help to map circuit designs between electronic and biochemical domains. We review recent work that uses positive-feedback linearization circuits to architect wide-dynamic-range logarithmic analog computation in Escherichia coli using three transcription factors, nearly two orders of magnitude more efficient in parts than prior digital implementations. PMID:24567476
IMPROVING TACONITE PROCESSING PLANT EFFICIENCY BY COMPUTER SIMULATION, Final Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
William M. Bond; Salih Ersayin
2007-03-30
This project involved industrial scale testing of a mineral processing simulator to improve the efficiency of a taconite processing plant, namely the Minorca mine. The Concentrator Modeling Center at the Coleraine Minerals Research Laboratory, University of Minnesota Duluth, enhanced the capabilities of available software, Usim Pac, by developing mathematical models needed for accurate simulation of taconite plants. This project provided funding for this technology to prove itself in the industrial environment. As the first step, data representing existing plant conditions were collected by sampling and sample analysis. Data were then balanced and provided a basis for assessing the efficiency ofmore » individual devices and the plant, and also for performing simulations aimed at improving plant efficiency. Performance evaluation served as a guide in developing alternative process strategies for more efficient production. A large number of computer simulations were then performed to quantify the benefits and effects of implementing these alternative schemes. Modification of makeup ball size was selected as the most feasible option for the target performance improvement. This was combined with replacement of existing hydrocyclones with more efficient ones. After plant implementation of these modifications, plant sampling surveys were carried out to validate findings of the simulation-based study. Plant data showed very good agreement with the simulated data, confirming results of simulation. After the implementation of modifications in the plant, several upstream bottlenecks became visible. Despite these bottlenecks limiting full capacity, concentrator energy improvement of 7% was obtained. Further improvements in energy efficiency are expected in the near future. The success of this project demonstrated the feasibility of a simulation-based approach. Currently, the Center provides simulation-based service to all the iron ore mining companies operating in northern Minnesota, and future proposals are pending with non-taconite mineral processing applications.« less
Efficient Decoding With Steady-State Kalman Filter in Neural Interface Systems
Malik, Wasim Q.; Truccolo, Wilson; Brown, Emery N.; Hochberg, Leigh R.
2011-01-01
The Kalman filter is commonly used in neural interface systems to decode neural activity and estimate the desired movement kinematics. We analyze a low-complexity Kalman filter implementation in which the filter gain is approximated by its steady-state form, computed offline before real-time decoding commences. We evaluate its performance using human motor cortical spike train data obtained from an intracortical recording array as part of an ongoing pilot clinical trial. We demonstrate that the standard Kalman filter gain converges to within 95% of the steady-state filter gain in 1.5 ± 0.5 s (mean ± s.d.). The difference in the intended movement velocity decoded by the two filters vanishes within 5 s, with a correlation coefficient of 0.99 between the two decoded velocities over the session length. We also find that the steady-state Kalman filter reduces the computational load (algorithm execution time) for decoding the firing rates of 25 ± 3 single units by a factor of 7.0 ± 0.9. We expect that the gain in computational efficiency will be much higher in systems with larger neural ensembles. The steady-state filter can thus provide substantial runtime efficiency at little cost in terms of estimation accuracy. This far more efficient neural decoding approach will facilitate the practical implementation of future large-dimensional, multisignal neural interface systems. PMID:21078582
Protein alignment algorithms with an efficient backtracking routine on multiple GPUs.
Blazewicz, Jacek; Frohmberg, Wojciech; Kierzynka, Michal; Pesch, Erwin; Wojciechowski, Pawel
2011-05-20
Pairwise sequence alignment methods are widely used in biological research. The increasing number of sequences is perceived as one of the upcoming challenges for sequence alignment methods in the nearest future. To overcome this challenge several GPU (Graphics Processing Unit) computing approaches have been proposed lately. These solutions show a great potential of a GPU platform but in most cases address the problem of sequence database scanning and computing only the alignment score whereas the alignment itself is omitted. Thus, the need arose to implement the global and semiglobal Needleman-Wunsch, and Smith-Waterman algorithms with a backtracking procedure which is needed to construct the alignment. In this paper we present the solution that performs the alignment of every given sequence pair, which is a required step for progressive multiple sequence alignment methods, as well as for DNA recognition at the DNA assembly stage. Performed tests show that the implementation, with performance up to 6.3 GCUPS on a single GPU for affine gap penalties, is very efficient in comparison to other CPU and GPU-based solutions. Moreover, multiple GPUs support with load balancing makes the application very scalable. The article shows that the backtracking procedure of the sequence alignment algorithms may be designed to fit in with the GPU architecture. Therefore, our algorithm, apart from scores, is able to compute pairwise alignments. This opens a wide range of new possibilities, allowing other methods from the area of molecular biology to take advantage of the new computational architecture. Performed tests show that the efficiency of the implementation is excellent. Moreover, the speed of our GPU-based algorithms can be almost linearly increased when using more than one graphics card.
NASA Astrophysics Data System (ADS)
Vodenicarevic, D.; Locatelli, N.; Mizrahi, A.; Friedman, J. S.; Vincent, A. F.; Romera, M.; Fukushima, A.; Yakushiji, K.; Kubota, H.; Yuasa, S.; Tiwari, S.; Grollier, J.; Querlioz, D.
2017-11-01
Low-energy random number generation is critical for many emerging computing schemes proposed to complement or replace von Neumann architectures. However, current random number generators are always associated with an energy cost that is prohibitive for these computing schemes. We introduce random number bit generation based on specific nanodevices: superparamagnetic tunnel junctions. We experimentally demonstrate high-quality random bit generation that represents an orders-of-magnitude improvement in energy efficiency over current solutions. We show that the random generation speed improves with nanodevice scaling, and we investigate the impact of temperature, magnetic field, and cross talk. Finally, we show how alternative computing schemes can be implemented using superparamagentic tunnel junctions as random number generators. These results open the way for fabricating efficient hardware computing devices leveraging stochasticity, and they highlight an alternative use for emerging nanodevices.
Finding a roadmap to achieve large neuromorphic hardware systems
Hasler, Jennifer; Marr, Bo
2013-01-01
Neuromorphic systems are gaining increasing importance in an era where CMOS digital computing techniques are reaching physical limits. These silicon systems mimic extremely energy efficient neural computing structures, potentially both for solving engineering applications as well as understanding neural computation. Toward this end, the authors provide a glimpse at what the technology evolution roadmap looks like for these systems so that Neuromorphic engineers may gain the same benefit of anticipation and foresight that IC designers gained from Moore's law many years ago. Scaling of energy efficiency, performance, and size will be discussed as well as how the implementation and application space of Neuromorphic systems are expected to evolve over time. PMID:24058330
Bajaj, Chandrajit; Chen, Shun-Chuan; Rand, Alexander
2011-01-01
In order to compute polarization energy of biomolecules, we describe a boundary element approach to solving the linearized Poisson-Boltzmann equation. Our approach combines several important features including the derivative boundary formulation of the problem and a smooth approximation of the molecular surface based on the algebraic spline molecular surface. State of the art software for numerical linear algebra and the kernel independent fast multipole method is used for both simplicity and efficiency of our implementation. We perform a variety of computational experiments, testing our method on a number of actual proteins involved in molecular docking and demonstrating the effectiveness of our solver for computing molecular polarization energy. PMID:21660123
Efficient ICCG on a shared memory multiprocessor
NASA Technical Reports Server (NTRS)
Hammond, Steven W.; Schreiber, Robert
1989-01-01
Different approaches are discussed for exploiting parallelism in the ICCG (Incomplete Cholesky Conjugate Gradient) method for solving large sparse symmetric positive definite systems of equations on a shared memory parallel computer. Techniques for efficiently solving triangular systems and computing sparse matrix-vector products are explored. Three methods for scheduling the tasks in solving triangular systems are implemented on the Sequent Balance 21000. Sample problems that are representative of a large class of problems solved using iterative methods are used. We show that a static analysis to determine data dependences in the triangular solve can greatly improve its parallel efficiency. We also show that ignoring symmetry and storing the whole matrix can reduce solution time substantially.
Inversion Of Jacobian Matrix For Robot Manipulators
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Report discusses inversion of Jacobian matrix for class of six-degree-of-freedom arms with spherical wrist, i.e., with last three joints intersecting. Shows by taking advantage of simple geometry of such arms, closed-form solution of Q=J-1X, which represents linear transformation from task space to joint space, obtained efficiently. Presents solutions for PUMA arm, JPL/Stanford arm, and six-revolute-joint coplanar arm along with all singular points. Main contribution of paper shows simple geometry of this type of arms exploited in performing inverse transformation without any need to compute Jacobian or its inverse explicitly. Implication of this computational efficiency advanced task-space control schemes for spherical-wrist arms implemented more efficiently.
NASA Astrophysics Data System (ADS)
Dettmer, J.; Quijano, J. E.; Dosso, S. E.; Holland, C. W.; Mandolesi, E.
2016-12-01
Geophysical seabed properties are important for the detection and classification of unexploded ordnance. However, current surveying methods such as vertical seismic profiling, coring, or inversion are of limited use when surveying large areas with high spatial sampling density. We consider surveys based on a source and receiver array towed by an autonomous vehicle which produce large volumes of seabed reflectivity data that contain unprecedented and detailed seabed information. The data are analyzed with a particle filter, which requires efficient reflection-coefficient computation, efficient inversion algorithms and efficient use of computer resources. The filter quantifies information content of multiple sequential data sets by considering results from previous data along the survey track to inform the importance sampling at the current point. Challenges arise from environmental changes along the track where the number of sediment layers and their properties change. This is addressed by a trans-dimensional model in the filter which allows layering complexity to change along a track. Efficiency is improved by likelihood tempering of various particle subsets and including exchange moves (parallel tempering). The filter is implemented on a hybrid computer that combines central processing units (CPUs) and graphics processing units (GPUs) to exploit three levels of parallelism: (1) fine-grained parallel computation of spherical reflection coefficients with a GPU implementation of Levin integration; (2) updating particles by concurrent CPU processes which exchange information using automatic load balancing (coarse grained parallelism); (3) overlapping CPU-GPU communication (a major bottleneck) with GPU computation by staggering CPU access to the multiple GPUs. The algorithm is applied to spherical reflection coefficients for data sets along a 14-km track on the Malta Plateau, Mediterranean Sea. We demonstrate substantial efficiency gains over previous methods. [This research was supported in part by the U.S. Dept of Defense, thought the Strategic Environmental Research and Development Program (SERDP).
DANoC: An Efficient Algorithm and Hardware Codesign of Deep Neural Networks on Chip.
Zhou, Xichuan; Li, Shengli; Tang, Fang; Hu, Shengdong; Lin, Zhi; Zhang, Lei
2017-07-18
Deep neural networks (NNs) are the state-of-the-art models for understanding the content of images and videos. However, implementing deep NNs in embedded systems is a challenging task, e.g., a typical deep belief network could exhaust gigabytes of memory and result in bandwidth and computational bottlenecks. To address this challenge, this paper presents an algorithm and hardware codesign for efficient deep neural computation. A hardware-oriented deep learning algorithm, named the deep adaptive network, is proposed to explore the sparsity of neural connections. By adaptively removing the majority of neural connections and robustly representing the reserved connections using binary integers, the proposed algorithm could save up to 99.9% memory utility and computational resources without undermining classification accuracy. An efficient sparse-mapping-memory-based hardware architecture is proposed to fully take advantage of the algorithmic optimization. Different from traditional Von Neumann architecture, the deep-adaptive network on chip (DANoC) brings communication and computation in close proximity to avoid power-hungry parameter transfers between on-board memory and on-chip computational units. Experiments over different image classification benchmarks show that the DANoC system achieves competitively high accuracy and efficiency comparing with the state-of-the-art approaches.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hannon, Kevin P.; Li, Chenyang; Evangelista, Francesco A., E-mail: francesco.evangelista@emory.edu
2016-05-28
We report an efficient implementation of a second-order multireference perturbation theory based on the driven similarity renormalization group (DSRG-MRPT2) [C. Li and F. A. Evangelista, J. Chem. Theory Comput. 11, 2097 (2015)]. Our implementation employs factorized two-electron integrals to avoid storage of large four-index intermediates. It also exploits the block structure of the reference density matrices to reduce the computational cost to that of second-order Møller–Plesset perturbation theory. Our new DSRG-MRPT2 implementation is benchmarked on ten naphthyne isomers using basis sets up to quintuple-ζ quality. We find that the singlet-triplet splittings (Δ{sub ST}) of the naphthyne isomers strongly depend onmore » the equilibrium structures. For a consistent set of geometries, the Δ{sub ST} values predicted by the DSRG-MRPT2 are in good agreements with those computed by the reduced multireference coupled cluster theory with singles, doubles, and perturbative triples.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zachary M. Prince; Jean C. Ragusa; Yaqi Wang
Because of the recent interest in reactor transient modeling and the restart of the Transient Reactor (TREAT) Facility, there has been a need for more efficient, robust methods in computation frameworks. This is the impetus of implementing the Improved Quasi-Static method (IQS) in the RATTLESNAKE/MOOSE framework. IQS has implemented with CFEM diffusion by factorizing flux into time-dependent amplitude and spacial- and weakly time-dependent shape. The shape evaluation is very similar to a flux diffusion solve and is computed at large (macro) time steps. While the amplitude evaluation is a PRKE solve where the parameters are dependent on the shape andmore » is computed at small (micro) time steps. IQS has been tested with a custom one-dimensional example and the TWIGL ramp benchmark. These examples prove it to be a viable and effective method for highly transient cases. More complex cases are intended to be applied to further test the method and its implementation.« less
Chessa, Manuela; Bianchi, Valentina; Zampetti, Massimo; Sabatini, Silvio P; Solari, Fabio
2012-01-01
The intrinsic parallelism of visual neural architectures based on distributed hierarchical layers is well suited to be implemented on the multi-core architectures of modern graphics cards. The design strategies that allow us to optimally take advantage of such parallelism, in order to efficiently map on GPU the hierarchy of layers and the canonical neural computations, are proposed. Specifically, the advantages of a cortical map-like representation of the data are exploited. Moreover, a GPU implementation of a novel neural architecture for the computation of binocular disparity from stereo image pairs, based on populations of binocular energy neurons, is presented. The implemented neural model achieves good performances in terms of reliability of the disparity estimates and a near real-time execution speed, thus demonstrating the effectiveness of the devised design strategies. The proposed approach is valid in general, since the neural building blocks we implemented are a common basis for the modeling of visual neural functionalities.
Edge-Based Efficient Search over Encrypted Data Mobile Cloud Storage
Liu, Fang; Cai, Zhiping; Xiao, Nong; Zhao, Ziming
2018-01-01
Smart sensor-equipped mobile devices sense, collect, and process data generated by the edge network to achieve intelligent control, but such mobile devices usually have limited storage and computing resources. Mobile cloud storage provides a promising solution owing to its rich storage resources, great accessibility, and low cost. But it also brings a risk of information leakage. The encryption of sensitive data is the basic step to resist the risk. However, deploying a high complexity encryption and decryption algorithm on mobile devices will greatly increase the burden of terminal operation and the difficulty to implement the necessary privacy protection algorithm. In this paper, we propose ENSURE (EfficieNt and SecURE), an efficient and secure encrypted search architecture over mobile cloud storage. ENSURE is inspired by edge computing. It allows mobile devices to offload the computation intensive task onto the edge server to achieve a high efficiency. Besides, to protect data security, it reduces the information acquisition of untrusted cloud by hiding the relevance between query keyword and search results from the cloud. Experiments on a real data set show that ENSURE reduces the computation time by 15% to 49% and saves the energy consumption by 38% to 69% per query. PMID:29652810
Edge-Based Efficient Search over Encrypted Data Mobile Cloud Storage.
Guo, Yeting; Liu, Fang; Cai, Zhiping; Xiao, Nong; Zhao, Ziming
2018-04-13
Smart sensor-equipped mobile devices sense, collect, and process data generated by the edge network to achieve intelligent control, but such mobile devices usually have limited storage and computing resources. Mobile cloud storage provides a promising solution owing to its rich storage resources, great accessibility, and low cost. But it also brings a risk of information leakage. The encryption of sensitive data is the basic step to resist the risk. However, deploying a high complexity encryption and decryption algorithm on mobile devices will greatly increase the burden of terminal operation and the difficulty to implement the necessary privacy protection algorithm. In this paper, we propose ENSURE (EfficieNt and SecURE), an efficient and secure encrypted search architecture over mobile cloud storage. ENSURE is inspired by edge computing. It allows mobile devices to offload the computation intensive task onto the edge server to achieve a high efficiency. Besides, to protect data security, it reduces the information acquisition of untrusted cloud by hiding the relevance between query keyword and search results from the cloud. Experiments on a real data set show that ENSURE reduces the computation time by 15% to 49% and saves the energy consumption by 38% to 69% per query.
CBESW: sequence alignment on the Playstation 3.
Wirawan, Adrianto; Kwoh, Chee Keong; Hieu, Nim Tri; Schmidt, Bertil
2008-09-17
The exponential growth of available biological data has caused bioinformatics to be rapidly moving towards a data-intensive, computational science. As a result, the computational power needed by bioinformatics applications is growing exponentially as well. The recent emergence of accelerator technologies has made it possible to achieve an excellent improvement in execution time for many bioinformatics applications, compared to current general-purpose platforms. In this paper, we demonstrate how the PlayStation 3, powered by the Cell Broadband Engine, can be used as a computational platform to accelerate the Smith-Waterman algorithm. For large datasets, our implementation on the PlayStation 3 provides a significant improvement in running time compared to other implementations such as SSEARCH, Striped Smith-Waterman and CUDA. Our implementation achieves a peak performance of up to 3,646 MCUPS. The results from our experiments demonstrate that the PlayStation 3 console can be used as an efficient low cost computational platform for high performance sequence alignment applications.
CBESW: Sequence Alignment on the Playstation 3
Wirawan, Adrianto; Kwoh, Chee Keong; Hieu, Nim Tri; Schmidt, Bertil
2008-01-01
Background The exponential growth of available biological data has caused bioinformatics to be rapidly moving towards a data-intensive, computational science. As a result, the computational power needed by bioinformatics applications is growing exponentially as well. The recent emergence of accelerator technologies has made it possible to achieve an excellent improvement in execution time for many bioinformatics applications, compared to current general-purpose platforms. In this paper, we demonstrate how the PlayStation® 3, powered by the Cell Broadband Engine, can be used as a computational platform to accelerate the Smith-Waterman algorithm. Results For large datasets, our implementation on the PlayStation® 3 provides a significant improvement in running time compared to other implementations such as SSEARCH, Striped Smith-Waterman and CUDA. Our implementation achieves a peak performance of up to 3,646 MCUPS. Conclusion The results from our experiments demonstrate that the PlayStation® 3 console can be used as an efficient low cost computational platform for high performance sequence alignment applications. PMID:18798993
NASA Astrophysics Data System (ADS)
Gilbert, B. K.; Robb, R. A.; Chu, A.; Kenue, S. K.; Lent, A. H.; Swartzlander, E. E., Jr.
1981-02-01
Rapid advances during the past ten years of several forms of computer-assisted tomography (CT) have resulted in the development of numerous algorithms to convert raw projection data into cross-sectional images. These reconstruction algorithms are either 'iterative,' in which a large matrix algebraic equation is solved by successive approximation techniques; or 'closed form'. Continuing evolution of the closed form algorithms has allowed the newest versions to produce excellent reconstructed images in most applications. This paper will review several computer software and special-purpose digital hardware implementations of closed form algorithms, either proposed during the past several years by a number of workers or actually implemented in commercial or research CT scanners. The discussion will also cover a number of recently investigated algorithmic modifications which reduce the amount of computation required to execute the reconstruction process, as well as several new special-purpose digital hardware implementations under development in laboratories at the Mayo Clinic.
Massively parallel algorithms for real-time wavefront control of a dense adaptive optics system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fijany, A.; Milman, M.; Redding, D.
1994-12-31
In this paper massively parallel algorithms and architectures for real-time wavefront control of a dense adaptive optic system (SELENE) are presented. The authors have already shown that the computation of a near optimal control algorithm for SELENE can be reduced to the solution of a discrete Poisson equation on a regular domain. Although, this represents an optimal computation, due the large size of the system and the high sampling rate requirement, the implementation of this control algorithm poses a computationally challenging problem since it demands a sustained computational throughput of the order of 10 GFlops. They develop a novel algorithm,more » designated as Fast Invariant Imbedding algorithm, which offers a massive degree of parallelism with simple communication and synchronization requirements. Due to these features, this algorithm is significantly more efficient than other Fast Poisson Solvers for implementation on massively parallel architectures. The authors also discuss two massively parallel, algorithmically specialized, architectures for low-cost and optimal implementation of the Fast Invariant Imbedding algorithm.« less
Experimental quantum computing without entanglement.
Lanyon, B P; Barbieri, M; Almeida, M P; White, A G
2008-11-14
Deterministic quantum computation with one pure qubit (DQC1) is an efficient model of computation that uses highly mixed states. Unlike pure-state models, its power is not derived from the generation of a large amount of entanglement. Instead it has been proposed that other nonclassical correlations are responsible for the computational speedup, and that these can be captured by the quantum discord. In this Letter we implement DQC1 in an all-optical architecture, and experimentally observe the generated correlations. We find no entanglement, but large amounts of quantum discord-except in three cases where an efficient classical simulation is always possible. Our results show that even fully separable, highly mixed, states can contain intrinsically quantum mechanical correlations and that these could offer a valuable resource for quantum information technologies.
Change Detection of Mobile LIDAR Data Using Cloud Computing
NASA Astrophysics Data System (ADS)
Liu, Kun; Boehm, Jan; Alis, Christian
2016-06-01
Change detection has long been a challenging problem although a lot of research has been conducted in different fields such as remote sensing and photogrammetry, computer vision, and robotics. In this paper, we blend voxel grid and Apache Spark together to propose an efficient method to address the problem in the context of big data. Voxel grid is a regular geometry representation consisting of the voxels with the same size, which fairly suites parallel computation. Apache Spark is a popular distributed parallel computing platform which allows fault tolerance and memory cache. These features can significantly enhance the performance of Apache Spark and results in an efficient and robust implementation. In our experiments, both synthetic and real point cloud data are employed to demonstrate the quality of our method.
NASA Technical Reports Server (NTRS)
Delaat, J. C.; Merrill, W. C.
1983-01-01
A sensor failure detection, isolation, and accommodation algorithm was developed which incorporates analytic sensor redundancy through software. This algorithm was implemented in a high level language on a microprocessor based controls computer. Parallel processing and state-of-the-art 16-bit microprocessors are used along with efficient programming practices to achieve real-time operation.
NASA Astrophysics Data System (ADS)
Kim, Jeong-Gyu; Kim, Woong-Tae; Ostriker, Eve C.; Skinner, M. Aaron
2017-12-01
We present an implementation of an adaptive ray-tracing (ART) module in the Athena hydrodynamics code that accurately and efficiently handles the radiative transfer involving multiple point sources on a three-dimensional Cartesian grid. We adopt a recently proposed parallel algorithm that uses nonblocking, asynchronous MPI communications to accelerate transport of rays across the computational domain. We validate our implementation through several standard test problems, including the propagation of radiation in vacuum and the expansions of various types of H II regions. Additionally, scaling tests show that the cost of a full ray trace per source remains comparable to that of the hydrodynamics update on up to ∼ {10}3 processors. To demonstrate application of our ART implementation, we perform a simulation of star cluster formation in a marginally bound, turbulent cloud, finding that its star formation efficiency is 12% when both radiation pressure forces and photoionization by UV radiation are treated. We directly compare the radiation forces computed from the ART scheme with those from the M1 closure relation. Although the ART and M1 schemes yield similar results on large scales, the latter is unable to resolve the radiation field accurately near individual point sources.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Littlefield, R.J.
1990-02-01
To implement an efficient data-parallel program on a non-shared memory MIMD multicomputer, data and computations must be properly partitioned to achieve good load balance and locality of reference. Programs with irregular data reference patterns often require irregular partitions. Although good partitions may be easy to determine, they can be difficult or impossible to implement in programming languages that provide only regular data distributions, such as blocked or cyclic arrays. We are developing Onyx, a programming system that provides a shared memory model of distributed data structures and extends the concept of data distribution to include irregular and dynamic distributions. Thismore » provides a powerful means to specify irregular partitions. Perhaps surprisingly, programs using it can also execute efficiently. In this paper, we describe and evaluate the Onyx implementation of a model problem that repeatedly executes an irregular but fixed data reference pattern. On an NCUBE hypercube, the speed of the Onyx implementation is comparable to that of carefully handwritten message-passing code.« less
Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito
2016-11-15
A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements from the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been performed: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
Smith, Daniel G A; Burns, Lori A; Sirianni, Dominic A; Nascimento, Daniel R; Kumar, Ashutosh; James, Andrew M; Schriber, Jeffrey B; Zhang, Tianyuan; Zhang, Boyi; Abbott, Adam S; Berquist, Eric J; Lechner, Marvin H; Cunha, Leonardo A; Heide, Alexander G; Waldrop, Jonathan M; Takeshita, Tyler Y; Alenaizan, Asem; Neuhauser, Daniel; King, Rollin A; Simmonett, Andrew C; Turney, Justin M; Schaefer, Henry F; Evangelista, Francesco A; DePrince, A Eugene; Crawford, T Daniel; Patkowski, Konrad; Sherrill, C David
2018-06-11
Psi4NumPy demonstrates the use of efficient computational kernels from the open-source Psi4 program through the popular NumPy library for linear algebra in Python to facilitate the rapid development of clear, understandable Python computer code for new quantum chemical methods, while maintaining a relatively low execution time. Using these tools, reference implementations have been created for a number of methods, including self-consistent field (SCF), SCF response, many-body perturbation theory, coupled-cluster theory, configuration interaction, and symmetry-adapted perturbation theory. Furthermore, several reference codes have been integrated into Jupyter notebooks, allowing background, underlying theory, and formula information to be associated with the implementation. Psi4NumPy tools and associated reference implementations can lower the barrier for future development of quantum chemistry methods. These implementations also demonstrate the power of the hybrid C++/Python programming approach employed by the Psi4 program.
Resource Efficient Hardware Architecture for Fast Computation of Running Max/Min Filters
Torres-Huitzil, Cesar
2013-01-01
Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires of k 2 − 1 comparisons per sample for a direct implementation; thus, performance scales expensively with the kernel size k. Faster computations can be achieved by kernel decomposition and using constant time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture design uses less computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters, on 1024 × 1024 images with up to 255 × 255 kernels, in around 8.4 milliseconds, 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable for the kernel size with good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding. PMID:24288456
ERIC Educational Resources Information Center
Bayne, Pauline S; Rader, Joe C.
The purpose of this project was to demonstrate that computer-based training (CBT) sessions, produced as HyperCard stacks (files), are an efficient and effective component for staff training in libraries. The purpose was successfully met in the 15-month period of development, evaluation, and implementation, and the University of Tennessee (UT)…
ERIC Educational Resources Information Center
McKinley, Kenneth H.; Self, Burl E., Jr.
A study was conducted to determine the feasibility of using the computer-based Synagraphic Mapping Program (SYMAP) and the Statistical Package for the Social Sciences (SPSS) in formulating an efficient and accurate information system which Creek Nation tribal staff could implement and use in planning for more effective and precise delivery of…
SYN-OP-SYS™: A Computerized Management Information System for Quality Assurance and Risk Management
Thomas, David J.; Weiner, Jayne; Lippincott, Ronald C.
1985-01-01
SYN·OP·SYS™ is a computerized management information system for quality assurance and risk management. Computer software for the efficient collection and analysis of “occurrences” and the clinical data associated with these kinds of patient events is described. The system is evaluated according to certain computer design criteria, and the system's implementation is assessed.
A parallel computing engine for a class of time critical processes.
Nabhan, T M; Zomaya, A Y
1997-01-01
This paper focuses on the efficient parallel implementation of systems of numerically intensive nature over loosely coupled multiprocessor architectures. These analytical models are of significant importance to many real-time systems that have to meet severe time constants. A parallel computing engine (PCE) has been developed in this work for the efficient simplification and the near optimal scheduling of numerical models over the different cooperating processors of the parallel computer. First, the analytical system is efficiently coded in its general form. The model is then simplified by using any available information (e.g., constant parameters). A task graph representing the interconnections among the different components (or equations) is generated. The graph can then be compressed to control the computation/communication requirements. The task scheduler employs a graph-based iterative scheme, based on the simulated annealing algorithm, to map the vertices of the task graph onto a Multiple-Instruction-stream Multiple-Data-stream (MIMD) type of architecture. The algorithm uses a nonanalytical cost function that properly considers the computation capability of the processors, the network topology, the communication time, and congestion possibilities. Moreover, the proposed technique is simple, flexible, and computationally viable. The efficiency of the algorithm is demonstrated by two case studies with good results.
Multiphysics Computational Analysis of a Solid-Core Nuclear Thermal Engine Thrust Chamber
NASA Technical Reports Server (NTRS)
Wang, Ten-See; Canabal, Francisco; Cheng, Gary; Chen, Yen-Sen
2007-01-01
The objective of this effort is to develop an efficient and accurate computational heat transfer methodology to predict thermal, fluid, and hydrogen environments for a hypothetical solid-core, nuclear thermal engine - the Small Engine. In addition, the effects of power profile and hydrogen conversion on heat transfer efficiency and thrust performance were also investigated. The computational methodology is based on an unstructured-grid, pressure-based, all speeds, chemically reacting, computational fluid dynamics platform, while formulations of conjugate heat transfer were implemented to describe the heat transfer from solid to hydrogen inside the solid-core reactor. The computational domain covers the entire thrust chamber so that the afore-mentioned heat transfer effects impact the thrust performance directly. The result shows that the computed core-exit gas temperature, specific impulse, and core pressure drop agree well with those of design data for the Small Engine. Finite-rate chemistry is very important in predicting the proper energy balance as naturally occurring hydrogen decomposition is endothermic. Locally strong hydrogen conversion associated with centralized power profile gives poor heat transfer efficiency and lower thrust performance. On the other hand, uniform hydrogen conversion associated with a more uniform radial power profile achieves higher heat transfer efficiency, and higher thrust performance.
NASA Technical Reports Server (NTRS)
Maine, R. E.; Iliff, K. W.
1980-01-01
A new formulation is proposed for the problem of parameter estimation of dynamic systems with both process and measurement noise. The formulation gives estimates that are maximum likelihood asymptotically in time. The means used to overcome the difficulties encountered by previous formulations are discussed. It is then shown how the proposed formulation can be efficiently implemented in a computer program. A computer program using the proposed formulation is available in a form suitable for routine application. Examples with simulated and real data are given to illustrate that the program works well.
Baryonic and mesonic 3-point functions with open spin indices
NASA Astrophysics Data System (ADS)
Bali, Gunnar S.; Collins, Sara; Gläßle, Benjamin; Heybrock, Simon; Korcyl, Piotr; Löffler, Marius; Rödl, Rudolf; Schäfer, Andreas
2018-03-01
We have implemented a new way of computing three-point correlation functions. It is based on a factorization of the entire correlation function into two parts which are evaluated with open spin-(and to some extent flavor-) indices. This allows us to estimate the two contributions simultaneously for many different initial and final states and momenta, with little computational overhead. We explain this factorization as well as its efficient implementation in a new library which has been written to provide the necessary functionality on modern parallel architectures and on CPUs, including Intel's Xeon Phi series.
Improving Design Efficiency for Large-Scale Heterogeneous Circuits
NASA Astrophysics Data System (ADS)
Gregerson, Anthony
Despite increases in logic density, many Big Data applications must still be partitioned across multiple computing devices in order to meet their strict performance requirements. Among the most demanding of these applications is high-energy physics (HEP), which uses complex computing systems consisting of thousands of FPGAs and ASICs to process the sensor data created by experiments at particles accelerators such as the Large Hadron Collider (LHC). Designing such computing systems is challenging due to the scale of the systems, the exceptionally high-throughput and low-latency performance constraints that necessitate application-specific hardware implementations, the requirement that algorithms are efficiently partitioned across many devices, and the possible need to update the implemented algorithms during the lifetime of the system. In this work, we describe our research to develop flexible architectures for implementing such large-scale circuits on FPGAs. In particular, this work is motivated by (but not limited in scope to) high-energy physics algorithms for the Compact Muon Solenoid (CMS) experiment at the LHC. To make efficient use of logic resources in multi-FPGA systems, we introduce Multi-Personality Partitioning, a novel form of the graph partitioning problem, and present partitioning algorithms that can significantly improve resource utilization on heterogeneous devices while also reducing inter-chip connections. To reduce the high communication costs of Big Data applications, we also introduce Information-Aware Partitioning, a partitioning method that analyzes the data content of application-specific circuits, characterizes their entropy, and selects circuit partitions that enable efficient compression of data between chips. We employ our information-aware partitioning method to improve the performance of the hardware validation platform for evaluating new algorithms for the CMS experiment. Together, these research efforts help to improve the efficiency and decrease the cost of the developing large-scale, heterogeneous circuits needed to enable large-scale application in high-energy physics and other important areas.
A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.
Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang
2018-01-01
The strategy of Divide-and-Conquer (D&C) is one of the frequently used programming patterns to design efficient algorithms in computer science, which has been parallelized on shared memory systems and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm, QuickHull, to give a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers to develop their own efficient GPU implementations with fewer efforts.
Xyce parallel electronic simulator users guide, version 6.1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas; Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers; A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models; Device models that are specifically tailored to meet Sandia's needs, including some radiationaware devices (for Sandia users only); and Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase-a message passing parallel implementation-which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator users' guide, Version 6.0.1.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation
Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan
2014-01-01
Through reorganizing the execution order and optimizing the data structure, we proposed an efficient parallel framework for H.264/AVC encoder based on massively parallel architecture. We implemented the proposed framework by CUDA on NVIDIA's GPU. Not only the compute intensive components of the H.264 encoder are parallelized but also the control intensive components are realized effectively, such as CAVLC and deblocking filter. In addition, we proposed serial optimization methods, including the multiresolution multiwindow for motion estimation, multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and direction-priority deblocking filter. More than 96% of workload of H.264 encoder is offloaded to GPU. Experimental results show that the parallel implementation outperforms the serial program by 20 times of speedup ratio and satisfies the requirement of the real-time HD encoding of 30 fps. The loss of PSNR is from 0.14 dB to 0.77 dB, when keeping the same bitrate. Through the analysis to the kernels, we found that speedup ratios of the compute intensive algorithms are proportional with the computation power of the GPU. However, the performance of the control intensive parts (CAVLC) is much related to the memory bandwidth, which gives an insight for new architecture design. PMID:24757432
Xyce parallel electronic simulator users guide, version 6.0.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Model checking for linear temporal logic: An efficient implementation
NASA Technical Reports Server (NTRS)
Sherman, Rivi; Pnueli, Amir
1990-01-01
This report provides evidence to support the claim that model checking for linear temporal logic (LTL) is practically efficient. Two implementations of a linear temporal logic model checker is described. One is based on transforming the model checking problem into a satisfiability problem; the other checks an LTL formula for a finite model by computing the cross-product of the finite state transition graph of the program with a structure containing all possible models for the property. An experiment was done with a set of mutual exclusion algorithms and tested safety and liveness under fairness for these algorithms.
A Decentralized Eigenvalue Computation Method for Spectrum Sensing Based on Average Consensus
NASA Astrophysics Data System (ADS)
Mohammadi, Jafar; Limmer, Steffen; Stańczak, Sławomir
2016-07-01
This paper considers eigenvalue estimation for the decentralized inference problem for spectrum sensing. We propose a decentralized eigenvalue computation algorithm based on the power method, which is referred to as generalized power method GPM; it is capable of estimating the eigenvalues of a given covariance matrix under certain conditions. Furthermore, we have developed a decentralized implementation of GPM by splitting the iterative operations into local and global computation tasks. The global tasks require data exchange to be performed among the nodes. For this task, we apply an average consensus algorithm to efficiently perform the global computations. As a special case, we consider a structured graph that is a tree with clusters of nodes at its leaves. For an accelerated distributed implementation, we propose to use computation over multiple access channel (CoMAC) as a building block of the algorithm. Numerical simulations are provided to illustrate the performance of the two algorithms.
Deep learning with coherent nanophotonic circuits
NASA Astrophysics Data System (ADS)
Shen, Yichen; Harris, Nicholas C.; Skirlo, Scott; Prabhu, Mihika; Baehr-Jones, Tom; Hochberg, Michael; Sun, Xin; Zhao, Shijie; Larochelle, Hugo; Englund, Dirk; Soljačić, Marin
2017-07-01
Artificial neural networks are computational network models inspired by signal processing in the brain. These models have dramatically improved performance for many machine-learning tasks, including speech and image recognition. However, today's computing hardware is inefficient at implementing neural networks, in large part because much of it was designed for von Neumann computing schemes. Significant effort has been made towards developing electronic architectures tuned to implement artificial neural networks that exhibit improved computational speed and accuracy. Here, we propose a new architecture for a fully optical neural network that, in principle, could offer an enhancement in computational speed and power efficiency over state-of-the-art electronics for conventional inference tasks. We experimentally demonstrate the essential part of the concept using a programmable nanophotonic processor featuring a cascaded array of 56 programmable Mach-Zehnder interferometers in a silicon photonic integrated circuit and show its utility for vowel recognition.
NASA Astrophysics Data System (ADS)
Lanzagorta, Marco O.; Gomez, Richard B.; Uhlmann, Jeffrey K.
2003-08-01
In recent years, computer graphics has emerged as a critical component of the scientific and engineering process, and it is recognized as an important computer science research area. Computer graphics are extensively used for a variety of aerospace and defense training systems and by Hollywood's special effects companies. All these applications require the computer graphics systems to produce high quality renderings of extremely large data sets in short periods of time. Much research has been done in "classical computing" toward the development of efficient methods and techniques to reduce the rendering time required for large datasets. Quantum Computing's unique algorithmic features offer the possibility of speeding up some of the known rendering algorithms currently used in computer graphics. In this paper we discuss possible implementations of quantum rendering algorithms. In particular, we concentrate on the implementation of Grover's quantum search algorithm for Z-buffering, ray-tracing, radiosity, and scene management techniques. We also compare the theoretical performance between the classical and quantum versions of the algorithms.
Meyer, Frans J C; Davidson, David B; Jakobus, Ulrich; Stuchly, Maria A
2003-02-01
A hybrid finite-element method (FEM)/method of moments (MoM) technique is employed for specific absorption rate (SAR) calculations in a human phantom in the near field of a typical group special mobile (GSM) base-station antenna. The MoM is used to model the metallic surfaces and wires of the base-station antenna, and the FEM is used to model the heterogeneous human phantom. The advantages of each of these frequency domain techniques are, thus, exploited, leading to a highly efficient and robust numerical method for addressing this type of bioelectromagnetic problem. The basic mathematical formulation of the hybrid technique is presented. This is followed by a discussion of important implementation details-in particular, the linear algebra routines for sparse, complex FEM matrices combined with dense MoM matrices. The implementation is validated by comparing results to MoM (surface equivalence principle implementation) and finite-difference time-domain (FDTD) solutions of human exposure problems. A comparison of the computational efficiency of the different techniques is presented. The FEM/MoM implementation is then used for whole-body and critical-organ SAR calculations in a phantom at different positions in the near field of a base-station antenna. This problem cannot, in general, be solved using the MoM or FDTD due to computational limitations. This paper shows that the specific hybrid FEM/MoM implementation is an efficient numerical tool for accurate assessment of human exposure in the near field of base-station antennas.
Efficient parallel resolution of the simplified transport equations in mixed-dual formulation
NASA Astrophysics Data System (ADS)
Barrault, M.; Lathuilière, B.; Ramet, P.; Roman, J.
2011-03-01
A reactivity computation consists of computing the highest eigenvalue of a generalized eigenvalue problem, for which an inverse power algorithm is commonly used. Very fine modelizations are difficult to treat for our sequential solver, based on the simplified transport equations, in terms of memory consumption and computational time. A first implementation of a Lagrangian based domain decomposition method brings to a poor parallel efficiency because of an increase in the power iterations [1]. In order to obtain a high parallel efficiency, we improve the parallelization scheme by changing the location of the loop over the subdomains in the overall algorithm and by benefiting from the characteristics of the Raviart-Thomas finite element. The new parallel algorithm still allows us to locally adapt the numerical scheme (mesh, finite element order). However, it can be significantly optimized for the matching grid case. The good behavior of the new parallelization scheme is demonstrated for the matching grid case on several hundreds of nodes for computations based on a pin-by-pin discretization.
Composite SAR imaging using sequential joint sparsity
NASA Astrophysics Data System (ADS)
Sanders, Toby; Gelb, Anne; Platte, Rodrigo B.
2017-06-01
This paper investigates accurate and efficient ℓ1 regularization methods for generating synthetic aperture radar (SAR) images. Although ℓ1 regularization algorithms are already employed in SAR imaging, practical and efficient implementation in terms of real time imaging remain a challenge. Here we demonstrate that fast numerical operators can be used to robustly implement ℓ1 regularization methods that are as or more efficient than traditional approaches such as back projection, while providing superior image quality. In particular, we develop a sequential joint sparsity model for composite SAR imaging which naturally combines the joint sparsity methodology with composite SAR. Our technique, which can be implemented using standard, fractional, or higher order total variation regularization, is able to reduce the effects of speckle and other noisy artifacts with little additional computational cost. Finally we show that generalizing total variation regularization to non-integer and higher orders provides improved flexibility and robustness for SAR imaging.
NASA Technical Reports Server (NTRS)
Nielsen, Eric J.; Kleb, William L.
2005-01-01
A methodology is developed and implemented to mitigate the lengthy software development cycle typically associated with constructing a discrete adjoint solver for aerodynamic simulations. The approach is based on a complex-variable formulation that enables straightforward differentiation of complicated real-valued functions. An automated scripting process is used to create the complex-variable form of the set of discrete equations. An efficient method for assembling the residual and cost function linearizations is developed. The accuracy of the implementation is verified through comparisons with a discrete direct method as well as a previously developed handcoded discrete adjoint approach. Comparisons are also shown for a large-scale configuration to establish the computational efficiency of the present scheme. To ultimately demonstrate the power of the approach, the implementation is extended to high temperature gas flows in chemical nonequilibrium. Finally, several fruitful research and development avenues enabled by the current work are suggested.
Efficient Construction of Discrete Adjoint Operators on Unstructured Grids Using Complex Variables
NASA Technical Reports Server (NTRS)
Nielsen, Eric J.; Kleb, William L.
2005-01-01
A methodology is developed and implemented to mitigate the lengthy software development cycle typically associated with constructing a discrete adjoint solver for aerodynamic simulations. The approach is based on a complex-variable formulation that enables straightforward differentiation of complicated real-valued functions. An automated scripting process is used to create the complex-variable form of the set of discrete equations. An efficient method for assembling the residual and cost function linearizations is developed. The accuracy of the implementation is verified through comparisons with a discrete direct method as well as a previously developed handcoded discrete adjoint approach. Comparisons are also shown for a large-scale configuration to establish the computational efficiency of the present scheme. To ultimately demonstrate the power of the approach, the implementation is extended to high temperature gas flows in chemical nonequilibrium. Finally, several fruitful research and development avenues enabled by the current work are suggested.
New Developments in Modeling MHD Systems on High Performance Computing Architectures
NASA Astrophysics Data System (ADS)
Germaschewski, K.; Raeder, J.; Larson, D. J.; Bhattacharjee, A.
2009-04-01
Modeling the wide range of time and length scales present even in fluid models of plasmas like MHD and X-MHD (Extended MHD including two fluid effects like Hall term, electron inertia, electron pressure gradient) is challenging even on state-of-the-art supercomputers. In the last years, HPC capacity has continued to grow exponentially, but at the expense of making the computer systems more and more difficult to program in order to get maximum performance. In this paper, we will present a new approach to managing the complexity caused by the need to write efficient codes: Separating the numerical description of the problem, in our case a discretized right hand side (r.h.s.), from the actual implementation of efficiently evaluating it. An automatic code generator is used to describe the r.h.s. in a quasi-symbolic form while leaving the translation into efficient and parallelized code to a computer program itself. We implemented this approach for OpenGGCM (Open General Geospace Circulation Model), a model of the Earth's magnetosphere, which was accelerated by a factor of three on regular x86 architecture and a factor of 25 on the Cell BE architecture (commonly known for its deployment in Sony's PlayStation 3).
Quantum computing with incoherent resources and quantum jumps.
Santos, M F; Cunha, M Terra; Chaves, R; Carvalho, A R R
2012-04-27
Spontaneous emission and the inelastic scattering of photons are two natural processes usually associated with decoherence and the reduction in the capacity to process quantum information. Here we show that, when suitably detected, these photons are sufficient to build all the fundamental blocks needed to perform quantum computation in the emitting qubits while protecting them from deleterious dissipative effects. We exemplify this by showing how to efficiently prepare graph states for the implementation of measurement-based quantum computation.
Li, Chuan; Petukh, Marharyta; Li, Lin; Alexov, Emil
2013-08-15
Due to the enormous importance of electrostatics in molecular biology, calculating the electrostatic potential and corresponding energies has become a standard computational approach for the study of biomolecules and nano-objects immersed in water and salt phase or other media. However, the electrostatics of large macromolecules and macromolecular complexes, including nano-objects, may not be obtainable via explicit methods and even the standard continuum electrostatics methods may not be applicable due to high computational time and memory requirements. Here, we report further development of the parallelization scheme reported in our previous work (Li, et al., J. Comput. Chem. 2012, 33, 1960) to include parallelization of the molecular surface and energy calculations components of the algorithm. The parallelization scheme utilizes different approaches such as space domain parallelization, algorithmic parallelization, multithreading, and task scheduling, depending on the quantity being calculated. This allows for efficient use of the computing resources of the corresponding computer cluster. The parallelization scheme is implemented in the popular software DelPhi and results in speedup of several folds. As a demonstration of the efficiency and capability of this methodology, the electrostatic potential, and electric field distributions are calculated for the bovine mitochondrial supercomplex illustrating their complex topology, which cannot be obtained by modeling the supercomplex components alone. Copyright © 2013 Wiley Periodicals, Inc.
Romps, David M.
2016-03-01
Convective entrainment is a process that is poorly represented in existing convective parameterizations. By many estimates, convective entrainment is the leading source of error in global climate models. As a potential remedy, an Eulerian implementation of the Stochastic Parcel Model (SPM) is presented here as a convective parameterization that treats entrainment in a physically realistic and computationally efficient way. Drawing on evidence that convecting clouds comprise air parcels subject to Poisson-process entrainment events, the SPM calculates the deterministic limit of an infinite number of such parcels. For computational efficiency, the SPM groups parcels at each height by their purity, whichmore » is a measure of their total entrainment up to that height. This reduces the calculation of convective fluxes to a sequence of matrix multiplications. The SPM is implemented in a single-column model and compared with a large-eddy simulation of deep convection.« less
User-Defined Data Distributions in High-Level Programming Languages
NASA Technical Reports Server (NTRS)
Diaconescu, Roxana E.; Zima, Hans P.
2006-01-01
One of the characteristic features of today s high performance computing systems is a physically distributed memory. Efficient management of locality is essential for meeting key performance requirements for these architectures. The standard technique for dealing with this issue has involved the extension of traditional sequential programming languages with explicit message passing, in the context of a processor-centric view of parallel computation. This has resulted in complex and error-prone assembly-style codes in which algorithms and communication are inextricably interwoven. This paper presents a high-level approach to the design and implementation of data distributions. Our work is motivated by the need to improve the current parallel programming methodology by introducing a paradigm supporting the development of efficient and reusable parallel code. This approach is currently being implemented in the context of a new programming language called Chapel, which is designed in the HPCS project Cascade.
An optimized and low-cost FPGA-based DNA sequence alignment--a step towards personal genomics.
Shah, Hurmat Ali; Hasan, Laiq; Ahmad, Nasir
2013-01-01
DNA sequence alignment is a cardinal process in computational biology but also is much expensive computationally when performing through traditional computational platforms like CPU. Of many off the shelf platforms explored for speeding up the computation process, FPGA stands as the best candidate due to its performance per dollar spent and performance per watt. These two advantages make FPGA as the most appropriate choice for realizing the aim of personal genomics. The previous implementation of DNA sequence alignment did not take into consideration the price of the device on which optimization was performed. This paper presents optimization over previous FPGA implementation that increases the overall speed-up achieved as well as the price incurred by the platform that was optimized. The optimizations are (1) The array of processing elements is made to run on change in input value and not on clock, so eliminating the need for tight clock synchronization, (2) the implementation is unrestrained by the size of the sequences to be aligned, (3) the waiting time required for the sequences to load to FPGA is reduced to the minimum possible and (4) an efficient method is devised to store the output matrix that make possible to save the diagonal elements to be used in next pass, in parallel with the computation of output matrix. Implemented on Spartan3 FPGA, this implementation achieved 20 times performance improvement in terms of CUPS over GPP implementation.
Reaching out to clinicians: implementation of a computerized alert system.
Degnan, Dan; Merryfield, Dave; Hultgren, Steve
2004-01-01
Several published articles have identified that providing automated, computer-generated clinical alerts about potentially critical clinical situations should result in better quality of care. In 1999, the pharmacy department at a community hospital network implemented and refined a commercially available, computerized clinical alert system. This case report discusses the implementation process, gives examples of how the system is used, and describes results following implementation. The use of the clinical alert system in this hospital network resulted in improved patient safety as well as in greater efficiency and decreased costs.
Granovsky, Alexander A
2015-12-21
We present a new, very efficient semi-numerical approach for the computation of state-specific nuclear gradients of a generic state-averaged multi-configuration self consistent field wavefunction. Our approach eliminates the costly coupled-perturbed multi-configuration Hartree-Fock step as well as the associated integral transformation stage. The details of the implementation within the Firefly quantum chemistry package are discussed and several sample applications are given. The new approach is routinely applicable to geometry optimization of molecular systems with 1000+ basis functions using a standalone multi-core workstation.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Granovsky, Alexander A., E-mail: alex.granovsky@gmail.com
We present a new, very efficient semi-numerical approach for the computation of state-specific nuclear gradients of a generic state-averaged multi-configuration self consistent field wavefunction. Our approach eliminates the costly coupled-perturbed multi-configuration Hartree-Fock step as well as the associated integral transformation stage. The details of the implementation within the Firefly quantum chemistry package are discussed and several sample applications are given. The new approach is routinely applicable to geometry optimization of molecular systems with 1000+ basis functions using a standalone multi-core workstation.
Efficient grid-based techniques for density functional theory
NASA Astrophysics Data System (ADS)
Rodriguez-Hernandez, Juan Ignacio
Understanding the chemical and physical properties of molecules and materials at a fundamental level often requires quantum-mechanical models for these substance's electronic structure. This type of many body quantum mechanics calculation is computationally demanding, hindering its application to substances with more than a few hundreds atoms. The supreme goal of many researches in quantum chemistry---and the topic of this dissertation---is to develop more efficient computational algorithms for electronic structure calculations. In particular, this dissertation develops two new numerical integration techniques for computing molecular and atomic properties within conventional Kohn-Sham-Density Functional Theory (KS-DFT) of molecular electronic structure. The first of these grid-based techniques is based on the transformed sparse grid construction. In this construction, a sparse grid is generated in the unit cube and then mapped to real space according to the pro-molecular density using the conditional distribution transformation. The transformed sparse grid was implemented in program deMon2k, where it is used as the numerical integrator for the exchange-correlation energy and potential in the KS-DFT procedure. We tested our grid by computing ground state energies, equilibrium geometries, and atomization energies. The accuracy on these test calculations shows that our grid is more efficient than some previous integration methods: our grids use fewer points to obtain the same accuracy. The transformed sparse grids were also tested for integrating, interpolating and differentiating in different dimensions (n = 1,2,3,6). The second technique is a grid-based method for computing atomic properties within QTAIM. It was also implemented in deMon2k. The performance of the method was tested by computing QTAIM atomic energies, charges, dipole moments, and quadrupole moments. For medium accuracy, our method is the fastest one we know of.
NASA Astrophysics Data System (ADS)
Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Hong, Yang; Zuo, Depeng; Ren, Minglei; Lei, Tianjie; Liang, Ke
2018-01-01
Hydrological model calibration has been a hot issue for decades. The shuffled complex evolution method developed at the University of Arizona (SCE-UA) has been proved to be an effective and robust optimization approach. However, its computational efficiency deteriorates significantly when the amount of hydrometeorological data increases. In recent years, the rise of heterogeneous parallel computing has brought hope for the acceleration of hydrological model calibration. This study proposed a parallel SCE-UA method and applied it to the calibration of a watershed rainfall-runoff model, the Xinanjiang model. The parallel method was implemented on heterogeneous computing systems using OpenMP and CUDA. Performance testing and sensitivity analysis were carried out to verify its correctness and efficiency. Comparison results indicated that heterogeneous parallel computing-accelerated SCE-UA converged much more quickly than the original serial version and possessed satisfactory accuracy and stability for the task of fast hydrological model calibration.
Exploiting the chaotic behaviour of atmospheric models with reconfigurable architectures
NASA Astrophysics Data System (ADS)
Russell, Francis P.; Düben, Peter D.; Niu, Xinyu; Luk, Wayne; Palmer, T. N.
2017-12-01
Reconfigurable architectures are becoming mainstream: Amazon, Microsoft and IBM are supporting such architectures in their data centres. The computationally intensive nature of atmospheric modelling is an attractive target for hardware acceleration using reconfigurable computing. Performance of hardware designs can be improved through the use of reduced-precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced-precision optimisation for simulating chaotic systems, targeting atmospheric modelling, in which even minor changes in arithmetic behaviour will cause simulations to diverge quickly. The possibility of equally valid simulations having differing outcomes means that standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide reconfigurable designs of a chaotic system, then analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, the throughput and energy efficiency of a single-precision chaotic system implemented on a Xilinx Virtex-6 SX475T Field Programmable Gate Array (FPGA) can be more than doubled.
Scemama, Anthony; Renon, Nicolas; Rapacioli, Mathias
2014-06-10
We present an algorithm and its parallel implementation for solving a self-consistent problem as encountered in Hartree-Fock or density functional theory. The algorithm takes advantage of the sparsity of matrices through the use of local molecular orbitals. The implementation allows one to exploit efficiently modern symmetric multiprocessing (SMP) computer architectures. As a first application, the algorithm is used within the density-functional-based tight binding method, for which most of the computational time is spent in the linear algebra routines (diagonalization of the Fock/Kohn-Sham matrix). We show that with this algorithm (i) single point calculations on very large systems (millions of atoms) can be performed on large SMP machines, (ii) calculations involving intermediate size systems (1000-100 000 atoms) are also strongly accelerated and can run efficiently on standard servers, and (iii) the error on the total energy due to the use of a cutoff in the molecular orbital coefficients can be controlled such that it remains smaller than the SCF convergence criterion.
Massively parallel multicanonical simulations
NASA Astrophysics Data System (ADS)
Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard
2018-03-01
Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 104 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field.
LaRC local area networks to support distributed computing
NASA Technical Reports Server (NTRS)
Riddle, E. P.
1984-01-01
The Langley Research Center's (LaRC) Local Area Network (LAN) effort is discussed. LaRC initiated the development of a LAN to support a growing distributed computing environment at the Center. The purpose of the network is to provide an improved capability (over inteactive and RJE terminal access) for sharing multivendor computer resources. Specifically, the network will provide a data highway for the transfer of files between mainframe computers, minicomputers, work stations, and personal computers. An important influence on the overall network design was the vital need of LaRC researchers to efficiently utilize the large CDC mainframe computers in the central scientific computing facility. Although there was a steady migration from a centralized to a distributed computing environment at LaRC in recent years, the work load on the central resources increased. Major emphasis in the network design was on communication with the central resources within the distributed environment. The network to be implemented will allow researchers to utilize the central resources, distributed minicomputers, work stations, and personal computers to obtain the proper level of computing power to efficiently perform their jobs.
NASA Astrophysics Data System (ADS)
Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.
2017-07-01
Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).
Arterial waveguide model for shear wave elastography: implementation and in vitro validation
NASA Astrophysics Data System (ADS)
Vaziri Astaneh, Ali; Urban, Matthew W.; Aquino, Wilkins; Greenleaf, James F.; Guddati, Murthy N.
2017-07-01
Arterial stiffness is found to be an early indicator of many cardiovascular diseases. Among various techniques, shear wave elastography has emerged as a promising tool for estimating local arterial stiffness through the observed dispersion of guided waves. In this paper, we develop efficient models for the computational simulation of guided wave dispersion in arterial walls. The models are capable of considering fluid-loaded tubes, immersed in fluid or embedded in a solid, which are encountered in in vitro/ex vivo, and in vivo experiments. The proposed methods are based on judiciously combining Fourier transformation and finite element discretization, leading to a significant reduction in computational cost while fully capturing complex 3D wave propagation. The developed methods are implemented in open-source code, and verified by comparing them with significantly more expensive, fully 3D finite element models. We also validate the models using the shear wave elastography of tissue-mimicking phantoms. The computational efficiency of the developed methods indicates the possibility of being able to estimate arterial stiffness in real time, which would be beneficial in clinical settings.
Robust optimization with transiently chaotic dynamical systems
NASA Astrophysics Data System (ADS)
Sumi, R.; Molnár, B.; Ercsey-Ravasz, M.
2014-05-01
Efficiently solving hard optimization problems has been a strong motivation for progress in analog computing. In a recent study we presented a continuous-time dynamical system for solving the NP-complete Boolean satisfiability (SAT) problem, with a one-to-one correspondence between its stable attractors and the SAT solutions. While physical implementations could offer great efficiency, the transiently chaotic dynamics raises the question of operability in the presence of noise, unavoidable on analog devices. Here we show that the probability of finding solutions is robust to noise intensities well above those present on real hardware. We also developed a cellular neural network model realizable with analog circuits, which tolerates even larger noise intensities. These methods represent an opportunity for robust and efficient physical implementations.
Hardware realization of an SVM algorithm implemented in FPGAs
NASA Astrophysics Data System (ADS)
Wiśniewski, Remigiusz; Bazydło, Grzegorz; Szcześniak, Paweł
2017-08-01
The paper proposes a technique of hardware realization of a space vector modulation (SVM) of state function switching in matrix converter (MC), oriented on the implementation in a single field programmable gate array (FPGA). In MC the SVM method is based on the instantaneous space-vector representation of input currents and output voltages. The traditional computation algorithms usually involve digital signal processors (DSPs) which consumes the large number of power transistors (18 transistors and 18 independent PWM outputs) and "non-standard positions of control pulses" during the switching sequence. Recently, hardware implementations become popular since computed operations may be executed much faster and efficient due to nature of the digital devices (especially concurrency). In the paper, we propose a hardware algorithm of SVM computation. In opposite to the existing techniques, the presented solution applies COordinate Rotation DIgital Computer (CORDIC) method to solve the trigonometric operations. Furthermore, adequate arithmetic modules (that is, sub-devices) used for intermediate calculations, such as code converters or proper sectors selectors (for output voltages and input current) are presented in detail. The proposed technique has been implemented as a design described with the use of Verilog hardware description language. The preliminary results of logic implementation oriented on the Xilinx FPGA (particularly, low-cost device from Artix-7 family from Xilinx was used) are also presented.
NASA Astrophysics Data System (ADS)
Franzke, Yannick J.; Middendorf, Nils; Weigend, Florian
2018-03-01
We present an efficient algorithm for one- and two-component analytical energy gradients with respect to nuclear displacements in the exact two-component decoupling approach to the one-electron Dirac equation (X2C). Our approach is a generalization of the spin-free ansatz by Cheng and Gauss [J. Chem. Phys. 135, 084114 (2011)], where the perturbed one-electron Hamiltonian is calculated by solving a first-order response equation. Computational costs are drastically reduced by applying the diagonal local approximation to the unitary decoupling transformation (DLU) [D. Peng and M. Reiher, J. Chem. Phys. 136, 244108 (2012)] to the X2C Hamiltonian. The introduced error is found to be almost negligible as the mean absolute error of the optimized structures amounts to only 0.01 pm. Our implementation in TURBOMOLE is also available within the finite nucleus model based on a Gaussian charge distribution. For a X2C/DLU gradient calculation, computational effort scales cubically with the molecular size, while storage increases quadratically. The efficiency is demonstrated in calculations of large silver clusters and organometallic iridium complexes.
A time-efficient algorithm for implementing the Catmull-Clark subdivision method
NASA Astrophysics Data System (ADS)
Ioannou, G.; Savva, A.; Stylianou, V.
2015-10-01
Splines are the most popular methods in Figure Modeling and CAGD (Computer Aided Geometric Design) in generating smooth surfaces from a number of control points. The control points define the shape of a figure and splines calculate the required number of points which when displayed on a computer screen the result is a smooth surface. However, spline methods are based on a rectangular topological structure of points, i.e., a two-dimensional table of vertices, and thus cannot generate complex figures, such as the human and animal bodies that their complex structure does not allow them to be defined by a regular rectangular grid. On the other hand surface subdivision methods, which are derived by splines, generate surfaces which are defined by an arbitrary topology of control points. This is the reason that during the last fifteen years subdivision methods have taken the lead over regular spline methods in all areas of modeling in both industry and research. The cost of executing computer software developed to read control points and calculate the surface is run-time, due to the fact that the surface-structure required for handling arbitrary topological grids is very complicate. There are many software programs that have been developed related to the implementation of subdivision surfaces however, not many algorithms are documented in the literature, to support developers for writing efficient code. This paper aims to assist programmers by presenting a time-efficient algorithm for implementing subdivision splines. The Catmull-Clark which is the most popular of the subdivision methods has been employed to illustrate the algorithm.
Simulation system architecture design for generic communications link
NASA Technical Reports Server (NTRS)
Tsang, Chit-Sang; Ratliff, Jim
1986-01-01
This paper addresses a computer simulation system architecture design for generic digital communications systems. It addresses the issues of an overall system architecture in order to achieve a user-friendly, efficient, and yet easily implementable simulation system. The system block diagram and its individual functional components are described in detail. Software implementation is discussed with the VAX/VMS operating system used as a target environment.
Higher-order methods for simulations on quantum computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sornborger, A.T.; Stewart, E.D.
1999-09-01
To implement many-qubit gates for use in quantum simulations on quantum computers efficiently, we develop and present methods reexpressing exp[[minus]i(H[sub 1]+H[sub 2]+[center dot][center dot][center dot])[Delta]t] as a product of factors exp[[minus]iH[sub 1][Delta]t], exp[[minus]iH[sub 2][Delta]t],[hor ellipsis], which is accurate to third or fourth order in [Delta]t. The methods we derive are an extended form of the symplectic method, and can also be used for an integration of classical Hamiltonians on classical computers. We derive both integral and irrational methods, and find the most efficient methods in both cases. [copyright] [ital 1999] [ital The American Physical Society
Yan, Hao; Wang, Xiaoyu; Shi, Feng; Bai, Ti; Folkerts, Michael; Cervino, Laura; Jiang, Steve B.; Jia, Xun
2014-01-01
Purpose: Compressed sensing (CS)-based iterative reconstruction (IR) techniques are able to reconstruct cone-beam CT (CBCT) images from undersampled noisy data, allowing for imaging dose reduction. However, there are a few practical concerns preventing the clinical implementation of these techniques. On the image quality side, data truncation along the superior–inferior direction under the cone-beam geometry produces severe cone artifacts in the reconstructed images. Ring artifacts are also seen in the half-fan scan mode. On the reconstruction efficiency side, the long computation time hinders clinical use in image-guided radiation therapy (IGRT). Methods: Image quality improvement methods are proposed to mitigate the cone and ring image artifacts in IR. The basic idea is to use weighting factors in the IR data fidelity term to improve projection data consistency with the reconstructed volume. In order to improve the computational efficiency, a multiple graphics processing units (GPUs)-based CS-IR system was developed. The parallelization scheme, detailed analyses of computation time at each step, their relationship with image resolution, and the acceleration factors were studied. The whole system was evaluated in various phantom and patient cases. Results: Ring artifacts can be mitigated by properly designing a weighting factor as a function of the spatial location on the detector. As for the cone artifact, without applying a correction method, it contaminated 13 out of 80 slices in a head-neck case (full-fan). Contamination was even more severe in a pelvis case under half-fan mode, where 36 out of 80 slices were affected, leading to poorer soft tissue delineation and reduced superior–inferior coverage. The proposed method effectively corrects those contaminated slices with mean intensity differences compared to FDK results decreasing from ∼497 and ∼293 HU to ∼39 and ∼27 HU for the full-fan and half-fan cases, respectively. In terms of efficiency boost, an overall 3.1 × speedup factor has been achieved with four GPU cards compared to a single GPU-based reconstruction. The total computation time is ∼30 s for typical clinical cases. Conclusions: The authors have developed a low-dose CBCT IR system for IGRT. By incorporating data consistency-based weighting factors in the IR model, cone/ring artifacts can be mitigated. A boost in computational efficiency is achieved by multi-GPU implementation. PMID:25370645
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yan, Hao, E-mail: steve.jiang@utsouthwestern.edu, E-mail: xun.jia@utsouthwestern.edu; Shi, Feng; Jiang, Steve B.
Purpose: Compressed sensing (CS)-based iterative reconstruction (IR) techniques are able to reconstruct cone-beam CT (CBCT) images from undersampled noisy data, allowing for imaging dose reduction. However, there are a few practical concerns preventing the clinical implementation of these techniques. On the image quality side, data truncation along the superior–inferior direction under the cone-beam geometry produces severe cone artifacts in the reconstructed images. Ring artifacts are also seen in the half-fan scan mode. On the reconstruction efficiency side, the long computation time hinders clinical use in image-guided radiation therapy (IGRT). Methods: Image quality improvement methods are proposed to mitigate the conemore » and ring image artifacts in IR. The basic idea is to use weighting factors in the IR data fidelity term to improve projection data consistency with the reconstructed volume. In order to improve the computational efficiency, a multiple graphics processing units (GPUs)-based CS-IR system was developed. The parallelization scheme, detailed analyses of computation time at each step, their relationship with image resolution, and the acceleration factors were studied. The whole system was evaluated in various phantom and patient cases. Results: Ring artifacts can be mitigated by properly designing a weighting factor as a function of the spatial location on the detector. As for the cone artifact, without applying a correction method, it contaminated 13 out of 80 slices in a head-neck case (full-fan). Contamination was even more severe in a pelvis case under half-fan mode, where 36 out of 80 slices were affected, leading to poorer soft tissue delineation and reduced superior–inferior coverage. The proposed method effectively corrects those contaminated slices with mean intensity differences compared to FDK results decreasing from ∼497 and ∼293 HU to ∼39 and ∼27 HU for the full-fan and half-fan cases, respectively. In terms of efficiency boost, an overall 3.1 × speedup factor has been achieved with four GPU cards compared to a single GPU-based reconstruction. The total computation time is ∼30 s for typical clinical cases. Conclusions: The authors have developed a low-dose CBCT IR system for IGRT. By incorporating data consistency-based weighting factors in the IR model, cone/ring artifacts can be mitigated. A boost in computational efficiency is achieved by multi-GPU implementation.« less
Analysis of multigrid methods on massively parallel computers: Architectural implications
NASA Technical Reports Server (NTRS)
Matheson, Lesley R.; Tarjan, Robert E.
1993-01-01
We study the potential performance of multigrid algorithms running on massively parallel computers with the intent of discovering whether presently envisioned machines will provide an efficient platform for such algorithms. We consider the domain parallel version of the standard V cycle algorithm on model problems, discretized using finite difference techniques in two and three dimensions on block structured grids of size 10(exp 6) and 10(exp 9), respectively. Our models of parallel computation were developed to reflect the computing characteristics of the current generation of massively parallel multicomputers. These models are based on an interconnection network of 256 to 16,384 message passing, 'workstation size' processors executing in an SPMD mode. The first model accomplishes interprocessor communications through a multistage permutation network. The communication cost is a logarithmic function which is similar to the costs in a variety of different topologies. The second model allows single stage communication costs only. Both models were designed with information provided by machine developers and utilize implementation derived parameters. With the medium grain parallelism of the current generation and the high fixed cost of an interprocessor communication, our analysis suggests an efficient implementation requires the machine to support the efficient transmission of long messages, (up to 1000 words) or the high initiation cost of a communication must be significantly reduced through an alternative optimization technique. Furthermore, with variable length message capability, our analysis suggests the low diameter multistage networks provide little or no advantage over a simple single stage communications network.
Efficient Computer Implementations of Fast Fourier Transforms.
1980-12-01
fit in computer? Yes, continue (9) Determine fastest algorithm between WFTA and PFA from Table 4.6. For N=420, WFTA PFA Mult 1296 2528 Add 11352 10956...real adds = 24tN/4 + 2(3tN/4) = 15tN/2 (G.8) 260 All odd prime C<ictors ciual to or (,rater than 5 iso the general transform section. Based on the
Efficient Computation Of Manipulator Inertia Matrix
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1991-01-01
Improved method for computation of manipulator inertia matrix developed, based on concept of spatial inertia of composite rigid body. Required for implementation of advanced dynamic-control schemes as well as dynamic simulation of manipulator motion. Motivated by increasing demand for fast algorithms to provide real-time control and simulation capability and, particularly, need for faster-than-real-time simulation capability, required in many anticipated space teleoperation applications.
NASA Astrophysics Data System (ADS)
Zhang, Hongqin; Tian, Xiangjun
2018-04-01
Ensemble-based data assimilation methods often use the so-called localization scheme to improve the representation of the ensemble background error covariance (Be). Extensive research has been undertaken to reduce the computational cost of these methods by using the localized ensemble samples to localize Be by means of a direct decomposition of the local correlation matrix C. However, the computational costs of the direct decomposition of the local correlation matrix C are still extremely high due to its high dimension. In this paper, we propose an efficient local correlation matrix decomposition approach based on the concept of alternating directions. This approach is intended to avoid direct decomposition of the correlation matrix. Instead, we first decompose the correlation matrix into 1-D correlation matrices in the three coordinate directions, then construct their empirical orthogonal function decomposition at low resolution. This procedure is followed by the 1-D spline interpolation process to transform the above decompositions to the high-resolution grid. Finally, an efficient correlation matrix decomposition is achieved by computing the very similar Kronecker product. We conducted a series of comparison experiments to illustrate the validity and accuracy of the proposed local correlation matrix decomposition approach. The effectiveness of the proposed correlation matrix decomposition approach and its efficient localization implementation of the nonlinear least-squares four-dimensional variational assimilation are further demonstrated by several groups of numerical experiments based on the Advanced Research Weather Research and Forecasting model.
Acceleration of FDTD mode solver by high-performance computing techniques.
Han, Lin; Xi, Yanping; Huang, Wei-Ping
2010-06-21
A two-dimensional (2D) compact finite-difference time-domain (FDTD) mode solver is developed based on wave equation formalism in combination with the matrix pencil method (MPM). The method is validated for calculation of both real guided and complex leaky modes of typical optical waveguides against the bench-mark finite-difference (FD) eigen mode solver. By taking advantage of the inherent parallel nature of the FDTD algorithm, the mode solver is implemented on graphics processing units (GPUs) using the compute unified device architecture (CUDA). It is demonstrated that the high-performance computing technique leads to significant acceleration of the FDTD mode solver with more than 30 times improvement in computational efficiency in comparison with the conventional FDTD mode solver running on CPU of a standard desktop computer. The computational efficiency of the accelerated FDTD method is in the same order of magnitude of the standard finite-difference eigen mode solver and yet require much less memory (e.g., less than 10%). Therefore, the new method may serve as an efficient, accurate and robust tool for mode calculation of optical waveguides even when the conventional eigen value mode solvers are no longer applicable due to memory limitation.
NASA Astrophysics Data System (ADS)
Schwörer, Magnus; Lorenzen, Konstantin; Mathias, Gerald; Tavan, Paul
2015-03-01
Recently, a novel approach to hybrid quantum mechanics/molecular mechanics (QM/MM) molecular dynamics (MD) simulations has been suggested [Schwörer et al., J. Chem. Phys. 138, 244103 (2013)]. Here, the forces acting on the atoms are calculated by grid-based density functional theory (DFT) for a solute molecule and by a polarizable molecular mechanics (PMM) force field for a large solvent environment composed of several 103-105 molecules as negative gradients of a DFT/PMM hybrid Hamiltonian. The electrostatic interactions are efficiently described by a hierarchical fast multipole method (FMM). Adopting recent progress of this FMM technique [Lorenzen et al., J. Chem. Theory Comput. 10, 3244 (2014)], which particularly entails a strictly linear scaling of the computational effort with the system size, and adapting this revised FMM approach to the computation of the interactions between the DFT and PMM fragments of a simulation system, here, we show how one can further enhance the efficiency and accuracy of such DFT/PMM-MD simulations. The resulting gain of total performance, as measured for alanine dipeptide (DFT) embedded in water (PMM) by the product of the gains in efficiency and accuracy, amounts to about one order of magnitude. We also demonstrate that the jointly parallelized implementation of the DFT and PMM-MD parts of the computation enables the efficient use of high-performance computing systems. The associated software is available online.
Computer-intensive simulation of solid-state NMR experiments using SIMPSON.
Tošner, Zdeněk; Andersen, Rasmus; Stevensson, Baltzar; Edén, Mattias; Nielsen, Niels Chr; Vosegaard, Thomas
2014-09-01
Conducting large-scale solid-state NMR simulations requires fast computer software potentially in combination with efficient computational resources to complete within a reasonable time frame. Such simulations may involve large spin systems, multiple-parameter fitting of experimental spectra, or multiple-pulse experiment design using parameter scan, non-linear optimization, or optimal control procedures. To efficiently accommodate such simulations, we here present an improved version of the widely distributed open-source SIMPSON NMR simulation software package adapted to contemporary high performance hardware setups. The software is optimized for fast performance on standard stand-alone computers, multi-core processors, and large clusters of identical nodes. We describe the novel features for fast computation including internal matrix manipulations, propagator setups and acquisition strategies. For efficient calculation of powder averages, we implemented interpolation method of Alderman, Solum, and Grant, as well as recently introduced fast Wigner transform interpolation technique. The potential of the optimal control toolbox is greatly enhanced by higher precision gradients in combination with the efficient optimization algorithm known as limited memory Broyden-Fletcher-Goldfarb-Shanno. In addition, advanced parallelization can be used in all types of calculations, providing significant time reductions. SIMPSON is thus reflecting current knowledge in the field of numerical simulations of solid-state NMR experiments. The efficiency and novel features are demonstrated on the representative simulations. Copyright © 2014 Elsevier Inc. All rights reserved.
Pinthong, Watthanai; Muangruen, Panya
2016-01-01
Development of high-throughput technologies, such as Next-generation sequencing, allows thousands of experiments to be performed simultaneously while reducing resource requirement. Consequently, a massive amount of experiment data is now rapidly generated. Nevertheless, the data are not readily usable or meaningful until they are further analysed and interpreted. Due to the size of the data, a high performance computer (HPC) is required for the analysis and interpretation. However, the HPC is expensive and difficult to access. Other means were developed to allow researchers to acquire the power of HPC without a need to purchase and maintain one such as cloud computing services and grid computing system. In this study, we implemented grid computing in a computer training center environment using Berkeley Open Infrastructure for Network Computing (BOINC) as a job distributor and data manager combining all desktop computers to virtualize the HPC. Fifty desktop computers were used for setting up a grid system during the off-hours. In order to test the performance of the grid system, we adapted the Basic Local Alignment Search Tools (BLAST) to the BOINC system. Sequencing results from Illumina platform were aligned to the human genome database by BLAST on the grid system. The result and processing time were compared to those from a single desktop computer and HPC. The estimated durations of BLAST analysis for 4 million sequence reads on a desktop PC, HPC and the grid system were 568, 24 and 5 days, respectively. Thus, the grid implementation of BLAST by BOINC is an efficient alternative to the HPC for sequence alignment. The grid implementation by BOINC also helped tap unused computing resources during the off-hours and could be easily modified for other available bioinformatics software. PMID:27547555
Haidar, Azzam; Jagode, Heike; Vaccaro, Phil; ...
2018-03-22
The emergence of power efficiency as a primary constraint in processor and system design poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, which may house petascale or exascale-level computing systems. At these extreme scales, understanding and improving the energy efficiency of numerical libraries and their related applications becomes a crucial part of the successful implementation and operation of the computing system. In this paper, we study and investigate the practice of controlling a compute system's power usage, and we explore howmore » different power caps affect the performance of numerical algorithms with different computational intensities. Further, we determine the impact, in terms of performance and energy usage, that these caps have on a system running scientific applications. This analysis will enable us to characterize the types of algorithms that benefit most from these power management schemes. Our experiments are performed using a set of representative kernels and several popular scientific benchmarks. Lastly, we quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap to achieving energy efficiency in the design and execution of scientific algorithms.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Haidar, Azzam; Jagode, Heike; Vaccaro, Phil
The emergence of power efficiency as a primary constraint in processor and system design poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, which may house petascale or exascale-level computing systems. At these extreme scales, understanding and improving the energy efficiency of numerical libraries and their related applications becomes a crucial part of the successful implementation and operation of the computing system. In this paper, we study and investigate the practice of controlling a compute system's power usage, and we explore howmore » different power caps affect the performance of numerical algorithms with different computational intensities. Further, we determine the impact, in terms of performance and energy usage, that these caps have on a system running scientific applications. This analysis will enable us to characterize the types of algorithms that benefit most from these power management schemes. Our experiments are performed using a set of representative kernels and several popular scientific benchmarks. Lastly, we quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap to achieving energy efficiency in the design and execution of scientific algorithms.« less
NASA Astrophysics Data System (ADS)
Joost, William J.
2012-09-01
Transportation accounts for approximately 28% of U.S. energy consumption with the majority of transportation energy derived from petroleum sources. Many technologies such as vehicle electrification, advanced combustion, and advanced fuels can reduce transportation energy consumption by improving the efficiency of cars and trucks. Lightweight materials are another important technology that can improve passenger vehicle fuel efficiency by 6-8% for each 10% reduction in weight while also making electric and alternative vehicles more competitive. Despite the opportunities for improved efficiency, widespread deployment of lightweight materials for automotive structures is hampered by technology gaps most often associated with performance, manufacturability, and cost. In this report, the impact of reduced vehicle weight on energy efficiency is discussed with a particular emphasis on quantitative relationships determined by several researchers. The most promising lightweight materials systems are described along with a brief review of the most significant technical barriers to their implementation. For each material system, the development of accurate material models is critical to support simulation-intensive processing and structural design for vehicles; improved models also contribute to an integrated computational materials engineering (ICME) approach for addressing technical barriers and accelerating deployment. The value of computational techniques is described by considering recent ICME and computational materials science success stories with an emphasis on applying problem-specific methods.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tian, Zhen, E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu; Folkerts, Michael; Tan, Jun
Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform tomore » solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is then used to validate the authors’ method. The authors also compare their multi-GPU implementation with three different single GPU implementation strategies, i.e., truncating DDC matrix (S1), repeatedly transferring DDC matrix between CPU and GPU (S2), and porting computations involving DDC matrix to CPU (S3), in terms of both plan quality and computational efficiency. Two more H and N patient cases and three prostate cases are used to demonstrate the advantages of the authors’ method. Results: The authors’ multi-GPU implementation can finish the optimization process within ∼1 min for the H and N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23–46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be in an order of several minutes. Conclusions: The results demonstrate that the multi-GPU implementation of the authors’ column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors’ study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.« less
Saturn S-2 Automatic Software System /SASS/
NASA Technical Reports Server (NTRS)
Parker, P. E.
1967-01-01
SATURN S-2 Automatic Software System /SASS/ was designed and implemented to aid SATURN S-2 program development and to increase the overall operating efficiency within the S-2 data laboratory. This program is written in FORTRAN 2 for SDS 920 computers.
Field testing of eco-speed control using V2I communication.
DOT National Transportation Integrated Search
2016-04-15
This research focused on the development of an Eco-Cooperative Adaptive Cruise Control (EcoCACC) : System and addressed the implementation issues associated with applying it in the field. : The Eco-CACC system computes and recommends a fuel-efficient...
A new framework for comprehensive, robust, and efficient global sensitivity analysis: 1. Theory
NASA Astrophysics Data System (ADS)
Razavi, Saman; Gupta, Hoshin V.
2016-01-01
Computer simulation models are continually growing in complexity with increasingly more factors to be identified. Sensitivity Analysis (SA) provides an essential means for understanding the role and importance of these factors in producing model responses. However, conventional approaches to SA suffer from (1) an ambiguous characterization of sensitivity, and (2) poor computational efficiency, particularly as the problem dimension grows. Here, we present a new and general sensitivity analysis framework (called VARS), based on an analogy to "variogram analysis," that provides an intuitive and comprehensive characterization of sensitivity across the full spectrum of scales in the factor space. We prove, theoretically, that Morris (derivative-based) and Sobol (variance-based) methods and their extensions are special cases of VARS, and that their SA indices can be computed as by-products of the VARS framework. Synthetic functions that resemble actual model response surfaces are used to illustrate the concepts, and show VARS to be as much as two orders of magnitude more computationally efficient than the state-of-the-art Sobol approach. In a companion paper, we propose a practical implementation strategy, and demonstrate the effectiveness, efficiency, and reliability (robustness) of the VARS framework on real-data case studies.
The symmetric MSD encoder for one-step adder of ternary optical computer
NASA Astrophysics Data System (ADS)
Kai, Song; LiPing, Yan
2016-08-01
The symmetric Modified Signed-Digit (MSD) encoding is important for achieving the one-step MSD adder of Ternary Optical Computer (TOC). The paper described the symmetric MSD encoding algorithm in detail, and developed its truth table which has nine rows and nine columns. According to the truth table, the state table was developed, and the optical-path structure and circuit-implementation scheme of the symmetric MSD encoder (SME) for one-step adder of TOC were proposed. Finally, a series of experiments were designed and performed. The observed results of the experiments showed that the scheme to implement SME was correct, feasible and efficient.
Quantum computation with classical light: Implementation of the Deutsch-Jozsa algorithm
NASA Astrophysics Data System (ADS)
Perez-Garcia, Benjamin; McLaren, Melanie; Goyal, Sandeep K.; Hernandez-Aranda, Raul I.; Forbes, Andrew; Konrad, Thomas
2016-05-01
We propose an optical implementation of the Deutsch-Jozsa Algorithm using classical light in a binary decision-tree scheme. Our approach uses a ring cavity and linear optical devices in order to efficiently query the oracle functional values. In addition, we take advantage of the intrinsic Fourier transforming properties of a lens to read out whether the function given by the oracle is balanced or constant.
An Analogue VLSI Implementation of the Meddis Inner Hair Cell Model
NASA Astrophysics Data System (ADS)
McEwan, Alistair; van Schaik, André
2003-12-01
The Meddis inner hair cell model is a widely accepted, but computationally intensive computer model of mammalian inner hair cell function. We have produced an analogue VLSI implementation of this model that operates in real time in the current domain by using translinear and log-domain circuits. The circuit has been fabricated on a chip and tested against the Meddis model for (a) rate level functions for onset and steady-state response, (b) recovery after masking, (c) additivity, (d) two-component adaptation, (e) phase locking, (f) recovery of spontaneous activity, and (g) computational efficiency. The advantage of this circuit, over other electronic inner hair cell models, is its nearly exact implementation of the Meddis model which can be tuned to behave similarly to the biological inner hair cell. This has important implications on our ability to simulate the auditory system in real time. Furthermore, the technique of mapping a mathematical model of first-order differential equations to a circuit of log-domain filters allows us to implement real-time neuromorphic signal processors for a host of models using the same approach.
High-Density Liquid-State Machine Circuitry for Time-Series Forecasting.
Rosselló, Josep L; Alomar, Miquel L; Morro, Antoni; Oliver, Antoni; Canals, Vincent
2016-08-01
Spiking neural networks (SNN) are the last neural network generation that try to mimic the real behavior of biological neurons. Although most research in this area is done through software applications, it is in hardware implementations in which the intrinsic parallelism of these computing systems are more efficiently exploited. Liquid state machines (LSM) have arisen as a strategic technique to implement recurrent designs of SNN with a simple learning methodology. In this work, we show a new low-cost methodology to implement high-density LSM by using Boolean gates. The proposed method is based on the use of probabilistic computing concepts to reduce hardware requirements, thus considerably increasing the neuron count per chip. The result is a highly functional system that is applied to high-speed time series forecasting.
NASA Astrophysics Data System (ADS)
Work, Paul R.
1991-12-01
This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high level and the detailed design specified in a high level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that it is possible to achieve an efficient parallel application for a serial computational electromagnetic program where the characteristics of the algorithm and the target architecture critically influence the development of such an implementation.
A Two-Stage Procedure Toward the Efficient Implementation of PANS and Other Hybrid Turbulence Models
NASA Technical Reports Server (NTRS)
Abdol-Hamid, Khaled S.; Girimaji, Sharath S.
2004-01-01
The main objective of this article is to introduce and to show the implementation of a novel two-stage procedure to efficiently estimate the level of scale resolution possible for a given flow on a given grid for Partial Averaged Navier-Stokes (PANS) and other hybrid models. It has been found that the prescribed scale resolution can play a major role in obtaining accurate flow solutions. The first step is to solve the unsteady or steady Reynolds Averaged Navier-Stokes (URANS/RANS) equations. From this preprocessing step, the turbulence length-scale field is obtained. This is then used to compute the characteristic length-scale ratio between the turbulence scale and the grid spacing. Based on this ratio, we can assess the finest scale resolution that a given grid for a given flow can support. Along with other additional criteria, we are able to analytically identify the appropriate hybrid solver resolution for different regions of the flow. This procedure removes the grid dependency issue that affects the results produced by different hybrid procedures in solving unsteady flows. The formulation, implementation methodology, and validation example are presented. We implemented this capability in a production Computational Fluid Dynamics (CFD) code, PAB3D, for the simulation of unsteady flows.
ELIPS: Toward a Sensor Fusion Processor on a Chip
NASA Technical Reports Server (NTRS)
Daud, Taher; Stoica, Adrian; Tyson, Thomas; Li, Wei-te; Fabunmi, James
1998-01-01
The paper presents the concept and initial tests from the hardware implementation of a low-power, high-speed reconfigurable sensor fusion processor. The Extended Logic Intelligent Processing System (ELIPS) processor is developed to seamlessly combine rule-based systems, fuzzy logic, and neural networks to achieve parallel fusion of sensor in compact low power VLSI. The first demonstration of the ELIPS concept targets interceptor functionality; other applications, mainly in robotics and autonomous systems are considered for the future. The main assumption behind ELIPS is that fuzzy, rule-based and neural forms of computation can serve as the main primitives of an "intelligent" processor. Thus, in the same way classic processors are designed to optimize the hardware implementation of a set of fundamental operations, ELIPS is developed as an efficient implementation of computational intelligence primitives, and relies on a set of fuzzy set, fuzzy inference and neural modules, built in programmable analog hardware. The hardware programmability allows the processor to reconfigure into different machines, taking the most efficient hardware implementation during each phase of information processing. Following software demonstrations on several interceptor data, three important ELIPS building blocks (a fuzzy set preprocessor, a rule-based fuzzy system and a neural network) have been fabricated in analog VLSI hardware and demonstrated microsecond-processing times.
A Stochastic Spiking Neural Network for Virtual Screening.
Morro, A; Canals, V; Oliver, A; Alomar, M L; Galan-Prado, F; Ballester, P J; Rossello, J L
2018-04-01
Virtual screening (VS) has become a key computational tool in early drug design and screening performance is of high relevance due to the large volume of data that must be processed to identify molecules with the sought activity-related pattern. At the same time, the hardware implementations of spiking neural networks (SNNs) arise as an emerging computing technique that can be applied to parallelize processes that normally present a high cost in terms of computing time and power. Consequently, SNN represents an attractive alternative to perform time-consuming processing tasks, such as VS. In this brief, we present a smart stochastic spiking neural architecture that implements the ultrafast shape recognition (USR) algorithm achieving two order of magnitude of speed improvement with respect to USR software implementations. The neural system is implemented in hardware using field-programmable gate arrays allowing a highly parallelized USR implementation. The results show that, due to the high parallelization of the system, millions of compounds can be checked in reasonable times. From these results, we can state that the proposed architecture arises as a feasible methodology to efficiently enhance time-consuming data-mining processes such as 3-D molecular similarity search.
Higher-order ice-sheet modelling accelerated by multigrid on graphics cards
NASA Astrophysics Data System (ADS)
Brædstrup, Christian; Egholm, David
2013-04-01
Higher-order ice flow modelling is a very computer intensive process owing primarily to the nonlinear influence of the horizontal stress coupling. When applied for simulating long-term glacial landscape evolution, the ice-sheet models must consider very long time series, while both high temporal and spatial resolution is needed to resolve small effects. The use of higher-order and full stokes models have therefore seen very limited usage in this field. However, recent advances in graphics card (GPU) technology for high performance computing have proven extremely efficient in accelerating many large-scale scientific computations. The general purpose GPU (GPGPU) technology is cheap, has a low power consumption and fits into a normal desktop computer. It could therefore provide a powerful tool for many glaciologists working on ice flow models. Our current research focuses on utilising the GPU as a tool in ice-sheet and glacier modelling. To this extent we have implemented the Integrated Second-Order Shallow Ice Approximation (iSOSIA) equations on the device using the finite difference method. To accelerate the computations, the GPU solver uses a non-linear Red-Black Gauss-Seidel iterator coupled with a Full Approximation Scheme (FAS) multigrid setup to further aid convergence. The GPU finite difference implementation provides the inherent parallelization that scales from hundreds to several thousands of cores on newer cards. We demonstrate the efficiency of the GPU multigrid solver using benchmark experiments.
NASA Astrophysics Data System (ADS)
Zhang, Ling; Nan, Zhuotong; Liang, Xu; Xu, Yi; Hernández, Felipe; Li, Lianxia
2018-03-01
Although process-based distributed hydrological models (PDHMs) are evolving rapidly over the last few decades, their extensive applications are still challenged by the computational expenses. This study attempted, for the first time, to apply the numerically efficient MacCormack algorithm to overland flow routing in a representative high-spatial resolution PDHM, i.e., the distributed hydrology-soil-vegetation model (DHSVM), in order to improve its computational efficiency. The analytical verification indicates that both the semi and full versions of the MacCormack schemes exhibit robust numerical stability and are more computationally efficient than the conventional explicit linear scheme. The full-version outperforms the semi-version in terms of simulation accuracy when a same time step is adopted. The semi-MacCormack scheme was implemented into DHSVM (version 3.1.2) to solve the kinematic wave equations for overland flow routing. The performance and practicality of the enhanced DHSVM-MacCormack model was assessed by performing two groups of modeling experiments in the Mercer Creek watershed, a small urban catchment near Bellevue, Washington. The experiments show that DHSVM-MacCormack can considerably improve the computational efficiency without compromising the simulation accuracy of the original DHSVM model. More specifically, with the same computational environment and model settings, the computational time required by DHSVM-MacCormack can be reduced to several dozen minutes for a simulation period of three months (in contrast with one day and a half by the original DHSVM model) without noticeable sacrifice of the accuracy. The MacCormack scheme proves to be applicable to overland flow routing in DHSVM, which implies that it can be coupled into other PHDMs for watershed routing to either significantly improve their computational efficiency or to make the kinematic wave routing for high resolution modeling computational feasible.
A programming language for composable DNA circuits
Phillips, Andrew; Cardelli, Luca
2009-01-01
Recently, a range of information-processing circuits have been implemented in DNA by using strand displacement as their main computational mechanism. Examples include digital logic circuits and catalytic signal amplification circuits that function as efficient molecular detectors. As new paradigms for DNA computation emerge, the development of corresponding languages and tools for these paradigms will help to facilitate the design of DNA circuits and their automatic compilation to nucleotide sequences. We present a programming language for designing and simulating DNA circuits in which strand displacement is the main computational mechanism. The language includes basic elements of sequence domains, toeholds and branch migration, and assumes that strands do not possess any secondary structure. The language is used to model and simulate a variety of circuits, including an entropy-driven catalytic gate, a simple gate motif for synthesizing large-scale circuits and a scheme for implementing an arbitrary system of chemical reactions. The language is a first step towards the design of modelling and simulation tools for DNA strand displacement, which complements the emergence of novel implementation strategies for DNA computing. PMID:19535415
Computational Burden Resulting from Image Recognition of High Resolution Radar Sensors
López-Rodríguez, Patricia; Fernández-Recio, Raúl; Bravo, Ignacio; Gardel, Alfredo; Lázaro, José L.; Rufo, Elena
2013-01-01
This paper presents a methodology for high resolution radar image generation and automatic target recognition emphasizing the computational cost involved in the process. In order to obtain focused inverse synthetic aperture radar (ISAR) images certain signal processing algorithms must be applied to the information sensed by the radar. From actual data collected by radar the stages and algorithms needed to obtain ISAR images are revised, including high resolution range profile generation, motion compensation and ISAR formation. Target recognition is achieved by comparing the generated set of actual ISAR images with a database of ISAR images generated by electromagnetic software. High resolution radar image generation and target recognition processes are burdensome and time consuming, so to determine the most suitable implementation platform the analysis of the computational complexity is of great interest. To this end and since target identification must be completed in real time, computational burden of both processes the generation and comparison with a database is explained separately. Conclusions are drawn about implementation platforms and calculation efficiency in order to reduce time consumption in a possible future implementation. PMID:23609804
Computational burden resulting from image recognition of high resolution radar sensors.
López-Rodríguez, Patricia; Fernández-Recio, Raúl; Bravo, Ignacio; Gardel, Alfredo; Lázaro, José L; Rufo, Elena
2013-04-22
This paper presents a methodology for high resolution radar image generation and automatic target recognition emphasizing the computational cost involved in the process. In order to obtain focused inverse synthetic aperture radar (ISAR) images certain signal processing algorithms must be applied to the information sensed by the radar. From actual data collected by radar the stages and algorithms needed to obtain ISAR images are revised, including high resolution range profile generation, motion compensation and ISAR formation. Target recognition is achieved by comparing the generated set of actual ISAR images with a database of ISAR images generated by electromagnetic software. High resolution radar image generation and target recognition processes are burdensome and time consuming, so to determine the most suitable implementation platform the analysis of the computational complexity is of great interest. To this end and since target identification must be completed in real time, computational burden of both processes the generation and comparison with a database is explained separately. Conclusions are drawn about implementation platforms and calculation efficiency in order to reduce time consumption in a possible future implementation.
A programming language for composable DNA circuits.
Phillips, Andrew; Cardelli, Luca
2009-08-06
Recently, a range of information-processing circuits have been implemented in DNA by using strand displacement as their main computational mechanism. Examples include digital logic circuits and catalytic signal amplification circuits that function as efficient molecular detectors. As new paradigms for DNA computation emerge, the development of corresponding languages and tools for these paradigms will help to facilitate the design of DNA circuits and their automatic compilation to nucleotide sequences. We present a programming language for designing and simulating DNA circuits in which strand displacement is the main computational mechanism. The language includes basic elements of sequence domains, toeholds and branch migration, and assumes that strands do not possess any secondary structure. The language is used to model and simulate a variety of circuits, including an entropy-driven catalytic gate, a simple gate motif for synthesizing large-scale circuits and a scheme for implementing an arbitrary system of chemical reactions. The language is a first step towards the design of modelling and simulation tools for DNA strand displacement, which complements the emergence of novel implementation strategies for DNA computing.
Further optimization of SeDDaRA blind image deconvolution algorithm and its DSP implementation
NASA Astrophysics Data System (ADS)
Wen, Bo; Zhang, Qiheng; Zhang, Jianlin
2011-11-01
Efficient algorithm for blind image deconvolution and its high-speed implementation is of great value in practice. Further optimization of SeDDaRA is developed, from algorithm structure to numerical calculation methods. The main optimization covers that, the structure's modularization for good implementation feasibility, reducing the data computation and dependency of 2D-FFT/IFFT, and acceleration of power operation by segmented look-up table. Then the Fast SeDDaRA is proposed and specialized for low complexity. As the final implementation, a hardware system of image restoration is conducted by using the multi-DSP parallel processing. Experimental results show that, the processing time and memory demand of Fast SeDDaRA decreases 50% at least; the data throughput of image restoration system is over 7.8Msps. The optimization is proved efficient and feasible, and the Fast SeDDaRA is able to support the real-time application.
Large-Scale Distributed Computational Fluid Dynamics on the Information Power Grid Using Globus
NASA Technical Reports Server (NTRS)
Barnard, Stephen; Biswas, Rupak; Saini, Subhash; VanderWijngaart, Robertus; Yarrow, Maurice; Zechtzer, Lou; Foster, Ian; Larsson, Olle
1999-01-01
This paper describes an experiment in which a large-scale scientific application development for tightly-coupled parallel machines is adapted to the distributed execution environment of the Information Power Grid (IPG). A brief overview of the IPG and a description of the computational fluid dynamics (CFD) algorithm are given. The Globus metacomputing toolkit is used as the enabling device for the geographically-distributed computation. Modifications related to latency hiding and Load balancing were required for an efficient implementation of the CFD application in the IPG environment. Performance results on a pair of SGI Origin 2000 machines indicate that real scientific applications can be effectively implemented on the IPG; however, a significant amount of continued effort is required to make such an environment useful and accessible to scientists and engineers.
Discrete Fourier Transform Analysis in a Complex Vector Space
NASA Technical Reports Server (NTRS)
Dean, Bruce H.
2009-01-01
Alternative computational strategies for the Discrete Fourier Transform (DFT) have been developed using analysis of geometric manifolds. This approach provides a general framework for performing DFT calculations, and suggests a more efficient implementation of the DFT for applications using iterative transform methods, particularly phase retrieval. The DFT can thus be implemented using fewer operations when compared to the usual DFT counterpart. The software decreases the run time of the DFT in certain applications such as phase retrieval that iteratively call the DFT function. The algorithm exploits a special computational approach based on analysis of the DFT as a transformation in a complex vector space. As such, this approach has the potential to realize a DFT computation that approaches N operations versus Nlog(N) operations for the equivalent Fast Fourier Transform (FFT) calculation.
Lattice surgery on the Raussendorf lattice
NASA Astrophysics Data System (ADS)
Herr, Daniel; Paler, Alexandru; Devitt, Simon J.; Nori, Franco
2018-07-01
Lattice surgery is a method to perform quantum computation fault-tolerantly by using operations on boundary qubits between different patches of the planar code. This technique allows for universal planar code computation without eliminating the intrinsic two-dimensional nearest-neighbor properties of the surface code that eases physical hardware implementations. Lattice surgery approaches to algorithmic compilation and optimization have been demonstrated to be more resource efficient for resource-intensive components of a fault-tolerant algorithm, and consequently may be preferable over braid-based logic. Lattice surgery can be extended to the Raussendorf lattice, providing a measurement-based approach to the surface code. In this paper we describe how lattice surgery can be performed on the Raussendorf lattice and therefore give a viable alternative to computation using braiding in measurement-based implementations of topological codes.
AnRAD: A Neuromorphic Anomaly Detection Framework for Massive Concurrent Data Streams.
Chen, Qiuwen; Luley, Ryan; Wu, Qing; Bishop, Morgan; Linderman, Richard W; Qiu, Qinru
2018-05-01
The evolution of high performance computing technologies has enabled the large-scale implementation of neuromorphic models and pushed the research in computational intelligence into a new era. Among the machine learning applications, unsupervised detection of anomalous streams is especially challenging due to the requirements of detection accuracy and real-time performance. Designing a computing framework that harnesses the growing computing power of the multicore systems while maintaining high sensitivity and specificity to the anomalies is an urgent research topic. In this paper, we propose anomaly recognition and detection (AnRAD), a bioinspired detection framework that performs probabilistic inferences. We analyze the feature dependency and develop a self-structuring method that learns an efficient confabulation network using unlabeled data. This network is capable of fast incremental learning, which continuously refines the knowledge base using streaming data. Compared with several existing anomaly detection approaches, our method provides competitive detection quality. Furthermore, we exploit the massive parallel structure of the AnRAD framework. Our implementations of the detection algorithm on the graphic processing unit and the Xeon Phi coprocessor both obtain substantial speedups over the sequential implementation on general-purpose microprocessor. The framework provides real-time service to concurrent data streams within diversified knowledge contexts, and can be applied to large problems with multiple local patterns. Experimental results demonstrate high computing performance and memory efficiency. For vehicle behavior detection, the framework is able to monitor up to 16000 vehicles (data streams) and their interactions in real time with a single commodity coprocessor, and uses less than 0.2 ms for one testing subject. Finally, the detection network is ported to our spiking neural network simulator to show the potential of adapting to the emerging neuromorphic architectures.
NASA Technical Reports Server (NTRS)
1991-01-01
The technical effort and computer code enhancements performed during the sixth year of the Probabilistic Structural Analysis Methods program are summarized. Various capabilities are described to probabilistically combine structural response and structural resistance to compute component reliability. A library of structural resistance models is implemented in the Numerical Evaluations of Stochastic Structures Under Stress (NESSUS) code that included fatigue, fracture, creep, multi-factor interaction, and other important effects. In addition, a user interface was developed for user-defined resistance models. An accurate and efficient reliability method was developed and was successfully implemented in the NESSUS code to compute component reliability based on user-selected response and resistance models. A risk module was developed to compute component risk with respect to cost, performance, or user-defined criteria. The new component risk assessment capabilities were validated and demonstrated using several examples. Various supporting methodologies were also developed in support of component risk assessment.
FPGA Implementation of Optimal 3D-Integer DCT Structure for Video Compression
2015-01-01
A novel optimal structure for implementing 3D-integer discrete cosine transform (DCT) is presented by analyzing various integer approximation methods. The integer set with reduced mean squared error (MSE) and high coding efficiency are considered for implementation in FPGA. The proposed method proves that the least resources are utilized for the integer set that has shorter bit values. Optimal 3D-integer DCT structure is determined by analyzing the MSE, power dissipation, coding efficiency, and hardware complexity of different integer sets. The experimental results reveal that direct method of computing the 3D-integer DCT using the integer set [10, 9, 6, 2, 3, 1, 1] performs better when compared to other integer sets in terms of resource utilization and power dissipation. PMID:26601120
An easily implemented static condensation method for structural sensitivity analysis
NASA Technical Reports Server (NTRS)
Gangadharan, S. N.; Haftka, R. T.; Nikolaidis, E.
1990-01-01
A black-box approach to static condensation for sensitivity analysis is presented with illustrative examples of a cube and a car structure. The sensitivity of the structural response with respect to joint stiffness parameter is calculated using the direct method, forward-difference, and central-difference schemes. The efficiency of the various methods for identifying joint stiffness parameters from measured static deflections of these structures is compared. The results indicate that the use of static condensation can reduce computation times significantly and the black-box approach is only slightly less efficient than the standard implementation of static condensation. The ease of implementation of the black-box approach recommends it for use with general-purpose finite element codes that do not have a built-in facility for static condensation.
Andrianov, Alexey; Szabo, Aron; Sergeev, Alexander; Kim, Arkady; Chvykov, Vladimir; Kalashnikov, Mikhail
2016-11-14
We developed an improved approach to calculate the Fourier transform of signals with arbitrary large quadratic phase which can be efficiently implemented in numerical simulations utilizing Fast Fourier transform. The proposed algorithm significantly reduces the computational cost of Fourier transform of a highly chirped and stretched pulse by splitting it into two separate transforms of almost transform limited pulses, thereby reducing the required grid size roughly by a factor of the pulse stretching. The application of our improved Fourier transform algorithm in the split-step method for numerical modeling of CPA and OPCPA shows excellent agreement with standard algorithms.
Sarment: Python modules for HMM analysis and partitioning of sequences.
Guéguen, Laurent
2005-08-15
Sarment is a package of Python modules for easy building and manipulation of sequence segmentations. It provides efficient implementation of usual algorithms for hidden Markov Model computation, as well as for maximal predictive partitioning. Owing to its very large variety of criteria for computing segmentations, Sarment can handle many kinds of models. Because of object-oriented programming, the results of the segmentation are very easy tomanipulate.
Characterizing and Implementing Efficient Primitives for Privacy-Preserving Computation
2015-07-01
the mobile device. From this, the mobile will detect any tampering from the malicious party by a discrepancy in these returned values, eliminating...the need for an output MAC. If no tampering is detected , the mobile device then decrypts the output of computation. APPROVED FOR PUBLIC RELEASE...useful error messages when the compiler detects a problem with an application, making debugging the application significantly easier than with other
Efficient, massively parallel eigenvalue computation
NASA Technical Reports Server (NTRS)
Huo, Yan; Schreiber, Robert
1993-01-01
In numerical simulations of disordered electronic systems, one of the most common approaches is to diagonalize random Hamiltonian matrices and to study the eigenvalues and eigenfunctions of a single electron in the presence of a random potential. An effort to implement a matrix diagonalization routine for real symmetric dense matrices on massively parallel SIMD computers, the Maspar MP-1 and MP-2 systems, is described. Results of numerical tests and timings are also presented.
NASA Astrophysics Data System (ADS)
Aviat, Félix; Lagardère, Louis; Piquemal, Jean-Philip
2017-10-01
In a recent paper [F. Aviat et al., J. Chem. Theory Comput. 13, 180-190 (2017)], we proposed the Truncated Conjugate Gradient (TCG) approach to compute the polarization energy and forces in polarizable molecular simulations. The method consists in truncating the conjugate gradient algorithm at a fixed predetermined order leading to a fixed computational cost and can thus be considered "non-iterative." This gives the possibility to derive analytical forces avoiding the usual energy conservation (i.e., drifts) issues occurring with iterative approaches. A key point concerns the evaluation of the analytical gradients, which is more complex than that with a usual solver. In this paper, after reviewing the present state of the art of polarization solvers, we detail a viable strategy for the efficient implementation of the TCG calculation. The complete cost of the approach is then measured as it is tested using a multi-time step scheme and compared to timings using usual iterative approaches. We show that the TCG methods are more efficient than traditional techniques, making it a method of choice for future long molecular dynamics simulations using polarizable force fields where energy conservation matters. We detail the various steps required for the implementation of the complete method by software developers.
Aviat, Félix; Lagardère, Louis; Piquemal, Jean-Philip
2017-10-28
In a recent paper [F. Aviat et al., J. Chem. Theory Comput. 13, 180-190 (2017)], we proposed the Truncated Conjugate Gradient (TCG) approach to compute the polarization energy and forces in polarizable molecular simulations. The method consists in truncating the conjugate gradient algorithm at a fixed predetermined order leading to a fixed computational cost and can thus be considered "non-iterative." This gives the possibility to derive analytical forces avoiding the usual energy conservation (i.e., drifts) issues occurring with iterative approaches. A key point concerns the evaluation of the analytical gradients, which is more complex than that with a usual solver. In this paper, after reviewing the present state of the art of polarization solvers, we detail a viable strategy for the efficient implementation of the TCG calculation. The complete cost of the approach is then measured as it is tested using a multi-time step scheme and compared to timings using usual iterative approaches. We show that the TCG methods are more efficient than traditional techniques, making it a method of choice for future long molecular dynamics simulations using polarizable force fields where energy conservation matters. We detail the various steps required for the implementation of the complete method by software developers.
Phase quality map based on local multi-unwrapped results for two-dimensional phase unwrapping.
Zhong, Heping; Tang, Jinsong; Zhang, Sen
2015-02-01
The efficiency of a phase unwrapping algorithm and the reliability of the corresponding unwrapped result are two key problems in reconstructing the digital elevation model of a scene from its interferometric synthetic aperture radar (InSAR) or interferometric synthetic aperture sonar (InSAS) data. In this paper, a new phase quality map is designed and implemented in a graphic processing unit (GPU) environment, which greatly accelerates the unwrapping process of the quality-guided algorithm and enhances the correctness of the unwrapped result. In a local wrapped phase window, the center point is selected as the reference point, and then two unwrapped results are computed by integrating in two different simple ways. After the two local unwrapped results are computed, the total difference of the two unwrapped results is regarded as the phase quality value of the center point. In order to accelerate the computing process of the new proposed quality map, we have implemented it in a GPU environment. The wrapped phase data are first uploaded to the memory of a device, and then the kernel function is called in the device to compute the phase quality in parallel by blocks of threads. Unwrapping tests performed on the simulated and real InSAS data confirm the accuracy and efficiency of the proposed method.
Continuing challenges for computer-based neuropsychological tests.
Letz, Richard
2003-08-01
A number of issues critical to the development of computer-based neuropsychological testing systems that remain continuing challenges to their widespread use in occupational and environmental health are reviewed. Several computer-based neuropsychological testing systems have been developed over the last 20 years, and they have contributed substantially to the study of neurologic effects of a number of environmental exposures. However, many are no longer supported and do not run on contemporary personal computer operating systems. Issues that are continuing challenges for development of computer-based neuropsychological tests in environmental and occupational health are discussed: (1) some current technological trends that generally make test development more difficult; (2) lack of availability of usable speech recognition of the type required for computer-based testing systems; (3) implementing computer-based procedures and tasks that are improvements over, not just adaptations of, their manually-administered predecessors; (4) implementing tests of a wider range of memory functions than the limited range now available; (5) paying more attention to motivational influences that affect the reliability and validity of computer-based measurements; and (6) increasing the usability of and audience for computer-based systems. Partial solutions to some of these challenges are offered. The challenges posed by current technological trends are substantial and generally beyond the control of testing system developers. Widespread acceptance of the "tablet PC" and implementation of accurate small vocabulary, discrete, speaker-independent speech recognition would enable revolutionary improvements to computer-based testing systems, particularly for testing memory functions not covered in existing systems. Dynamic, adaptive procedures, particularly ones based on item-response theory (IRT) and computerized-adaptive testing (CAT) methods, will be implemented in new tests that will be more efficient, reliable, and valid than existing test procedures. These additional developments, along with implementation of innovative reporting formats, are necessary for more widespread acceptance of the testing systems.
Chalmers, Eric; Luczak, Artur; Gruber, Aaron J.
2016-01-01
The mammalian brain is thought to use a version of Model-based Reinforcement Learning (MBRL) to guide “goal-directed” behavior, wherein animals consider goals and make plans to acquire desired outcomes. However, conventional MBRL algorithms do not fully explain animals' ability to rapidly adapt to environmental changes, or learn multiple complex tasks. They also require extensive computation, suggesting that goal-directed behavior is cognitively expensive. We propose here that key features of processing in the hippocampus support a flexible MBRL mechanism for spatial navigation that is computationally efficient and can adapt quickly to change. We investigate this idea by implementing a computational MBRL framework that incorporates features inspired by computational properties of the hippocampus: a hierarchical representation of space, “forward sweeps” through future spatial trajectories, and context-driven remapping of place cells. We find that a hierarchical abstraction of space greatly reduces the computational load (mental effort) required for adaptation to changing environmental conditions, and allows efficient scaling to large problems. It also allows abstract knowledge gained at high levels to guide adaptation to new obstacles. Moreover, a context-driven remapping mechanism allows learning and memory of multiple tasks. Simulating dorsal or ventral hippocampal lesions in our computational framework qualitatively reproduces behavioral deficits observed in rodents with analogous lesions. The framework may thus embody key features of how the brain organizes model-based RL to efficiently solve navigation and other difficult tasks. PMID:28018203
NASA Technical Reports Server (NTRS)
Kellner, A.
1987-01-01
Extremely large knowledge sources and efficient knowledge access characterizing future real-life artificial intelligence applications represent crucial requirements for on-board artificial intelligence systems due to obvious computer time and storage constraints on spacecraft. A type of knowledge representation and corresponding reasoning mechanism is proposed which is particularly suited for the efficient processing of such large knowledge bases in expert systems.
Implementation of a parallel unstructured Euler solver on the CM-5
NASA Technical Reports Server (NTRS)
Morano, Eric; Mavriplis, D. J.
1995-01-01
An efficient unstructured 3D Euler solver is parallelized on a Thinking Machine Corporation Connection Machine 5, distributed memory computer with vectoring capability. In this paper, the single instruction multiple data (SIMD) strategy is employed through the use of the CM Fortran language and the CMSSL scientific library. The performance of the CMSSL mesh partitioner is evaluated and the overall efficiency of the parallel flow solver is discussed.
NASA Astrophysics Data System (ADS)
Cai, Xiaohui; Liu, Yang; Ren, Zhiming
2018-06-01
Reverse-time migration (RTM) is a powerful tool for imaging geologically complex structures such as steep-dip and subsalt. However, its implementation is quite computationally expensive. Recently, as a low-cost solution, the graphic processing unit (GPU) was introduced to improve the efficiency of RTM. In the paper, we develop three ameliorative strategies to implement RTM on GPU card. First, given the high accuracy and efficiency of the adaptive optimal finite-difference (FD) method based on least squares (LS) on central processing unit (CPU), we study the optimal LS-based FD method on GPU. Second, we develop the CPU-based hybrid absorbing boundary condition (ABC) to the GPU-based one by addressing two issues of the former when introduced to GPU card: time-consuming and chaotic threads. Third, for large-scale data, the combinatorial strategy for optimal checkpointing and efficient boundary storage is introduced for the trade-off between memory and recomputation. To save the time of communication between host and disk, the portable operating system interface (POSIX) thread is utilized to create the other CPU core at the checkpoints. Applications of the three strategies on GPU with the compute unified device architecture (CUDA) programming language in RTM demonstrate their efficiency and validity.
Bahar, Ali Newaz; Waheed, Sajjad
2016-01-01
The fundamental logical element of a quantum-dot cellular automata (QCA) circuit is majority voter gate (MV). The efficiency of a QCA circuit is depends on the efficiency of the MV. This paper presents an efficient single layer five-input majority voter gate (MV5). The structure of proposed MV5 is very simple and easy to implement in any logical circuit. This proposed MV5 reduce number of cells and use conventional QCA cells. However, using MV5 a multilayer 1-bit full-adder (FA) is designed. The functional accuracy of the proposed MV5 and FA are confirmed by QCADesigner a well-known QCA layout design and verification tools. Furthermore, the power dissipation of proposed circuits are estimated, which shows that those circuits dissipate extremely small amount of energy and suitable for reversible computing. The simulation outcomes demonstrate the superiority of the proposed circuit.
omniClassifier: a Desktop Grid Computing System for Big Data Prediction Modeling
Phan, John H.; Kothari, Sonal; Wang, May D.
2016-01-01
Robust prediction models are important for numerous science, engineering, and biomedical applications. However, best-practice procedures for optimizing prediction models can be computationally complex, especially when choosing models from among hundreds or thousands of parameter choices. Computational complexity has further increased with the growth of data in these fields, concurrent with the era of “Big Data”. Grid computing is a potential solution to the computational challenges of Big Data. Desktop grid computing, which uses idle CPU cycles of commodity desktop machines, coupled with commercial cloud computing resources can enable research labs to gain easier and more cost effective access to vast computing resources. We have developed omniClassifier, a multi-purpose prediction modeling application that provides researchers with a tool for conducting machine learning research within the guidelines of recommended best-practices. omniClassifier is implemented as a desktop grid computing system using the Berkeley Open Infrastructure for Network Computing (BOINC) middleware. In addition to describing implementation details, we use various gene expression datasets to demonstrate the potential scalability of omniClassifier for efficient and robust Big Data prediction modeling. A prototype of omniClassifier can be accessed at http://omniclassifier.bme.gatech.edu/. PMID:27532062
NASA Technical Reports Server (NTRS)
Taylor, Arthur C., III
2004-01-01
This final report will document the accomplishments of the work of this project. 1) The incremental-iterative (II) form of the reverse-mode (adjoint) method for computing first-order (FO) aerodynamic sensitivity derivatives (SDs) has been successfully implemented and tested in a 2D CFD code (called ANSERS) using the reverse-mode capability of ADIFOR 3.0. These preceding results compared very well with similar SDS computed via a black-box (BB) application of the reverse-mode capability of ADIFOR 3.0, and also with similar SDs calculated via the method of finite differences. 2) Second-order (SO) SDs have been implemented in the 2D ASNWERS code using the very efficient strategy that was originally proposed (but not previously tested) of Reference 3, Appendix A. Furthermore, these SO SOs have been validated for accuracy and computational efficiency. 3) Studies were conducted in Quasi-1D and 2D concerning the smoothness (or lack of smoothness) of the FO and SO SD's for flows with shock waves. The phenomenon is documented in the publications of this study (listed subsequently), however, the specific numerical mechanism which is responsible for this unsmoothness phenomenon was not discovered. 4) The FO and SO derivatives for Quasi-1D and 2D flows were applied to predict aerodynamic design uncertainties, and were also applied in robust design optimization studies.
Bozkaya, Uğur
2018-03-15
Efficient implementations of analytic gradients for the orbital-optimized MP3 and MP2.5 and their standard versions with the density-fitting approximation, which are denoted as DF-MP3, DF-MP2.5, DF-OMP3, and DF-OMP2.5, are presented. The DF-MP3, DF-MP2.5, DF-OMP3, and DF-OMP2.5 methods are applied to a set of alkanes and noncovalent interaction complexes to compare the computational cost with the conventional MP3, MP2.5, OMP3, and OMP2.5. Our results demonstrate that density-fitted perturbation theory (DF-MP) methods considered substantially reduce the computational cost compared to conventional MP methods. The efficiency of our DF-MP methods arise from the reduced input/output (I/O) time and the acceleration of gradient related terms, such as computations of particle density and generalized Fock matrices (PDMs and GFM), solution of the Z-vector equation, back-transformations of PDMs and GFM, and evaluation of analytic gradients in the atomic orbital basis. Further, application results show that errors introduced by the DF approach are negligible. Mean absolute errors for bond lengths of a molecular set, with the cc-pCVQZ basis set, is 0.0001-0.0002 Å. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
Metalevel programming in robotics: Some issues
NASA Technical Reports Server (NTRS)
Kumarn, A.; Parameswaran, N.
1987-01-01
Computing in robotics has two important requirements: efficiency and flexibility. Algorithms for robot actions are implemented usually in procedural languages such as VAL and AL. But, since their excessive bindings create inflexible structures of computation, it is proposed that Logic Programming is a more suitable language for robot programming due to its non-determinism, declarative nature, and provision for metalevel programming. Logic Programming, however, results in inefficient computations. As a solution to this problem, researchers discuss a framework in which controls can be described to improve efficiency. They have divided controls into: (1) in-code and (2) metalevel and discussed them with reference to selection of rules and dataflow. Researchers illustrated the merit of Logic Programming by modelling the motion of a robot from one point to another avoiding obstacles.
Compiling probabilistic, bio-inspired circuits on a field programmable analog array
Marr, Bo; Hasler, Jennifer
2014-01-01
A field programmable analog array (FPAA) is presented as an energy and computational efficiency engine: a mixed mode processor for which functions can be compiled at significantly less energy costs using probabilistic computing circuits. More specifically, it will be shown that the core computation of any dynamical system can be computed on the FPAA at significantly less energy per operation than a digital implementation. A stochastic system that is dynamically controllable via voltage controlled amplifier and comparator thresholds is implemented, which computes Bernoulli random variables. From Bernoulli variables it is shown exponentially distributed random variables, and random variables of an arbitrary distribution can be computed. The Gillespie algorithm is simulated to show the utility of this system by calculating the trajectory of a biological system computed stochastically with this probabilistic hardware where over a 127X performance improvement over current software approaches is shown. The relevance of this approach is extended to any dynamical system. The initial circuits and ideas for this work were generated at the 2008 Telluride Neuromorphic Workshop. PMID:24847199
An Architecture for Cross-Cloud System Management
NASA Astrophysics Data System (ADS)
Dodda, Ravi Teja; Smith, Chris; van Moorsel, Aad
The emergence of the cloud computing paradigm promises flexibility and adaptability through on-demand provisioning of compute resources. As the utilization of cloud resources extends beyond a single provider, for business as well as technical reasons, the issue of effectively managing such resources comes to the fore. Different providers expose different interfaces to their compute resources utilizing varied architectures and implementation technologies. This heterogeneity poses a significant system management problem, and can limit the extent to which the benefits of cross-cloud resource utilization can be realized. We address this problem through the definition of an architecture to facilitate the management of compute resources from different cloud providers in an homogenous manner. This preserves the flexibility and adaptability promised by the cloud computing paradigm, whilst enabling the benefits of cross-cloud resource utilization to be realized. The practical efficacy of the architecture is demonstrated through an implementation utilizing compute resources managed through different interfaces on the Amazon Elastic Compute Cloud (EC2) service. Additionally, we provide empirical results highlighting the performance differential of these different interfaces, and discuss the impact of this performance differential on efficiency and profitability.
A parallel computational model for GATE simulations.
Rannou, F R; Vega-Acevedo, N; El Bitar, Z
2013-12-01
GATE/Geant4 Monte Carlo simulations are computationally demanding applications, requiring thousands of processor hours to produce realistic results. The classical strategy of distributing the simulation of individual events does not apply efficiently for Positron Emission Tomography (PET) experiments, because it requires a centralized coincidence processing and large communication overheads. We propose a parallel computational model for GATE that handles event generation and coincidence processing in a simple and efficient way by decentralizing event generation and processing but maintaining a centralized event and time coordinator. The model is implemented with the inclusion of a new set of factory classes that can run the same executable in sequential or parallel mode. A Mann-Whitney test shows that the output produced by this parallel model in terms of number of tallies is equivalent (but not equal) to its sequential counterpart. Computational performance evaluation shows that the software is scalable and well balanced. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Real-time depth processing for embedded platforms
NASA Astrophysics Data System (ADS)
Rahnama, Oscar; Makarov, Aleksej; Torr, Philip
2017-05-01
Obtaining depth information of a scene is an important requirement in many computer-vision and robotics applications. For embedded platforms, passive stereo systems have many advantages over their active counterparts (i.e. LiDAR, Infrared). They are power efficient, cheap, robust to lighting conditions and inherently synchronized to the RGB images of the scene. However, stereo depth estimation is a computationally expensive task that operates over large amounts of data. For embedded applications which are often constrained by power consumption, obtaining accurate results in real-time is a challenge. We demonstrate a computationally and memory efficient implementation of a stereo block-matching algorithm in FPGA. The computational core achieves a throughput of 577 fps at standard VGA resolution whilst consuming less than 3 Watts of power. The data is processed using an in-stream approach that minimizes memory-access bottlenecks and best matches the raster scan readout of modern digital image sensors.
Accelerating Full Configuration Interaction Calculations for Nuclear Structure
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Chao; Sternberg, Philip; Maris, Pieter
2008-04-14
One of the emerging computational approaches in nuclear physics is the full configuration interaction (FCI) method for solving the many-body nuclear Hamiltonian in a sufficiently large single-particle basis space to obtain exact answers - either directly or by extrapolation. The lowest eigenvalues and correspondingeigenvectors for very large, sparse and unstructured nuclear Hamiltonian matrices are obtained and used to evaluate additional experimental quantities. These matrices pose a significant challenge to the design and implementation of efficient and scalable algorithms for obtaining solutions on massively parallel computer systems. In this paper, we describe the computational strategies employed in a state-of-the-art FCI codemore » MFDn (Many Fermion Dynamics - nuclear) as well as techniques we recently developed to enhance the computational efficiency of MFDn. We will demonstrate the current capability of MFDn and report the latest performance improvement we have achieved. We will also outline our future research directions.« less
A 3D staggered-grid finite difference scheme for poroelastic wave equation
NASA Astrophysics Data System (ADS)
Zhang, Yijie; Gao, Jinghuai
2014-10-01
Three dimensional numerical modeling has been a viable tool for understanding wave propagation in real media. The poroelastic media can better describe the phenomena of hydrocarbon reservoirs than acoustic and elastic media. However, the numerical modeling in 3D poroelastic media demands significantly more computational capacity, including both computational time and memory. In this paper, we present a 3D poroelastic staggered-grid finite difference (SFD) scheme. During the procedure, parallel computing is implemented to reduce the computational time. Parallelization is based on domain decomposition, and communication between processors is performed using message passing interface (MPI). Parallel analysis shows that the parallelized SFD scheme significantly improves the simulation efficiency and 3D decomposition in domain is the most efficient. We also analyze the numerical dispersion and stability condition of the 3D poroelastic SFD method. Numerical results show that the 3D numerical simulation can provide a real description of wave propagation.
OpenFOAM: Open source CFD in research and industry
NASA Astrophysics Data System (ADS)
Jasak, Hrvoje
2009-12-01
The current focus of development in industrial Computational Fluid Dynamics (CFD) is integration of CFD into Computer-Aided product development, geometrical optimisation, robust design and similar. On the other hand, in CFD research aims to extend the boundaries ofpractical engineering use in "non-traditional " areas. Requirements of computational flexibility and code integration are contradictory: a change of coding paradigm, with object orientation, library components, equation mimicking is proposed as a way forward. This paper describes OpenFOAM, a C++ object oriented library for Computational Continuum Mechanics (CCM) developed by the author. Efficient and flexible implementation of complex physical models is achieved by mimicking the form ofpartial differential equation in software, with code functionality provided in library form. Open Source deployment and development model allows the user to achieve desired versatility in physical modeling without the sacrifice of complex geometry support and execution efficiency.
A GPU accelerated and error-controlled solver for the unbounded Poisson equation in three dimensions
NASA Astrophysics Data System (ADS)
Exl, Lukas
2017-12-01
An efficient solver for the three dimensional free-space Poisson equation is presented. The underlying numerical method is based on finite Fourier series approximation. While the error of all involved approximations can be fully controlled, the overall computation error is driven by the convergence of the finite Fourier series of the density. For smooth and fast-decaying densities the proposed method will be spectrally accurate. The method scales with O(N log N) operations, where N is the total number of discretization points in the Cartesian grid. The majority of the computational costs come from fast Fourier transforms (FFT), which makes it ideal for GPU computation. Several numerical computations on CPU and GPU validate the method and show efficiency and convergence behavior. Tests are performed using the Vienna Scientific Cluster 3 (VSC3). A free MATLAB implementation for CPU and GPU is provided to the interested community.
Human motion planning based on recursive dynamics and optimal control techniques
NASA Technical Reports Server (NTRS)
Lo, Janzen; Huang, Gang; Metaxas, Dimitris
2002-01-01
This paper presents an efficient optimal control and recursive dynamics-based computer animation system for simulating and controlling the motion of articulated figures. A quasi-Newton nonlinear programming technique (super-linear convergence) is implemented to solve minimum torque-based human motion-planning problems. The explicit analytical gradients needed in the dynamics are derived using a matrix exponential formulation and Lie algebra. Cubic spline functions are used to make the search space for an optimal solution finite. Based on our formulations, our method is well conditioned and robust, in addition to being computationally efficient. To better illustrate the efficiency of our method, we present results of natural looking and physically correct human motions for a variety of human motion tasks involving open and closed loop kinematic chains.
Computing the Baker-Campbell-Hausdorff series and the Zassenhaus product
NASA Astrophysics Data System (ADS)
Weyrauch, Michael; Scholz, Daniel
2009-09-01
The Baker-Campbell-Hausdorff (BCH) series and the Zassenhaus product are of fundamental importance for the theory of Lie groups and their applications in physics and physical chemistry. Standard methods for the explicit construction of the BCH and Zassenhaus terms yield polynomial representations, which must be translated into the usually required commutator representation. We prove that a new translation proposed recently yields a correct representation of the BCH and Zassenhaus terms. This representation entails fewer terms than the well-known Dynkin-Specht-Wever representation, which is of relevance for practical applications. Furthermore, various methods for the computation of the BCH and Zassenhaus terms are compared, and a new efficient approach for the calculation of the Zassenhaus terms is proposed. Mathematica implementations for the most efficient algorithms are provided together with comparisons of efficiency.
Biamonte, Jacob; Wittek, Peter; Pancotti, Nicola; Rebentrost, Patrick; Wiebe, Nathan; Lloyd, Seth
2017-09-13
Fuelled by increasing computer power and algorithmic advances, machine learning techniques have become powerful tools for finding patterns in data. Quantum systems produce atypical patterns that classical systems are thought not to produce efficiently, so it is reasonable to postulate that quantum computers may outperform classical computers on machine learning tasks. The field of quantum machine learning explores how to devise and implement quantum software that could enable machine learning that is faster than that of classical computers. Recent work has produced quantum algorithms that could act as the building blocks of machine learning programs, but the hardware and software challenges are still considerable.
NASA Astrophysics Data System (ADS)
Biamonte, Jacob; Wittek, Peter; Pancotti, Nicola; Rebentrost, Patrick; Wiebe, Nathan; Lloyd, Seth
2017-09-01
Fuelled by increasing computer power and algorithmic advances, machine learning techniques have become powerful tools for finding patterns in data. Quantum systems produce atypical patterns that classical systems are thought not to produce efficiently, so it is reasonable to postulate that quantum computers may outperform classical computers on machine learning tasks. The field of quantum machine learning explores how to devise and implement quantum software that could enable machine learning that is faster than that of classical computers. Recent work has produced quantum algorithms that could act as the building blocks of machine learning programs, but the hardware and software challenges are still considerable.
Parallel Implicit Runge-Kutta Methods Applied to Coupled Orbit/Attitude Propagation
NASA Astrophysics Data System (ADS)
Hatten, Noble; Russell, Ryan P.
2017-12-01
A variable-step Gauss-Legendre implicit Runge-Kutta (GLIRK) propagator is applied to coupled orbit/attitude propagation. Concepts previously shown to improve efficiency in 3DOF propagation are modified and extended to the 6DOF problem, including the use of variable-fidelity dynamics models. The impact of computing the stage dynamics of a single step in parallel is examined using up to 23 threads and 22 associated GLIRK stages; one thread is reserved for an extra dynamics function evaluation used in the estimation of the local truncation error. Efficiency is found to peak for typical examples when using approximately 8 to 12 stages for both serial and parallel implementations. Accuracy and efficiency compare favorably to explicit Runge-Kutta and linear-multistep solvers for representative scenarios. However, linear-multistep methods are found to be more efficient for some applications, particularly in a serial computing environment, or when parallelism can be applied across multiple trajectories.
A high-speed linear algebra library with automatic parallelism
NASA Technical Reports Server (NTRS)
Boucher, Michael L.
1994-01-01
Parallel or distributed processing is key to getting highest performance workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely small even though there are numerous computationally demanding programs that would significantly benefit from application of parallel processing. This paper describes DSSLIB, which is a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.
Parallelization and implementation of approximate root isolation for nonlinear system by Monte Carlo
NASA Astrophysics Data System (ADS)
Khosravi, Ebrahim
1998-12-01
This dissertation solves a fundamental problem of isolating the real roots of nonlinear systems of equations by Monte-Carlo that were published by Bush Jones. This algorithm requires only function values and can be applied readily to complicated systems of transcendental functions. The implementation of this sequential algorithm provides scientists with the means to utilize function analysis in mathematics or other fields of science. The algorithm, however, is so computationally intensive that the system is limited to a very small set of variables, and this will make it unfeasible for large systems of equations. Also a computational technique was needed for investigating a metrology of preventing the algorithm structure from converging to the same root along different paths of computation. The research provides techniques for improving the efficiency and correctness of the algorithm. The sequential algorithm for this technique was corrected and a parallel algorithm is presented. This parallel method has been formally analyzed and is compared with other known methods of root isolation. The effectiveness, efficiency, enhanced overall performance of the parallel processing of the program in comparison to sequential processing is discussed. The message passing model was used for this parallel processing, and it is presented and implemented on Intel/860 MIMD architecture. The parallel processing proposed in this research has been implemented in an ongoing high energy physics experiment: this algorithm has been used to track neutrinoes in a super K detector. This experiment is located in Japan, and data can be processed on-line or off-line locally or remotely.
White, Mary Jo; Stark, Jennifer R; Luckmann, Roger; Rosal, Milagros C; Clemow, Lynn; Costanza, Mary E
2006-06-01
Computer-assisted telephone interviewing (CATI) systems used by telephone counselors (TCs) may be efficient mechanisms to counsel patients on cancer and recommended preventive screening tests in order to extend a primary care provider's reach to his/her patients. The implementation process of such a system for promoting colorectal (CRC) cancer screening using a computer-assisted telephone interview (CATI) system is reported in this paper. The process evaluation assessed three components of the intervention: message production, program implementation and audience reception. Of 1181 potentially eligible patients, 1025 (87%) patients were reached by the TCs and 725 of those patients (71%) were eligible to receive counseling. Five hundred eighty-two (80%) patients agreed to counseling. It is feasible to design and use CATI systems for prevention counseling of patients in primary care practices. CATI systems have the potential of being used as a referral service by primary care providers and health care organizations for patient education.
DENA: A Configurable Microarchitecture and Design Flow for Biomedical DNA-Based Logic Design.
Beiki, Zohre; Jahanian, Ali
2017-10-01
DNA is known as the building block for storing the life codes and transferring the genetic features through the generations. However, it is found that DNA strands can be used for a new type of computation that opens fascinating horizons in computational medicine. Significant contributions are addressed on design of DNA-based logic gates for medical and computational applications but there are serious challenges for designing the medium and large-scale DNA circuits. In this paper, a new microarchitecture and corresponding design flow is proposed to facilitate the design of multistage large-scale DNA logic systems. Feasibility and efficiency of the proposed microarchitecture are evaluated by implementing a full adder and, then, its cascadability is determined by implementing a multistage 8-bit adder. Simulation results show the highlight features of the proposed design style and microarchitecture in terms of the scalability, implementation cost, and signal integrity of the DNA-based logic system compared to the traditional approaches.
spMC: an R-package for 3D lithological reconstructions based on spatial Markov chains
NASA Astrophysics Data System (ADS)
Sartore, Luca; Fabbri, Paolo; Gaetan, Carlo
2016-09-01
The paper presents the spatial Markov Chains (spMC) R-package and a case study of subsoil simulation/prediction located in a plain site of Northeastern Italy. spMC is a quite complete collection of advanced methods for data inspection, besides spMC implements Markov Chain models to estimate experimental transition probabilities of categorical lithological data. Furthermore, simulation methods based on most known prediction methods (as indicator Kriging and CoKriging) were implemented in spMC package. Moreover, other more advanced methods are available for simulations, e.g. path methods and Bayesian procedures, that exploit the maximum entropy. Since the spMC package was developed for intensive geostatistical computations, part of the code is implemented for parallel computations via the OpenMP constructs. A final analysis of this computational efficiency compares the simulation/prediction algorithms by using different numbers of CPU cores, and considering the example data set of the case study included in the package.
Implementing universal nonadiabatic holonomic quantum gates with transmons
NASA Astrophysics Data System (ADS)
Hong, Zhuo-Ping; Liu, Bao-Jie; Cai, Jia-Qi; Zhang, Xin-Ding; Hu, Yong; Wang, Z. D.; Xue, Zheng-Yuan
2018-02-01
Geometric phases are well known to be noise resilient in quantum evolutions and operations. Holonomic quantum gates provide us with a robust way towards universal quantum computation, as these quantum gates are actually induced by non-Abelian geometric phases. Here we propose and elaborate how to efficiently implement universal nonadiabatic holonomic quantum gates on simpler superconducting circuits, with a single transmon serving as a qubit. In our proposal, an arbitrary single-qubit holonomic gate can be realized in a single-loop scenario by varying the amplitudes and phase difference of two microwave fields resonantly coupled to a transmon, while nontrivial two-qubit holonomic gates may be generated with a transmission-line resonator being simultaneously coupled to the two target transmons in an effective resonant way. Moreover, our scenario may readily be scaled up to a two-dimensional lattice configuration, which is able to support large scalable quantum computation, paving the way for practically implementing universal nonadiabatic holonomic quantum computation with superconducting circuits.
Efficient integration method for fictitious domain approaches
NASA Astrophysics Data System (ADS)
Duczek, Sascha; Gabbert, Ulrich
2015-10-01
In the current article, we present an efficient and accurate numerical method for the integration of the system matrices in fictitious domain approaches such as the finite cell method (FCM). In the framework of the FCM, the physical domain is embedded in a geometrically larger domain of simple shape which is discretized using a regular Cartesian grid of cells. Therefore, a spacetree-based adaptive quadrature technique is normally deployed to resolve the geometry of the structure. Depending on the complexity of the structure under investigation this method accounts for most of the computational effort. To reduce the computational costs for computing the system matrices an efficient quadrature scheme based on the divergence theorem (Gauß-Ostrogradsky theorem) is proposed. Using this theorem the dimension of the integral is reduced by one, i.e. instead of solving the integral for the whole domain only its contour needs to be considered. In the current paper, we present the general principles of the integration method and its implementation. The results to several two-dimensional benchmark problems highlight its properties. The efficiency of the proposed method is compared to conventional spacetree-based integration techniques.
Reducing the computational footprint for real-time BCPNN learning
Vogginger, Bernhard; Schüffny, René; Lansner, Anders; Cederström, Love; Partzsch, Johannes; Höppner, Sebastian
2015-01-01
The implementation of synaptic plasticity in neural simulation or neuromorphic hardware is usually very resource-intensive, often requiring a compromise between efficiency and flexibility. A versatile, but computationally-expensive plasticity mechanism is provided by the Bayesian Confidence Propagation Neural Network (BCPNN) paradigm. Building upon Bayesian statistics, and having clear links to biological plasticity processes, the BCPNN learning rule has been applied in many fields, ranging from data classification, associative memory, reward-based learning, probabilistic inference to cortical attractor memory networks. In the spike-based version of this learning rule the pre-, postsynaptic and coincident activity is traced in three low-pass-filtering stages, requiring a total of eight state variables, whose dynamics are typically simulated with the fixed step size Euler method. We derive analytic solutions allowing an efficient event-driven implementation of this learning rule. Further speedup is achieved by first rewriting the model which reduces the number of basic arithmetic operations per update to one half, and second by using look-up tables for the frequently calculated exponential decay. Ultimately, in a typical use case, the simulation using our approach is more than one order of magnitude faster than with the fixed step size Euler method. Aiming for a small memory footprint per BCPNN synapse, we also evaluate the use of fixed-point numbers for the state variables, and assess the number of bits required to achieve same or better accuracy than with the conventional explicit Euler method. All of this will allow a real-time simulation of a reduced cortex model based on BCPNN in high performance computing. More important, with the analytic solution at hand and due to the reduced memory bandwidth, the learning rule can be efficiently implemented in dedicated or existing digital neuromorphic hardware. PMID:25657618
Specialized Computer Systems for Environment Visualization
NASA Astrophysics Data System (ADS)
Al-Oraiqat, Anas M.; Bashkov, Evgeniy A.; Zori, Sergii A.
2018-06-01
The need for real time image generation of landscapes arises in various fields as part of tasks solved by virtual and augmented reality systems, as well as geographic information systems. Such systems provide opportunities for collecting, storing, analyzing and graphically visualizing geographic data. Algorithmic and hardware software tools for increasing the realism and efficiency of the environment visualization in 3D visualization systems are proposed. This paper discusses a modified path tracing algorithm with a two-level hierarchy of bounding volumes and finding intersections with Axis-Aligned Bounding Box. The proposed algorithm eliminates the branching and hence makes the algorithm more suitable to be implemented on the multi-threaded CPU and GPU. A modified ROAM algorithm is used to solve the qualitative visualization of reliefs' problems and landscapes. The algorithm is implemented on parallel systems—cluster and Compute Unified Device Architecture-networks. Results show that the implementation on MPI clusters is more efficient than Graphics Processing Unit/Graphics Processing Clusters and allows real-time synthesis. The organization and algorithms of the parallel GPU system for the 3D pseudo stereo image/video synthesis are proposed. With realizing possibility analysis on a parallel GPU-architecture of each stage, 3D pseudo stereo synthesis is performed. An experimental prototype of a specialized hardware-software system 3D pseudo stereo imaging and video was developed on the CPU/GPU. The experimental results show that the proposed adaptation of 3D pseudo stereo imaging to the architecture of GPU-systems is efficient. Also it accelerates the computational procedures of 3D pseudo-stereo synthesis for the anaglyph and anamorphic formats of the 3D stereo frame without performing optimization procedures. The acceleration is on average 11 and 54 times for test GPUs.
Reducing the computational footprint for real-time BCPNN learning.
Vogginger, Bernhard; Schüffny, René; Lansner, Anders; Cederström, Love; Partzsch, Johannes; Höppner, Sebastian
2015-01-01
The implementation of synaptic plasticity in neural simulation or neuromorphic hardware is usually very resource-intensive, often requiring a compromise between efficiency and flexibility. A versatile, but computationally-expensive plasticity mechanism is provided by the Bayesian Confidence Propagation Neural Network (BCPNN) paradigm. Building upon Bayesian statistics, and having clear links to biological plasticity processes, the BCPNN learning rule has been applied in many fields, ranging from data classification, associative memory, reward-based learning, probabilistic inference to cortical attractor memory networks. In the spike-based version of this learning rule the pre-, postsynaptic and coincident activity is traced in three low-pass-filtering stages, requiring a total of eight state variables, whose dynamics are typically simulated with the fixed step size Euler method. We derive analytic solutions allowing an efficient event-driven implementation of this learning rule. Further speedup is achieved by first rewriting the model which reduces the number of basic arithmetic operations per update to one half, and second by using look-up tables for the frequently calculated exponential decay. Ultimately, in a typical use case, the simulation using our approach is more than one order of magnitude faster than with the fixed step size Euler method. Aiming for a small memory footprint per BCPNN synapse, we also evaluate the use of fixed-point numbers for the state variables, and assess the number of bits required to achieve same or better accuracy than with the conventional explicit Euler method. All of this will allow a real-time simulation of a reduced cortex model based on BCPNN in high performance computing. More important, with the analytic solution at hand and due to the reduced memory bandwidth, the learning rule can be efficiently implemented in dedicated or existing digital neuromorphic hardware.
ClimateSpark: An in-memory distributed computing framework for big climate data analytics
NASA Astrophysics Data System (ADS)
Hu, Fei; Yang, Chaowei; Schnase, John L.; Duffy, Daniel Q.; Xu, Mengchao; Bowen, Michael K.; Lee, Tsengdar; Song, Weiwei
2018-06-01
The unprecedented growth of climate data creates new opportunities for climate studies, and yet big climate data pose a grand challenge to climatologists to efficiently manage and analyze big data. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. Chunking data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal to facilitate the interaction among climatologists, climate data, analytic operations and computing resources (e.g., using SQL query and Scala/Python notebook). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multiple-dimensional, array-based datasets in various geoscience domains.
NASA Astrophysics Data System (ADS)
Xu, M.; van Overloop, P. J.; van de Giesen, N. C.
2011-02-01
Model predictive control (MPC) of open channel flow is becoming an important tool in water management. The complexity of the prediction model has a large influence on the MPC application in terms of control effectiveness and computational efficiency. The Saint-Venant equations, called SV model in this paper, and the Integrator Delay (ID) model are either accurate but computationally costly, or simple but restricted to allowed flow changes. In this paper, a reduced Saint-Venant (RSV) model is developed through a model reduction technique, Proper Orthogonal Decomposition (POD), on the SV equations. The RSV model keeps the main flow dynamics and functions over a large flow range but is easier to implement in MPC. In the test case of a modeled canal reach, the number of states and disturbances in the RSV model is about 45 and 16 times less than the SV model, respectively. The computational time of MPC with the RSV model is significantly reduced, while the controller remains effective. Thus, the RSV model is a promising means to balance the control effectiveness and computational efficiency.
Stamatakis, Alexandros; Ott, Michael
2008-12-27
The continuous accumulation of sequence data, for example, due to novel wet-laboratory techniques such as pyrosequencing, coupled with the increasing popularity of multi-gene phylogenies and emerging multi-core processor architectures that face problems of cache congestion, poses new challenges with respect to the efficient computation of the phylogenetic maximum-likelihood (ML) function. Here, we propose two approaches that can significantly speed up likelihood computations that typically represent over 95 per cent of the computational effort conducted by current ML or Bayesian inference programs. Initially, we present a method and an appropriate data structure to efficiently compute the likelihood score on 'gappy' multi-gene alignments. By 'gappy' we denote sampling-induced gaps owing to missing sequences in individual genes (partitions), i.e. not real alignment gaps. A first proof-of-concept implementation in RAXML indicates that this approach can accelerate inferences on large and gappy alignments by approximately one order of magnitude. Moreover, we present insights and initial performance results on multi-core architectures obtained during the transition from an OpenMP-based to a Pthreads-based fine-grained parallelization of the ML function.
Efficient computation of the joint sample frequency spectra for multiple populations.
Kamm, John A; Terhorst, Jonathan; Song, Yun S
2017-01-01
A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity.
Shulaker, Max M; Hills, Gage; Patil, Nishant; Wei, Hai; Chen, Hong-Yu; Wong, H-S Philip; Mitra, Subhasish
2013-09-26
The miniaturization of electronic devices has been the principal driving force behind the semiconductor industry, and has brought about major improvements in computational power and energy efficiency. Although advances with silicon-based electronics continue to be made, alternative technologies are being explored. Digital circuits based on transistors fabricated from carbon nanotubes (CNTs) have the potential to outperform silicon by improving the energy-delay product, a metric of energy efficiency, by more than an order of magnitude. Hence, CNTs are an exciting complement to existing semiconductor technologies. Owing to substantial fundamental imperfections inherent in CNTs, however, only very basic circuit blocks have been demonstrated. Here we show how these imperfections can be overcome, and demonstrate the first computer built entirely using CNT-based transistors. The CNT computer runs an operating system that is capable of multitasking: as a demonstration, we perform counting and integer-sorting simultaneously. In addition, we implement 20 different instructions from the commercial MIPS instruction set to demonstrate the generality of our CNT computer. This experimental demonstration is the most complex carbon-based electronic system yet realized. It is a considerable advance because CNTs are prominent among a variety of emerging technologies that are being considered for the next generation of highly energy-efficient electronic systems.
Efficient computation of the joint sample frequency spectra for multiple populations
Kamm, John A.; Terhorst, Jonathan; Song, Yun S.
2016-01-01
A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity. PMID:28239248
Quantum Monte Carlo with very large multideterminant wavefunctions.
Scemama, Anthony; Applencourt, Thomas; Giner, Emmanuel; Caffarel, Michel
2016-07-01
An algorithm to compute efficiently the first two derivatives of (very) large multideterminant wavefunctions for quantum Monte Carlo calculations is presented. The calculation of determinants and their derivatives is performed using the Sherman-Morrison formula for updating the inverse Slater matrix. An improved implementation based on the reduction of the number of column substitutions and on a very efficient implementation of the calculation of the scalar products involved is presented. It is emphasized that multideterminant expansions contain in general a large number of identical spin-specific determinants: for typical configuration interaction-type wavefunctions the number of unique spin-specific determinants Ndetσ ( σ=↑,↓) with a non-negligible weight in the expansion is of order O(Ndet). We show that a careful implementation of the calculation of the Ndet -dependent contributions can make this step negligible enough so that in practice the algorithm scales as the total number of unique spin-specific determinants, Ndet↑+Ndet↓, over a wide range of total number of determinants (here, Ndet up to about one million), thus greatly reducing the total computational cost. Finally, a new truncation scheme for the multideterminant expansion is proposed so that larger expansions can be considered without increasing the computational time. The algorithm is illustrated with all-electron fixed-node diffusion Monte Carlo calculations of the total energy of the chlorine atom. Calculations using a trial wavefunction including about 750,000 determinants with a computational increase of ∼400 compared to a single-determinant calculation are shown to be feasible. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
FPGA accelerator for protein secondary structure prediction based on the GOR algorithm
2011-01-01
Background Protein is an important molecule that performs a wide range of functions in biological systems. Recently, the protein folding attracts much more attention since the function of protein can be generally derived from its molecular structure. The GOR algorithm is one of the most successful computational methods and has been widely used as an efficient analysis tool to predict secondary structure from protein sequence. However, the execution time is still intolerable with the steep growth in protein database. Recently, FPGA chips have emerged as one promising application accelerator to accelerate bioinformatics algorithms by exploiting fine-grained custom design. Results In this paper, we propose a complete fine-grained parallel hardware implementation on FPGA to accelerate the GOR-IV package for 2D protein structure prediction. To improve computing efficiency, we partition the parameter table into small segments and access them in parallel. We aggressively exploit data reuse schemes to minimize the need for loading data from external memory. The whole computation structure is carefully pipelined to overlap the sequence loading, computing and back-writing operations as much as possible. We implemented a complete GOR desktop system based on an FPGA chip XC5VLX330. Conclusions The experimental results show a speedup factor of more than 430x over the original GOR-IV version and 110x speedup over the optimized version with multi-thread SIMD implementation running on a PC platform with AMD Phenom 9650 Quad CPU for 2D protein structure prediction. However, the power consumption is only about 30% of that of current general-propose CPUs. PMID:21342582
Quantum Computation: Entangling with the Future
NASA Technical Reports Server (NTRS)
Jiang, Zhang
2017-01-01
Commercial applications of quantum computation have become viable due to the rapid progress of the field in the recent years. Efficient quantum algorithms are discovered to cope with the most challenging real-world problems that are too hard for classical computers. Manufactured quantum hardware has reached unprecedented precision and controllability, enabling fault-tolerant quantum computation. Here, I give a brief introduction on what principles in quantum mechanics promise its unparalleled computational power. I will discuss several important quantum algorithms that achieve exponential or polynomial speedup over any classical algorithm. Building a quantum computer is a daunting task, and I will talk about the criteria and various implementations of quantum computers. I conclude the talk with near-future commercial applications of a quantum computer.
Fast Multipole Methods for Three-Dimensional N-body Problems
NASA Technical Reports Server (NTRS)
Koumoutsakos, P.
1995-01-01
We are developing computational tools for the simulations of three-dimensional flows past bodies undergoing arbitrary motions. High resolution viscous vortex methods have been developed that allow for extended simulations of two-dimensional configurations such as vortex generators. Our objective is to extend this methodology to three dimensions and develop a robust computational scheme for the simulation of such flows. A fundamental issue in the use of vortex methods is the ability of employing efficiently large numbers of computational elements to resolve the large range of scales that exist in complex flows. The traditional cost of the method scales as Omicron (N(sup 2)) as the N computational elements/particles induce velocities at each other, making the method unacceptable for simulations involving more than a few tens of thousands of particles. In the last decade fast methods have been developed that have operation counts of Omicron (N log N) or Omicron (N) (referred to as BH and GR respectively) depending on the details of the algorithm. These methods are based on the observation that the effect of a cluster of particles at a certain distance may be approximated by a finite series expansion. In order to exploit this observation we need to decompose the element population spatially into clusters of particles and build a hierarchy of clusters (a tree data structure) - smaller neighboring clusters combine to form a cluster of the next size up in the hierarchy and so on. This hierarchy of clusters allows one to determine efficiently when the approximation is valid. This algorithm is an N-body solver that appears in many fields of engineering and science. Some examples of its diverse use are in astrophysics, molecular dynamics, micro-magnetics, boundary element simulations of electromagnetic problems, and computer animation. More recently these N-body solvers have been implemented and applied in simulations involving vortex methods. Koumoutsakos and Leonard (1995) implemented the GR scheme in two dimensions for vector computer architectures allowing for simulations of bluff body flows using millions of particles. Winckelmans presented three-dimensional, viscous simulations of interacting vortex rings, using vortons and an implementation of a BH scheme for parallel computer architectures. Bhatt presented a vortex filament method to perform inviscid vortex ring interactions, with an alternative implementation of a BH scheme for a Connection Machine parallel computer architecture.
Computer-aided acquisition and logistics support (CALS): Concept of Operations for Depot Maintenance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bourgeois, N.C.; Greer, D.K.
1993-04-01
This CALS Concept of Operations for Depot Maintenance provides the foundation strategy and the near term tactical plan for CALS implementation in the depot maintenance environment. The user requirements enumerated and the overarching architecture outlined serve as the primary framework for implementation planning. The seamless integration of depot maintenance business processes and supporting information systems with the emerging global CALS environment will be critical to the efficient realization of depot user's information requirements, and as, such will be a fundamental theme in depot implementations.
Fast High Resolution Volume Carving for 3D Plant Shoot Reconstruction
Scharr, Hanno; Briese, Christoph; Embgenbroich, Patrick; Fischbach, Andreas; Fiorani, Fabio; Müller-Linow, Mark
2017-01-01
Volume carving is a well established method for visual hull reconstruction and has been successfully applied in plant phenotyping, especially for 3d reconstruction of small plants and seeds. When imaging larger plants at still relatively high spatial resolution (≤1 mm), well known implementations become slow or have prohibitively large memory needs. Here we present and evaluate a computationally efficient algorithm for volume carving, allowing e.g., 3D reconstruction of plant shoots. It combines a well-known multi-grid representation called “Octree” with an efficient image region integration scheme called “Integral image.” Speedup with respect to less efficient octree implementations is about 2 orders of magnitude, due to the introduced refinement strategy “Mark and refine.” Speedup is about a factor 1.6 compared to a highly optimized GPU implementation using equidistant voxel grids, even without using any parallelization. We demonstrate the application of this method for trait derivation of banana and maize plants. PMID:29033961
Linear static structural and vibration analysis on high-performance computers
NASA Technical Reports Server (NTRS)
Baddourah, M. A.; Storaasli, O. O.; Bostic, S. W.
1993-01-01
Parallel computers offer the oppurtunity to significantly reduce the computation time necessary to analyze large-scale aerospace structures. This paper presents algorithms developed for and implemented on massively-parallel computers hereafter referred to as Scalable High-Performance Computers (SHPC), for the most computationally intensive tasks involved in structural analysis, namely, generation and assembly of system matrices, solution of systems of equations and calculation of the eigenvalues and eigenvectors. Results on SHPC are presented for large-scale structural problems (i.e. models for High-Speed Civil Transport). The goal of this research is to develop a new, efficient technique which extends structural analysis to SHPC and makes large-scale structural analyses tractable.
Extending the BEAGLE library to a multi-FPGA platform.
Jin, Zheming; Bakos, Jason D
2013-01-19
Maximum Likelihood (ML)-based phylogenetic inference using Felsenstein's pruning algorithm is a standard method for estimating the evolutionary relationships amongst a set of species based on DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel for these tools. These computations are data intensive but contain fine grain parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein's pruning algorithm for various data parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)-based platform called the Convey HC-1. The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floating-point operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be calculated as a function of the host platform's peak memory bandwidth and the implementation's memory efficiency, as 2.03 × peak bandwidth × memory efficiency. Our FPGA-based platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, which gives an average throughput of 78 Gflops. This represents a ~40X speedup when compared with BEAGLE's CPU implementation on a dual Xeon 5520 and 3X speedup versus BEAGLE's GPU implementation on a Tesla T10 GPU for very large data sizes. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt. The use of data parallel architectures to achieve high performance for likelihood-based phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool to generate a floating-point pipeline with resource and throughput constraints to match the target platform. We have found that using low-latency floating-point operators can significantly reduce FPGA area and still meet timing requirement on the target platform. We found that this design methodology can achieve performance that exceeds that of a GPU-based coprocessor.
Improved treatment of exact exchange in Quantum ESPRESSO
Barnes, Taylor A.; Kurth, Thorsten; Carrier, Pierre; ...
2017-01-18
Here, we present an algorithm and implementation for the parallel computation of exact exchange in Quantum ESPRESSO (QE) that exhibits greatly improved strong scaling. QE is an open-source software package for electronic structure calculations using plane wave density functional theory, and supports the use of local, semi-local, and hybrid DFT functionals. Wider application of hybrid functionals is desirable for the improved simulation of electronic band energy alignments and thermodynamic properties, but the computational complexity of evaluating the exact exchange potential limits the practical application of hybrid functionals to large systems and requires efficient implementations. We demonstrate that existing implementations ofmore » hybrid DFT that utilize a single data structure for both the local and exact exchange regions of the code are significantly limited in the degree of parallelization achievable. We present a band-pair parallelization approach, in which the calculation of exact exchange is parallelized and evaluated independently from the parallelization of the remainder of the calculation, with the wavefunction data being efficiently transformed on-the-fly into a form that is optimal for each part of the calculation. For a 64 water molecule supercell, our new algorithm reduces the overall time to solution by nearly an order of magnitude.« less
Shi, Yang; Fan, Hongfei; Xiong, Guoyue
2015-01-01
With the rapid development of cloud computing techniques, it is attractive for personal health record (PHR) service providers to deploy their PHR applications and store the personal health data in the cloud. However, there could be a serious privacy leakage if the cloud-based system is intruded by attackers, which makes it necessary for the PHR service provider to encrypt all patients' health data on cloud servers. Existing techniques are insufficiently secure under circumstances where advanced threats are considered, or being inefficient when many recipients are involved. Therefore, the objectives of our solution are (1) providing a secure implementation of re-encryption in white-box attack contexts and (2) assuring the efficiency of the implementation even in multi-recipient cases. We designed the multi-recipient re-encryption functionality by randomness-reusing and protecting the implementation by obfuscation. The proposed solution is secure even in white-box attack contexts. Furthermore, a comparison with other related work shows that the computational cost of the proposed solution is lower. The proposed technique can serve as a building block for supporting secure, efficient and privacy-preserving personal health record service systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barnes, Taylor A.; Kurth, Thorsten; Carrier, Pierre
Here, we present an algorithm and implementation for the parallel computation of exact exchange in Quantum ESPRESSO (QE) that exhibits greatly improved strong scaling. QE is an open-source software package for electronic structure calculations using plane wave density functional theory, and supports the use of local, semi-local, and hybrid DFT functionals. Wider application of hybrid functionals is desirable for the improved simulation of electronic band energy alignments and thermodynamic properties, but the computational complexity of evaluating the exact exchange potential limits the practical application of hybrid functionals to large systems and requires efficient implementations. We demonstrate that existing implementations ofmore » hybrid DFT that utilize a single data structure for both the local and exact exchange regions of the code are significantly limited in the degree of parallelization achievable. We present a band-pair parallelization approach, in which the calculation of exact exchange is parallelized and evaluated independently from the parallelization of the remainder of the calculation, with the wavefunction data being efficiently transformed on-the-fly into a form that is optimal for each part of the calculation. For a 64 water molecule supercell, our new algorithm reduces the overall time to solution by nearly an order of magnitude.« less
Accelerating image recognition on mobile devices using GPGPU
NASA Astrophysics Data System (ADS)
Bordallo López, Miguel; Nykänen, Henri; Hannuksela, Jari; Silvén, Olli; Vehviläinen, Markku
2011-01-01
The future multi-modal user interfaces of battery-powered mobile devices are expected to require computationally costly image analysis techniques. The use of Graphic Processing Units for computing is very well suited for parallel processing and the addition of programmable stages and high precision arithmetic provide for opportunities to implement energy-efficient complete algorithms. At the moment the first mobile graphics accelerators with programmable pipelines are available, enabling the GPGPU implementation of several image processing algorithms. In this context, we consider a face tracking approach that uses efficient gray-scale invariant texture features and boosting. The solution is based on the Local Binary Pattern (LBP) features and makes use of the GPU on the pre-processing and feature extraction phase. We have implemented a series of image processing techniques in the shader language of OpenGL ES 2.0, compiled them for a mobile graphics processing unit and performed tests on a mobile application processor platform (OMAP3530). In our contribution, we describe the challenges of designing on a mobile platform, present the performance achieved and provide measurement results for the actual power consumption in comparison to using the CPU (ARM) on the same platform.
NASA Astrophysics Data System (ADS)
Bonano, Manuela; Buonanno, Sabatino; Ojha, Chandrakanta; Berardino, Paolo; Lanari, Riccardo; Zeni, Giovanni; Manunta, Michele
2017-04-01
The advanced DInSAR technique referred to as Small BAseline Subset (SBAS) algorithm has already largely demonstrated its effectiveness to carry out multi-scale and multi-platform surface deformation analyses relevant to both natural and man-made hazards. Thanks to its capability to generate displacement maps and long-term deformation time series at both regional (low resolution analysis) and local (full resolution analysis) spatial scales, it allows to get more insights on the spatial and temporal patterns of localized displacements relevant to single buildings and infrastructures over extended urban areas, with a key role in supporting risk mitigation and preservation activities. The extensive application of the multi-scale SBAS-DInSAR approach in many scientific contexts has gone hand in hand with new SAR satellite mission development, characterized by different frequency bands, spatial resolution, revisit times and ground coverage. This brought to the generation of huge DInSAR data stacks to be efficiently handled, processed and archived, with a strong impact on both the data storage and the computational requirements needed for generating the full resolution SBAS-DInSAR results. Accordingly, innovative and effective solutions for the automatic processing of massive SAR data archives and for the operational management of the derived SBAS-DInSAR products need to be designed and implemented, by exploiting the high efficiency (in terms of portability, scalability and computing performances) of the new ICT methodologies. In this work, we present a novel parallel implementation of the full resolution SBAS-DInSAR processing chain, aimed at investigating localized displacements affecting single buildings and infrastructures relevant to very large urban areas, relying on different granularity level parallelization strategies. The image granularity level is applied in most steps of the SBAS-DInSAR processing chain and exploits the multiprocessor systems with distributed memory. Moreover, in some processing steps very heavy from the computational point of view, the Graphical Processing Units (GPU) are exploited for the processing of blocks working on a pixel-by-pixel basis, requiring strong modifications on some key parts of the sequential full resolution SBAS-DInSAR processing chain. GPU processing is implemented by efficiently exploiting parallel processing architectures (as CUDA) for increasing the computing performances, in terms of optimization of the available GPU memory, as well as reduction of the Input/Output operations on the GPU and of the whole processing time for specific blocks w.r.t. the corresponding sequential implementation, particularly critical in presence of huge DInSAR datasets. Moreover, to efficiently handle the massive amount of DInSAR measurements provided by the new generation SAR constellations (CSK and Sentinel-1), we perform a proper re-design strategy aimed at the robust assimilation of the full resolution SBAS-DInSAR results into the web-based Geonode platform of the Spatial Data Infrastructure, thus allowing the efficient management, analysis and integration of the interferometric results with different data sources.
Synchronization of autonomous objects in discrete event simulation
NASA Technical Reports Server (NTRS)
Rogers, Ralph V.
1990-01-01
Autonomous objects in event-driven discrete event simulation offer the potential to combine the freedom of unrestricted movement and positional accuracy through Euclidean space of time-driven models with the computational efficiency of event-driven simulation. The principal challenge to autonomous object implementation is object synchronization. The concept of a spatial blackboard is offered as a potential methodology for synchronization. The issues facing implementation of a spatial blackboard are outlined and discussed.
Implementation and evaluation of various demons deformable image registration algorithms on a GPU.
Gu, Xuejun; Pan, Hubert; Liang, Yun; Castillo, Richard; Yang, Deshan; Choi, Dongju; Castillo, Edward; Majumdar, Amitava; Guerrero, Thomas; Jiang, Steve B
2010-01-07
Online adaptive radiation therapy (ART) promises the ability to deliver an optimal treatment in response to daily patient anatomic variation. A major technical barrier for the clinical implementation of online ART is the requirement of rapid image segmentation. Deformable image registration (DIR) has been used as an automated segmentation method to transfer tumor/organ contours from the planning image to daily images. However, the current computational time of DIR is insufficient for online ART. In this work, this issue is addressed by using computer graphics processing units (GPUs). A gray-scale-based DIR algorithm called demons and five of its variants were implemented on GPUs using the compute unified device architecture (CUDA) programming environment. The spatial accuracy of these algorithms was evaluated over five sets of pulmonary 4D CT images with an average size of 256 x 256 x 100 and more than 1100 expert-determined landmark point pairs each. For all the testing scenarios presented in this paper, the GPU-based DIR computation required around 7 to 11 s to yield an average 3D error ranging from 1.5 to 1.8 mm. It is interesting to find out that the original passive force demons algorithms outperform subsequently proposed variants based on the combination of accuracy, efficiency and ease of implementation.
A shared position/force control methodology for teleoperation
NASA Technical Reports Server (NTRS)
Lee, Jin S.
1987-01-01
A flexible and computationally efficient shared position/force control concept and its implementation in the Robot Control C Library (RCCL) are presented form the point of teleoperation. This methodology enables certain degrees of freedom to be position-controlled through real time manual inputs and the remaining degrees of freedom to be force-controlled by computer. Functionally, it is a hybrid control scheme in that certain degrees of freedom are designated to be under position control, and the remaining degrees of freedom to be under force control. However, the methodology is also a shared control scheme because some degrees of freedom can be put under manual control and the other degrees of freedom put under computer control. Unlike other hybrid control schemes, which process position and force commands independently, this scheme provides a force control loop built on top of a position control inner loop. This feature minimizes the computational burden and increases disturbance rejection. A simple implementation is achieved partly because the joint control servos that are part of most robots can be used to provide the position control inner loop. Along with this control scheme, several menus were implemented for the convenience of the user. The implemented control scheme was successfully demonstrated for the tasks of hinged-panel opening and peg-in-hole insertion.
Parallelization of Nullspace Algorithm for the computation of metabolic pathways
Jevremović, Dimitrije; Trinh, Cong T.; Srienc, Friedrich; Sosa, Carlos P.; Boley, Daniel
2011-01-01
Elementary mode analysis is a useful metabolic pathway analysis tool in understanding and analyzing cellular metabolism, since elementary modes can represent metabolic pathways with unique and minimal sets of enzyme-catalyzed reactions of a metabolic network under steady state conditions. However, computation of the elementary modes of a genome- scale metabolic network with 100–1000 reactions is very expensive and sometimes not feasible with the commonly used serial Nullspace Algorithm. In this work, we develop a distributed memory parallelization of the Nullspace Algorithm to handle efficiently the computation of the elementary modes of a large metabolic network. We give an implementation in C++ language with the support of MPI library functions for the parallel communication. Our proposed algorithm is accompanied with an analysis of the complexity and identification of major bottlenecks during computation of all possible pathways of a large metabolic network. The algorithm includes methods to achieve load balancing among the compute-nodes and specific communication patterns to reduce the communication overhead and improve efficiency. PMID:22058581
USDA-ARS?s Scientific Manuscript database
Simulation modelers increasingly require greater flexibility for model implementation on diverse operating systems, and they demand high computational speed for efficient iterative simulations. Additionally, model users may differ in preference for proprietary versus open-source software environment...
Smart Networking Decisions: A Kase Study.
ERIC Educational Resources Information Center
Sturgeon, Julie
1999-01-01
Describes one decision-making approach for quickly implementing a communications network into a school district. The use of volunteer labor for wiring installation, computer selection focusing on standardization to aid in troubleshooting, and an intranet system to achieve efficiency and learning opportunities for teachers and administrative…
User participation in the development of the human/computer interface for control centers
NASA Technical Reports Server (NTRS)
Broome, Richard; Quick-Campbell, Marlene; Creegan, James; Dutilly, Robert
1996-01-01
Technological advances coupled with the requirements to reduce operations staffing costs led to the demand for efficient, technologically-sophisticated mission operations control centers. The control center under development for the earth observing system (EOS) is considered. The users are involved in the development of a control center in order to ensure that it is cost-efficient and flexible. A number of measures were implemented in the EOS program in order to encourage user involvement in the area of human-computer interface development. The following user participation exercises carried out in relation to the system analysis and design are described: the shadow participation of the programmers during a day of operations; the flight operations personnel interviews; and the analysis of the flight operations team tasks. The user participation in the interface prototype development, the prototype evaluation, and the system implementation are reported on. The involvement of the users early in the development process enables the requirements to be better understood and the cost to be reduced.
Boson Sampling with Single-Photon Fock States from a Bright Solid-State Source.
Loredo, J C; Broome, M A; Hilaire, P; Gazzano, O; Sagnes, I; Lemaitre, A; Almeida, M P; Senellart, P; White, A G
2017-03-31
A boson-sampling device is a quantum machine expected to perform tasks intractable for a classical computer, yet requiring minimal nonclassical resources as compared to full-scale quantum computers. Photonic implementations to date employed sources based on inefficient processes that only simulate heralded single-photon statistics when strongly reducing emission probabilities. Boson sampling with only single-photon input has thus never been realized. Here, we report on a boson-sampling device operated with a bright solid-state source of single-photon Fock states with high photon-number purity: the emission from an efficient and deterministic quantum dot-micropillar system is demultiplexed into three partially indistinguishable single photons, with a single-photon purity 1-g^{(2)}(0) of 0.990±0.001, interfering in a linear optics network. Our demultiplexed source is between 1 and 2 orders of magnitude more efficient than current heralded multiphoton sources based on spontaneous parametric down-conversion, allowing us to complete the boson-sampling experiment faster than previous equivalent implementations.
graphkernels: R and Python packages for graph comparison
Ghisu, M Elisabetta; Llinares-López, Felipe; Borgwardt, Karsten
2018-01-01
Abstract Summary Measuring the similarity of graphs is a fundamental step in the analysis of graph-structured data, which is omnipresent in computational biology. Graph kernels have been proposed as a powerful and efficient approach to this problem of graph comparison. Here we provide graphkernels, the first R and Python graph kernel libraries including baseline kernels such as label histogram based kernels, classic graph kernels such as random walk based kernels, and the state-of-the-art Weisfeiler-Lehman graph kernel. The core of all graph kernels is implemented in C ++ for efficiency. Using the kernel matrices computed by the package, we can easily perform tasks such as classification, regression and clustering on graph-structured samples. Availability and implementation The R and Python packages including source code are available at https://CRAN.R-project.org/package=graphkernels and https://pypi.python.org/pypi/graphkernels. Contact mahito@nii.ac.jp or elisabetta.ghisu@bsse.ethz.ch Supplementary information Supplementary data are available online at Bioinformatics. PMID:29028902
Scalable domain decomposition solvers for stochastic PDEs in high performance computing
Desai, Ajit; Khalil, Mohammad; Pettit, Chris; ...
2017-09-21
Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Scalable domain decomposition solvers for stochastic PDEs in high performance computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Desai, Ajit; Khalil, Mohammad; Pettit, Chris
Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Optimal Computing Budget Allocation for Particle Swarm Optimization in Stochastic Optimization.
Zhang, Si; Xu, Jie; Lee, Loo Hay; Chew, Ek Peng; Wong, Wai Peng; Chen, Chun-Hung
2017-04-01
Particle Swarm Optimization (PSO) is a popular metaheuristic for deterministic optimization. Originated in the interpretations of the movement of individuals in a bird flock or fish school, PSO introduces the concept of personal best and global best to simulate the pattern of searching for food by flocking and successfully translate the natural phenomena to the optimization of complex functions. Many real-life applications of PSO cope with stochastic problems. To solve a stochastic problem using PSO, a straightforward approach is to equally allocate computational effort among all particles and obtain the same number of samples of fitness values. This is not an efficient use of computational budget and leaves considerable room for improvement. This paper proposes a seamless integration of the concept of optimal computing budget allocation (OCBA) into PSO to improve the computational efficiency of PSO for stochastic optimization problems. We derive an asymptotically optimal allocation rule to intelligently determine the number of samples for all particles such that the PSO algorithm can efficiently select the personal best and global best when there is stochastic estimation noise in fitness values. We also propose an easy-to-implement sequential procedure. Numerical tests show that our new approach can obtain much better results using the same amount of computational effort.
Optimal Computing Budget Allocation for Particle Swarm Optimization in Stochastic Optimization
Zhang, Si; Xu, Jie; Lee, Loo Hay; Chew, Ek Peng; Chen, Chun-Hung
2017-01-01
Particle Swarm Optimization (PSO) is a popular metaheuristic for deterministic optimization. Originated in the interpretations of the movement of individuals in a bird flock or fish school, PSO introduces the concept of personal best and global best to simulate the pattern of searching for food by flocking and successfully translate the natural phenomena to the optimization of complex functions. Many real-life applications of PSO cope with stochastic problems. To solve a stochastic problem using PSO, a straightforward approach is to equally allocate computational effort among all particles and obtain the same number of samples of fitness values. This is not an efficient use of computational budget and leaves considerable room for improvement. This paper proposes a seamless integration of the concept of optimal computing budget allocation (OCBA) into PSO to improve the computational efficiency of PSO for stochastic optimization problems. We derive an asymptotically optimal allocation rule to intelligently determine the number of samples for all particles such that the PSO algorithm can efficiently select the personal best and global best when there is stochastic estimation noise in fitness values. We also propose an easy-to-implement sequential procedure. Numerical tests show that our new approach can obtain much better results using the same amount of computational effort. PMID:29170617
Complexity of the Quantum Adiabatic Algorithm
NASA Technical Reports Server (NTRS)
Hen, Itay
2013-01-01
The Quantum Adiabatic Algorithm (QAA) has been proposed as a mechanism for efficiently solving optimization problems on a quantum computer. Since adiabatic computation is analog in nature and does not require the design and use of quantum gates, it can be thought of as a simpler and perhaps more profound method for performing quantum computations that might also be easier to implement experimentally. While these features have generated substantial research in QAA, to date there is still a lack of solid evidence that the algorithm can outperform classical optimization algorithms.
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications
NASA Technical Reports Server (NTRS)
Sun, Xian-He
1997-01-01
Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
Uchida, Thomas K.; Sherman, Michael A.; Delp, Scott L.
2015-01-01
Impacts are instantaneous, computationally efficient approximations of collisions. Current impact models sacrifice important physical principles to achieve that efficiency, yielding qualitative and quantitative errors when applied to simultaneous impacts in spatial multibody systems. We present a new impact model that produces behaviour similar to that of a detailed compliant contact model, while retaining the efficiency of an instantaneous method. In our model, time and configuration are fixed, but the impact is resolved into distinct compression and expansion phases, themselves comprising sliding and rolling intervals. A constrained optimization problem is solved for each interval to compute incremental impulses while respecting physical laws and principles of contact mechanics. We present the mathematical model, algorithms for its practical implementation, and examples that demonstrate its effectiveness. In collisions involving materials of various stiffnesses, our model can be more than 20 times faster than integrating through the collision using a compliant contact model. This work extends the use of instantaneous impact models to scientific and engineering applications with strict accuracy requirements, where compliant contact models would otherwise be required. An open-source implementation is available in Simbody, a C++ multibody dynamics library widely used in biomechanical and robotic applications. PMID:27547093
Effects of Computer Architecture on FFT (Fast Fourier Transform) Algorithm Performance.
1983-12-01
Criteria for Efficient Implementation of FFT Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-30, pp. 107-109, Feb...1982. Burrus, C. S. and P. W. Eschenbacher. "An In-Place, In-Order Prime Factor FFT Algorithm," IEEE Transactions on Acoustics, Speech, and Signal... Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-30, pp. 217-226, Apr. 1982. Control Data Corporation. CDC Cyber 170 Computer Systems
Power and Performance Trade-offs for Space Time Adaptive Processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino
Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP’s computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementationmore » on the Haswell CPU architecture. The GPU architecture is able to process large size data sets without increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. A balance between the use of shared memory and main memory access leads to an improved performance in a typical STAP application.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yu, Yuqi; Wang, Jinan; Shao, Qiang, E-mail: qshao@mail.shcnc.ac.cn, E-mail: Jiye.Shi@ucb.com, E-mail: wlzhu@mail.shcnc.ac.cn
2015-03-28
The application of temperature replica exchange molecular dynamics (REMD) simulation on protein motion is limited by its huge requirement of computational resource, particularly when explicit solvent model is implemented. In the previous study, we developed a velocity-scaling optimized hybrid explicit/implicit solvent REMD method with the hope to reduce the temperature (replica) number on the premise of maintaining high sampling efficiency. In this study, we utilized this method to characterize and energetically identify the conformational transition pathway of a protein model, the N-terminal domain of calmodulin. In comparison to the standard explicit solvent REMD simulation, the hybrid REMD is much lessmore » computationally expensive but, meanwhile, gives accurate evaluation of the structural and thermodynamic properties of the conformational transition which are in well agreement with the standard REMD simulation. Therefore, the hybrid REMD could highly increase the computational efficiency and thus expand the application of REMD simulation to larger-size protein systems.« less
NASA Astrophysics Data System (ADS)
Sagui, Celeste; Pedersen, Lee G.; Darden, Thomas A.
2004-01-01
The accurate simulation of biologically active macromolecules faces serious limitations that originate in the treatment of electrostatics in the empirical force fields. The current use of "partial charges" is a significant source of errors, since these vary widely with different conformations. By contrast, the molecular electrostatic potential (MEP) obtained through the use of a distributed multipole moment description, has been shown to converge to the quantum MEP outside the van der Waals surface, when higher order multipoles are used. However, in spite of the considerable improvement to the representation of the electronic cloud, higher order multipoles are not part of current classical biomolecular force fields due to the excessive computational cost. In this paper we present an efficient formalism for the treatment of higher order multipoles in Cartesian tensor formalism. The Ewald "direct sum" is evaluated through a McMurchie-Davidson formalism [L. McMurchie and E. Davidson, J. Comput. Phys. 26, 218 (1978)]. The "reciprocal sum" has been implemented in three different ways: using an Ewald scheme, a particle mesh Ewald (PME) method, and a multigrid-based approach. We find that even though the use of the McMurchie-Davidson formalism considerably reduces the cost of the calculation with respect to the standard matrix implementation of multipole interactions, the calculation in direct space remains expensive. When most of the calculation is moved to reciprocal space via the PME method, the cost of a calculation where all multipolar interactions (up to hexadecapole-hexadecapole) are included is only about 8.5 times more expensive than a regular AMBER 7 [D. A. Pearlman et al., Comput. Phys. Commun. 91, 1 (1995)] implementation with only charge-charge interactions. The multigrid implementation is slower but shows very promising results for parallelization. It provides a natural way to interface with continuous, Gaussian-based electrostatics in the future. It is hoped that this new formalism will facilitate the systematic implementation of higher order multipoles in classical biomolecular force fields.
Shen, Yi; Dai, Wei; Richards, Virginia M
2015-03-01
A MATLAB toolbox for the efficient estimation of the threshold, slope, and lapse rate of the psychometric function is described. The toolbox enables the efficient implementation of the updated maximum-likelihood (UML) procedure. The toolbox uses an object-oriented architecture for organizing the experimental variables and computational algorithms, which provides experimenters with flexibility in experimental design and data management. Descriptions of the UML procedure and the UML Toolbox are provided, followed by toolbox use examples. Finally, guidelines and recommendations of parameter configurations are given.
An efficient quantum circuit analyser on qubits and qudits
NASA Astrophysics Data System (ADS)
Loke, T.; Wang, J. B.
2011-10-01
This paper presents a highly efficient decomposition scheme and its associated Mathematica notebook for the analysis of complicated quantum circuits comprised of single/multiple qubit and qudit quantum gates. In particular, this scheme reduces the evaluation of multiple unitary gate operations with many conditionals to just two matrix additions, regardless of the number of conditionals or gate dimensions. This improves significantly the capability of a quantum circuit analyser implemented in a classical computer. This is also the first efficient quantum circuit analyser to include qudit quantum logic gates.
Federated data storage system prototype for LHC experiments and data intensive science
NASA Astrophysics Data System (ADS)
Kiryanov, A.; Klimentov, A.; Krasnopevtsev, D.; Ryabinkin, E.; Zarochentsev, A.
2017-10-01
Rapid increase of data volume from the experiments running at the Large Hadron Collider (LHC) prompted physics computing community to evaluate new data handling and processing solutions. Russian grid sites and universities’ clusters scattered over a large area aim at the task of uniting their resources for future productive work, at the same time giving an opportunity to support large physics collaborations. In our project we address the fundamental problem of designing a computing architecture to integrate distributed storage resources for LHC experiments and other data-intensive science applications and to provide access to data from heterogeneous computing facilities. Studies include development and implementation of federated data storage prototype for Worldwide LHC Computing Grid (WLCG) centres of different levels and University clusters within one National Cloud. The prototype is based on computing resources located in Moscow, Dubna, Saint Petersburg, Gatchina and Geneva. This project intends to implement a federated distributed storage for all kind of operations such as read/write/transfer and access via WAN from Grid centres, university clusters, supercomputers, academic and commercial clouds. The efficiency and performance of the system are demonstrated using synthetic and experiment-specific tests including real data processing and analysis workflows from ATLAS and ALICE experiments, as well as compute-intensive bioinformatics applications (PALEOMIX) running on supercomputers. We present topology and architecture of the designed system, report performance and statistics for different access patterns and show how federated data storage can be used efficiently by physicists and biologists. We also describe how sharing data on a widely distributed storage system can lead to a new computing model and reformations of computing style, for instance how bioinformatics program running on supercomputers can read/write data from the federated storage.
Efficient multitasking of Choleski matrix factorization on CRAY supercomputers
NASA Technical Reports Server (NTRS)
Overman, Andrea L.; Poole, Eugene L.
1991-01-01
A Choleski method is described and used to solve linear systems of equations that arise in large scale structural analysis. The method uses a novel variable-band storage scheme and is structured to exploit fast local memory caches while minimizing data access delays between main memory and vector registers. Several parallel implementations of this method are described for the CRAY-2 and CRAY Y-MP computers demonstrating the use of microtasking and autotasking directives. A portable parallel language, FORCE, is used for comparison with the microtasked and autotasked implementations. Results are presented comparing the matrix factorization times for three representative structural analysis problems from runs made in both dedicated and multi-user modes on both computers. CPU and wall clock timings are given for the parallel implementations and are compared to single processor timings of the same algorithm.
Iterated Gate Teleportation and Blind Quantum Computation.
Pérez-Delgado, Carlos A; Fitzsimons, Joseph F
2015-06-05
Blind quantum computation allows a user to delegate a computation to an untrusted server while keeping the computation hidden. A number of recent works have sought to establish bounds on the communication requirements necessary to implement blind computation, and a bound based on the no-programming theorem of Nielsen and Chuang has emerged as a natural limiting factor. Here we show that this constraint only holds in limited scenarios, and show how to overcome it using a novel method of iterated gate teleportations. This technique enables drastic reductions in the communication required for distributed quantum protocols, extending beyond the blind computation setting. Applied to blind quantum computation, this technique offers significant efficiency improvements, and in some scenarios offers an exponential reduction in communication requirements.
Matching pursuit parallel decomposition of seismic data
NASA Astrophysics Data System (ADS)
Li, Chuanhui; Zhang, Fanchang
2017-07-01
In order to improve the computation speed of matching pursuit decomposition of seismic data, a matching pursuit parallel algorithm is designed in this paper. We pick a fixed number of envelope peaks from the current signal in every iteration according to the number of compute nodes and assign them to the compute nodes on average to search the optimal Morlet wavelets in parallel. With the help of parallel computer systems and Message Passing Interface, the parallel algorithm gives full play to the advantages of parallel computing to significantly improve the computation speed of the matching pursuit decomposition and also has good expandability. Besides, searching only one optimal Morlet wavelet by every compute node in every iteration is the most efficient implementation.
Benchmarking of Computational Models for NDE and SHM of Composites
NASA Technical Reports Server (NTRS)
Wheeler, Kevin; Leckey, Cara; Hafiychuk, Vasyl; Juarez, Peter; Timucin, Dogan; Schuet, Stefan; Hafiychuk, Halyna
2016-01-01
Ultrasonic wave phenomena constitute the leading physical mechanism for nondestructive evaluation (NDE) and structural health monitoring (SHM) of solid composite materials such as carbon-fiber-reinforced polymer (CFRP) laminates. Computational models of ultrasonic guided-wave excitation, propagation, scattering, and detection in quasi-isotropic laminates can be extremely valuable in designing practically realizable NDE and SHM hardware and software with desired accuracy, reliability, efficiency, and coverage. This paper presents comparisons of guided-wave simulations for CFRP composites implemented using three different simulation codes: two commercial finite-element analysis packages, COMSOL and ABAQUS, and a custom code implementing the Elastodynamic Finite Integration Technique (EFIT). Comparisons are also made to experimental laser Doppler vibrometry data and theoretical dispersion curves.
Hardware implementation of CMAC neural network with reduced storage requirement.
Ker, J S; Kuo, Y H; Wen, R C; Liu, B D
1997-01-01
The cerebellar model articulation controller (CMAC) neural network has the advantages of fast convergence speed and low computation complexity. However, it suffers from a low storage space utilization rate on weight memory. In this paper, we propose a direct weight address mapping approach, which can reduce the required weight memory size with a utilization rate near 100%. Based on such an address mapping approach, we developed a pipeline architecture to efficiently perform the addressing operations. The proposed direct weight address mapping approach also speeds up the computation for the generation of weight addresses. Besides, a CMAC hardware prototype used for color calibration has been implemented to confirm the proposed approach and architecture.
Atomdroid: a computational chemistry tool for mobile platforms.
Feldt, Jonas; Mata, Ricardo A; Dieterich, Johannes M
2012-04-23
We present the implementation of a new molecular mechanics program designed for use in mobile platforms, the first specifically built for these devices. The software is designed to run on Android operating systems and is compatible with several modern tablet-PCs and smartphones available in the market. It includes molecular viewer/builder capabilities with integrated routines for geometry optimizations and Monte Carlo simulations. These functionalities allow it to work as a stand-alone tool. We discuss some particular development aspects, as well as the overall feasibility of using computational chemistry software packages in mobile platforms. Benchmark calculations show that through efficient implementation techniques even hand-held devices can be used to simulate midsized systems using force fields.
Correia, J R C C C; Martins, C J A P
2017-10-01
Topological defects unavoidably form at symmetry breaking phase transitions in the early universe. To probe the parameter space of theoretical models and set tighter experimental constraints (exploiting the recent advances in astrophysical observations), one requires more and more demanding simulations, and therefore more hardware resources and computation time. Improving the speed and efficiency of existing codes is essential. Here we present a general purpose graphics-processing-unit implementation of the canonical Press-Ryden-Spergel algorithm for the evolution of cosmological domain wall networks. This is ported to the Open Computing Language standard, and as a consequence significant speedups are achieved both in two-dimensional (2D) and 3D simulations.
Computationally Efficient Multiconfigurational Reactive Molecular Dynamics
Yamashita, Takefumi; Peng, Yuxing; Knight, Chris; Voth, Gregory A.
2012-01-01
It is a computationally demanding task to explicitly simulate the electronic degrees of freedom in a system to observe the chemical transformations of interest, while at the same time sampling the time and length scales required to converge statistical properties and thus reduce artifacts due to initial conditions, finite-size effects, and limited sampling. One solution that significantly reduces the computational expense consists of molecular models in which effective interactions between particles govern the dynamics of the system. If the interaction potentials in these models are developed to reproduce calculated properties from electronic structure calculations and/or ab initio molecular dynamics simulations, then one can calculate accurate properties at a fraction of the computational cost. Multiconfigurational algorithms model the system as a linear combination of several chemical bonding topologies to simulate chemical reactions, also sometimes referred to as “multistate”. These algorithms typically utilize energy and force calculations already found in popular molecular dynamics software packages, thus facilitating their implementation without significant changes to the structure of the code. However, the evaluation of energies and forces for several bonding topologies per simulation step can lead to poor computational efficiency if redundancy is not efficiently removed, particularly with respect to the calculation of long-ranged Coulombic interactions. This paper presents accurate approximations (effective long-range interaction and resulting hybrid methods) and multiple-program parallelization strategies for the efficient calculation of electrostatic interactions in reactive molecular simulations. PMID:25100924
Multi-GPU hybrid programming accelerated three-dimensional phase-field model in binary alloy
NASA Astrophysics Data System (ADS)
Zhu, Changsheng; Liu, Jieqiong; Zhu, Mingfang; Feng, Li
2018-03-01
In the process of dendritic growth simulation, the computational efficiency and the problem scales have extremely important influence on simulation efficiency of three-dimensional phase-field model. Thus, seeking for high performance calculation method to improve the computational efficiency and to expand the problem scales has a great significance to the research of microstructure of the material. A high performance calculation method based on MPI+CUDA hybrid programming model is introduced. Multi-GPU is used to implement quantitative numerical simulations of three-dimensional phase-field model in binary alloy under the condition of multi-physical processes coupling. The acceleration effect of different GPU nodes on different calculation scales is explored. On the foundation of multi-GPU calculation model that has been introduced, two optimization schemes, Non-blocking communication optimization and overlap of MPI and GPU computing optimization, are proposed. The results of two optimization schemes and basic multi-GPU model are compared. The calculation results show that the use of multi-GPU calculation model can improve the computational efficiency of three-dimensional phase-field obviously, which is 13 times to single GPU, and the problem scales have been expanded to 8193. The feasibility of two optimization schemes is shown, and the overlap of MPI and GPU computing optimization has better performance, which is 1.7 times to basic multi-GPU model, when 21 GPUs are used.
Smith, W; Bedayse, S; Lalwah, S L; Paryag, A
2009-08-01
The University of the West Indies (UWI) Dental School is planning to implement computer-based information systems to manage student and patient data. In order to measure the acceptance of the proposed implementation and to determine the degree of training that would be required, a survey was undertaken of the computer literacy and attitude of all staff and students. Data were collected via 230 questionnaires from all staff and students. A 78% response rate was obtained. The computer literacy of the majority of respondents was ranked as 'more than adequate' compared to other European Dental Schools. Respondents < 50 years had significantly higher computer literacy scores than older age groups (P < 0.05). Similarly, respondents who owned an email address, a computer, or were members of online social networking sites had significantly higher computer literacy scores than those who did not (P < 0.05). Sex, nationality and whether the respondent was student/staff were not significant factors. Most respondents felt that computer literacy should be a part of every modern undergraduate curriculum; that computer assisted learning applications and web-based learning activity could effectively supplement the traditional undergraduate curriculum and that a suitable information system would improve the efficiency in the school's management of students, teaching and clinics. The implementation of a computer-based information system is likely to have widespread acceptance among students and staff at the UWI Dental School. The computer literacy of the students and staff are on par with those of schools in the US and Europe.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fong, K. W.
1977-08-15
This report deals with some techniques in applied programming using the Livermore Timesharing System (LTSS) on the CDC 7600 computers at the National Magnetic Fusion Energy Computer Center (NMFECC) and the Lawrence Livermore Laboratory Computer Center (LLLCC or Octopus network). This report is based on a document originally written specifically about the system as it is implemented at NMFECC but has been revised to accommodate differences between LLLCC and NMFECC implementations. Topics include: maintaining programs, debugging, recovering from system crashes, and using the central processing unit, memory, and input/output devices efficiently and economically. Routines that aid in these procedures aremore » mentioned. The companion report, UCID-17556, An LTSS Compendium, discusses the hardware and operating system and should be read before reading this report.« less
Efficiently passing messages in distributed spiking neural network simulation.
Thibeault, Corey M; Minkovich, Kirill; O'Brien, Michael J; Harris, Frederick C; Srinivasa, Narayan
2013-01-01
Efficiently passing spiking messages in a neural model is an important aspect of high-performance simulation. As the scale of networks has increased so has the size of the computing systems required to simulate them. In addition, the information exchange of these resources has become more of an impediment to performance. In this paper we explore spike message passing using different mechanisms provided by the Message Passing Interface (MPI). A specific implementation, MVAPICH, designed for high-performance clusters with Infiniband hardware is employed. The focus is on providing information about these mechanisms for users of commodity high-performance spiking simulators. In addition, a novel hybrid method for spike exchange was implemented and benchmarked.
Implementation of efficient trajectories for an ultrasonic scanner using chaotic maps
NASA Astrophysics Data System (ADS)
Almeda, A.; Baltazar, A.; Treesatayapun, C.; Mijarez, R.
2012-05-01
Typical ultrasonic methodology for nondestructive scanning evaluation uses systematic scanning paths. In many cases, this approach is time inefficient and also energy and computational power consuming. Here, a methodology for the scanning of defects using an ultrasonic echo-pulse scanning technique combined with chaotic trajectory generation is proposed. This is implemented in a Cartesian coordinate robotic system developed in our lab. To cover the entire search area, a chaotic function and a proposed mirror mapping were incorporated. To improve detection probability, our proposed scanning methodology is complemented with a probabilistic approach of discontinuity detection. The developed methodology was found to be more efficient than traditional ones used to localize and characterize hidden flaws.
Automating ATLAS Computing Operations using the Site Status Board
NASA Astrophysics Data System (ADS)
J, Andreeva; Iglesias C, Borrego; S, Campana; Girolamo A, Di; I, Dzhunov; Curull X, Espinal; S, Gayazov; E, Magradze; M, Nowotka M.; L, Rinaldi; P, Saiz; J, Schovancova; A, Stewart G.; M, Wright
2012-12-01
The automation of operations is essential to reduce manpower costs and improve the reliability of the system. The Site Status Board (SSB) is a framework which allows Virtual Organizations to monitor their computing activities at distributed sites and to evaluate site performance. The ATLAS experiment intensively uses the SSB for the distributed computing shifts, for estimating data processing and data transfer efficiencies at a particular site, and for implementing automatic exclusion of sites from computing activities, in case of potential problems. The ATLAS SSB provides a real-time aggregated monitoring view and keeps the history of the monitoring metrics. Based on this history, usability of a site from the perspective of ATLAS is calculated. The paper will describe how the SSB is integrated in the ATLAS operations and computing infrastructure and will cover implementation details of the ATLAS SSB sensors and alarm system, based on the information in the SSB. It will demonstrate the positive impact of the use of the SSB on the overall performance of ATLAS computing activities and will overview future plans.
Autonomic Closure for Turbulent Flows Using Approximate Bayesian Computation
NASA Astrophysics Data System (ADS)
Doronina, Olga; Christopher, Jason; Hamlington, Peter; Dahm, Werner
2017-11-01
Autonomic closure is a new technique for achieving fully adaptive and physically accurate closure of coarse-grained turbulent flow governing equations, such as those solved in large eddy simulations (LES). Although autonomic closure has been shown in recent a priori tests to more accurately represent unclosed terms than do dynamic versions of traditional LES models, the computational cost of the approach makes it challenging to implement for simulations of practical turbulent flows at realistically high Reynolds numbers. The optimization step used in the approach introduces large matrices that must be inverted and is highly memory intensive. In order to reduce memory requirements, here we propose to use approximate Bayesian computation (ABC) in place of the optimization step, thereby yielding a computationally-efficient implementation of autonomic closure that trades memory-intensive for processor-intensive computations. The latter challenge can be overcome as co-processors such as general purpose graphical processing units become increasingly available on current generation petascale and exascale supercomputers. In this work, we outline the formulation of ABC-enabled autonomic closure and present initial results demonstrating the accuracy and computational cost of the approach.
A computationally efficient method to treat secondary organic aerosol (SOA) from various length and structure alkanes as well as SOA from polycyclic aromatic hydrocarbons (PAHs) is implemented in the Community Multiscale Air Quality (CMAQ) model to predict aerosol concentrations ...
Massive Query Resolution for Rapid Selective Dissemination of Information.
ERIC Educational Resources Information Center
Cohen, Jonathan D.
1999-01-01
Outlines an efficient approach to performing query resolution which, when matched with a keyword scanner, offers rapid selecting and routing for massive Boolean queries, and which is suitable for implementation on a desktop computer. Demonstrates the system's operation with large examples in a practical setting. (AEF)
Turbulence model development and application at Lockheed Fort Worth Company
NASA Technical Reports Server (NTRS)
Smith, Brian R.
1995-01-01
This viewgraph presentation demonstrates that computationally efficient k-l and k-kl turbulence models have been developed and implemented at Lockheed Fort Worth Company. Many years of experience have been gained applying two equation turbulence models to complex three-dimensional flows for design and analysis.
The Comprehensive Competencies Program Reference Manual. Volume I. Introduction.
ERIC Educational Resources Information Center
Taggart, Robert
Chapter 1 of this reference manual is a summary of the comprehensive competencies program (CCP). It describes this system for organizing, implementing, managing, and efficiently delivering individualized self-paced instruction, combined with group and experience-based learning activities, using computer-assisted instruction. (The CCP covers not…
TRIM—3D: a three-dimensional model for accurate simulation of shallow water flow
Casulli, Vincenzo; Bertolazzi, Enrico; Cheng, Ralph T.
1993-01-01
A semi-implicit finite difference formulation for the numerical solution of three-dimensional tidal circulation is discussed. The governing equations are the three-dimensional Reynolds equations in which the pressure is assumed to be hydrostatic. A minimal degree of implicitness has been introduced in the finite difference formula so that the resulting algorithm permits the use of large time steps at a minimal computational cost. This formulation includes the simulation of flooding and drying of tidal flats, and is fully vectorizable for an efficient implementation on modern vector computers. The high computational efficiency of this method has made it possible to provide the fine details of circulation structure in complex regions that previous studies were unable to obtain. For proper interpretation of the model results suitable interactive graphics is also an essential tool.
DNS of Flow in a Low-Pressure Turbine Cascade Using a Discontinuous-Galerkin Spectral-Element Method
NASA Technical Reports Server (NTRS)
Garai, Anirban; Diosady, Laslo Tibor; Murman, Scott; Madavan, Nateri
2015-01-01
A new computational capability under development for accurate and efficient high-fidelity direct numerical simulation (DNS) and large eddy simulation (LES) of turbomachinery is described. This capability is based on an entropy-stable Discontinuous-Galerkin spectral-element approach that extends to arbitrarily high orders of spatial and temporal accuracy and is implemented in a computationally efficient manner on a modern high performance computer architecture. A validation study using this method to perform DNS of flow in a low-pressure turbine airfoil cascade are presented. Preliminary results indicate that the method captures the main features of the flow. Discrepancies between the predicted results and the experiments are likely due to the effects of freestream turbulence not being included in the simulation and will be addressed in the final paper.
On Improving Efficiency of Differential Evolution for Aerodynamic Shape Optimization Applications
NASA Technical Reports Server (NTRS)
Madavan, Nateri K.
2004-01-01
Differential Evolution (DE) is a simple and robust evolutionary strategy that has been proven effective in determining the global optimum for several difficult optimization problems. Although DE offers several advantages over traditional optimization approaches, its use in applications such as aerodynamic shape optimization where the objective function evaluations are computationally expensive is limited by the large number of function evaluations often required. In this paper various approaches for improving the efficiency of DE are reviewed and discussed. These approaches are implemented in a DE-based aerodynamic shape optimization method that uses a Navier-Stokes solver for the objective function evaluations. Parallelization techniques on distributed computers are used to reduce turnaround times. Results are presented for the inverse design of a turbine airfoil. The efficiency improvements achieved by the different approaches are evaluated and compared.
Towards a flexible middleware for context-aware pervasive and wearable systems.
Muro, Marco; Amoretti, Michele; Zanichelli, Francesco; Conte, Gianni
2012-11-01
Ambient intelligence and wearable computing call for innovative hardware and software technologies, including a highly capable, flexible and efficient middleware, allowing for the reuse of existing pervasive applications when developing new ones. In the considered application domain, middleware should also support self-management, interoperability among different platforms, efficient communications, and context awareness. In the on-going "everything is networked" scenario scalability appears as a very important issue, for which the peer-to-peer (P2P) paradigm emerges as an appealing solution for connecting software components in an overlay network, allowing for efficient and balanced data distribution mechanisms. In this paper, we illustrate how all these concepts can be placed into a theoretical tool, called networked autonomic machine (NAM), implemented into a NAM-based middleware, and evaluated against practical problems of pervasive computing.
Optimizing Approximate Weighted Matching on Nvidia Kepler K40
DOE Office of Scientific and Technical Information (OSTI.GOV)
Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh
Matching is a fundamental graph problem with numerous applications in science and engineering. While algorithms for computing optimal matchings are difficult to parallelize, approximation algorithms on the other hand generally compute high quality solutions and are amenable to parallelization. In this paper, we present efficient implementations of the current best algorithm for half-approximate weighted matching, the Suitor algorithm, on Nvidia Kepler K-40 platform. We develop four variants of the algorithm that exploit hardware features to address key challenges for a GPU implementation. We also experiment with different combinations of work assigned to a warp. Using an exhaustive set ofmore » $269$ inputs, we demonstrate that the new implementation outperforms the previous best GPU algorithm by $10$ to $$100\\times$$ for over $100$ instances, and from $100$ to $$1000\\times$$ for $15$ instances. We also demonstrate up to $$20\\times$$ speedup relative to $2$ threads, and up to $$5\\times$$ relative to $16$ threads on Intel Xeon platform with $16$ cores for the same algorithm. The new algorithms and implementations provided in this paper will have a direct impact on several applications that repeatedly use matching as a key compute kernel. Further, algorithm designs and insights provided in this paper will benefit other researchers implementing graph algorithms on modern GPU architectures.« less
Massively Parallel Solution of Poisson Equation on Coarse Grain MIMD Architectures
NASA Technical Reports Server (NTRS)
Fijany, A.; Weinberger, D.; Roosta, R.; Gulati, S.
1998-01-01
In this paper a new algorithm, designated as Fast Invariant Imbedding algorithm, for solution of Poisson equation on vector and massively parallel MIMD architectures is presented. This algorithm achieves the same optimal computational efficiency as other Fast Poisson solvers while offering a much better structure for vector and parallel implementation. Our implementation on the Intel Delta and Paragon shows that a speedup of over two orders of magnitude can be achieved even for moderate size problems.
Terascale spectral element algorithms and implementations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fischer, P. F.; Tufo, H. M.
1999-08-17
We describe the development and implementation of an efficient spectral element code for multimillion gridpoint simulations of incompressible flows in general two- and three-dimensional domains. We review basic and recently developed algorithmic underpinnings that have resulted in good parallel and vector performance on a broad range of architectures, including the terascale computing systems now coming online at the DOE labs. Sustained performance of 219 GFLOPS has been recently achieved on 2048 nodes of the Intel ASCI-Red machine at Sandia.
SCUT: clinical data organization for physicians using pen computers.
Wormuth, D. W.
1992-01-01
The role of computers in assisting physicians with patient care is rapidly advancing. One of the significant obstacles to efficient use of computers in patient care has been the unavailability of reasonably configured portable computers. Lightweight portable computers are becoming more attractive as physician data-management devices, but still pose a significant problem with bedside use. The advent of computers designed to accept input from a pen and having no keyboard present a usable computer platform to enable physicians to perform clinical computing at the bedside. This paper describes a prototype system to maintain an electronic "scut" sheet. SCUT makes use of pen-input and background rule checking to enhance patient care. GO Corporation's PenPoint Operating System is used to implement the SCUT project. PMID:1483012
Aerodynamic optimization studies on advanced architecture computers
NASA Technical Reports Server (NTRS)
Chawla, Kalpana
1995-01-01
The approach to carrying out multi-discipline aerospace design studies in the future, especially in massively parallel computing environments, comprises of choosing (1) suitable solvers to compute solutions to equations characterizing a discipline, and (2) efficient optimization methods. In addition, for aerodynamic optimization problems, (3) smart methodologies must be selected to modify the surface shape. In this research effort, a 'direct' optimization method is implemented on the Cray C-90 to improve aerodynamic design. It is coupled with an existing implicit Navier-Stokes solver, OVERFLOW, to compute flow solutions. The optimization method is chosen such that it can accomodate multi-discipline optimization in future computations. In the work , however, only single discipline aerodynamic optimization will be included.
Application of computational aero-acoustics to real world problems
NASA Technical Reports Server (NTRS)
Hardin, Jay C.
1996-01-01
The application of computational aeroacoustics (CAA) to real problems is discussed in relation to the analysis performed with the aim of assessing the application of the various techniques. It is considered that the applications are limited by the inability of the computational resources to resolve the large range of scales involved in high Reynolds number flows. Possible simplifications are discussed. It is considered that problems remain to be solved in relation to the efficient use of the power of parallel computers and in the development of turbulent modeling schemes. The goal of CAA is stated as being the implementation of acoustic design studies on a computer terminal with reasonable run times.
User's Guide for Computer Program that Routes Signal Traces
NASA Technical Reports Server (NTRS)
Hedgley, David R., Jr.
2000-01-01
This disk contains both a FORTRAN computer program and the corresponding user's guide that facilitates both its incorporation into your system and its utility. The computer program represents an efficient algorithm that routes signal traces on layers of a printed circuit with both through-pins and surface mounts. The computer program included is an implementation of the ideas presented in the theoretical paper titled "A Formal Algorithm for Routing Signal Traces on a Printed Circuit Board", NASA TP-3639 published in 1996. The computer program in the "connects" file can be read with a FORTRAN compiler and readily integrated into software unique to each particular environment where it might be used.
Borazjani, Iman; Ge, Liang; Le, Trung; Sotiropoulos, Fotis
2013-01-01
We develop an overset-curvilinear immersed boundary (overset-CURVIB) method in a general non-inertial frame of reference to simulate a wide range of challenging biological flow problems. The method incorporates overset-curvilinear grids to efficiently handle multi-connected geometries and increase the resolution locally near immersed boundaries. Complex bodies undergoing arbitrarily large deformations may be embedded within the overset-curvilinear background grid and treated as sharp interfaces using the curvilinear immersed boundary (CURVIB) method (Ge and Sotiropoulos, Journal of Computational Physics, 2007). The incompressible flow equations are formulated in a general non-inertial frame of reference to enhance the overall versatility and efficiency of the numerical approach. Efficient search algorithms to identify areas requiring blanking, donor cells, and interpolation coefficients for constructing the boundary conditions at grid interfaces of the overset grid are developed and implemented using efficient parallel computing communication strategies to transfer information among sub-domains. The governing equations are discretized using a second-order accurate finite-volume approach and integrated in time via an efficient fractional-step method. Various strategies for ensuring globally conservative interpolation at grid interfaces suitable for incompressible flow fractional step methods are implemented and evaluated. The method is verified and validated against experimental data, and its capabilities are demonstrated by simulating the flow past multiple aquatic swimmers and the systolic flow in an anatomic left ventricle with a mechanical heart valve implanted in the aortic position. PMID:23833331
Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale
Huang, Muhuan; Wu, Di; Yu, Cody Hao; Fang, Zhenman; Interlandi, Matteo; Condie, Tyson; Cong, Jason
2017-01-01
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, high performance and energy efficiency. Evidenced by Microsoft’s FPGA deployment in its Bing search engine and Intel’s 16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems—like Apache Spark and Hadoop—to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for enabling easy and efficient deployments of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming efforts to access FPGA accelerators in systems like Apache Spark and YARN, and improves the system throughput by 1.7 × to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster. PMID:28317049
Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale.
Huang, Muhuan; Wu, Di; Yu, Cody Hao; Fang, Zhenman; Interlandi, Matteo; Condie, Tyson; Cong, Jason
2016-10-01
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, high performance and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's 16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems-like Apache Spark and Hadoop-to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for enabling easy and efficient deployments of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming efforts to access FPGA accelerators in systems like Apache Spark and YARN, and improves the system throughput by 1.7 × to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
NASA Astrophysics Data System (ADS)
Liu, Wei; Chen, Shu-Ming; Zhang, Jian; Wu, Chun-Wang; Wu, Wei; Chen, Ping-Xing
2015-03-01
It is widely believed that Shor’s factoring algorithm provides a driving force to boost the quantum computing research. However, a serious obstacle to its binary implementation is the large number of quantum gates. Non-binary quantum computing is an efficient way to reduce the required number of elemental gates. Here, we propose optimization schemes for Shor’s algorithm implementation and take a ternary version for factorizing 21 as an example. The optimized factorization is achieved by a two-qutrit quantum circuit, which consists of only two single qutrit gates and one ternary controlled-NOT gate. This two-qutrit quantum circuit is then encoded into the nine lower vibrational states of an ion trapped in a weakly anharmonic potential. Optimal control theory (OCT) is employed to derive the manipulation electric field for transferring the encoded states. The ternary Shor’s algorithm can be implemented in one single step. Numerical simulation results show that the accuracy of the state transformations is about 0.9919. Project supported by the National Natural Science Foundation of China (Grant No. 61205108) and the High Performance Computing (HPC) Foundation of National University of Defense Technology, China.
Computational analysis of high resolution unsteady airloads for rotor aeroacoustics
NASA Technical Reports Server (NTRS)
Quackenbush, Todd R.; Lam, C.-M. Gordon; Wachspress, Daniel A.; Bliss, Donald B.
1994-01-01
The study of helicopter aerodynamic loading for acoustics applications requires the application of efficient yet accurate simulations of the velocity field induced by the rotor's vortex wake. This report summarizes work to date on the development of such an analysis, which builds on the Constant Vorticity Contour (CVC) free wake model, previously implemented for the study of vibratory loading in the RotorCRAFT computer code. The present effort has focused on implementation of an airload reconstruction approach that computes high resolution airload solutions of rotor/rotor-wake interactions required for acoustics computations. Supplementary efforts on the development of improved vortex core modeling, unsteady aerodynamic effects, higher spatial resolution of rotor loading, and fast vortex wake implementations have substantially enhanced the capabilities of the resulting software, denoted RotorCRAFT/AA (AeroAcoustics). Results of validation calculations using recently acquired model rotor data show that by employing airload reconstruction it is possible to apply the CVC wake analysis with temporal and spatial resolution suitable for acoustics applications while reducing the computation time required by one to two orders of magnitude relative to that required by direct calculations. Promising correlation with this body of airload and noise data has been obtained for a variety of rotor configurations and operating conditions.
Computationally efficient algorithms for real-time attitude estimation
NASA Technical Reports Server (NTRS)
Pringle, Steven R.
1993-01-01
For many practical spacecraft applications, algorithms for determining spacecraft attitude must combine inputs from diverse sensors and provide redundancy in the event of sensor failure. A Kalman filter is suitable for this task, however, it may impose a computational burden which may be avoided by sub optimal methods. A suboptimal estimator is presented which was implemented successfully on the Delta Star spacecraft which performed a 9 month SDI flight experiment in 1989. This design sought to minimize algorithm complexity to accommodate the limitations of an 8K guidance computer. The algorithm used is interpreted in the framework of Kalman filtering and a derivation is given for the computation.
Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.« less
FPGA-Based, Self-Checking, Fault-Tolerant Computers
NASA Technical Reports Server (NTRS)
Some, Raphael; Rennels, David
2004-01-01
A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and highspeed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant- computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
Effective correlator for RadioAstron project
NASA Astrophysics Data System (ADS)
Sergeev, Sergey
This paper presents the implementation of programme FX-correlator for Very Long Baseline Interferometry, adapted for the project "RadioAstron". Software correlator implemented for heterogeneous computing systems using graphics accelerators. It is shown that for the task interferometry implementation of the graphics hardware has a high efficiency. The host processor of heterogeneous computing system, performs the function of forming the data flow for graphics accelerators, the number of which corresponds to the number of frequency channels. So, for the Radioastron project, such channels is seven. Each accelerator is perform correlation matrix for all bases for a single frequency channel. Initial data is converted to the floating-point format, is correction for the corresponding delay function and computes the entire correlation matrix simultaneously. Calculation of the correlation matrix is performed using the sliding Fourier transform. Thus, thanks to the compliance of a solved problem for architecture graphics accelerators, managed to get a performance for one processor platform Kepler, which corresponds to the performance of this task, the computing cluster platforms Intel on four nodes. This task successfully scaled not only on a large number of graphics accelerators, but also on a large number of nodes with multiple accelerators.
R package MVR for Joint Adaptive Mean-Variance Regularization and Variance Stabilization
Dazard, Jean-Eudes; Xu, Hua; Rao, J. Sunil
2015-01-01
We present an implementation in the R language for statistical computing of our recent non-parametric joint adaptive mean-variance regularization and variance stabilization procedure. The method is specifically suited for handling difficult problems posed by high-dimensional multivariate datasets (p ≫ n paradigm), such as in ‘omics’-type data, among which are that the variance is often a function of the mean, variable-specific estimators of variances are not reliable, and tests statistics have low powers due to a lack of degrees of freedom. The implementation offers a complete set of features including: (i) normalization and/or variance stabilization function, (ii) computation of mean-variance-regularized t and F statistics, (iii) generation of diverse diagnostic plots, (iv) synthetic and real ‘omics’ test datasets, (v) computationally efficient implementation, using C interfacing, and an option for parallel computing, (vi) manual and documentation on how to setup a cluster. To make each feature as user-friendly as possible, only one subroutine per functionality is to be handled by the end-user. It is available as an R package, called MVR (‘Mean-Variance Regularization’), downloadable from the CRAN. PMID:26819572
Fine-grained parallel RNAalifold algorithm for RNA secondary structure prediction on FPGA
Xia, Fei; Dou, Yong; Zhou, Xingming; Yang, Xuejun; Xu, Jiaqing; Zhang, Yang
2009-01-01
Background In the field of RNA secondary structure prediction, the RNAalifold algorithm is one of the most popular methods using free energy minimization. However, general-purpose computers including parallel computers or multi-core computers exhibit parallel efficiency of no more than 50%. Field Programmable Gate-Array (FPGA) chips provide a new approach to accelerate RNAalifold by exploiting fine-grained custom design. Results RNAalifold shows complicated data dependences, in which the dependence distance is variable, and the dependence direction is also across two dimensions. We propose a systolic array structure including one master Processing Element (PE) and multiple slave PEs for fine grain hardware implementation on FPGA. We exploit data reuse schemes to reduce the need to load energy matrices from external memory. We also propose several methods to reduce energy table parameter size by 80%. Conclusion To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete RNAalifold algorithm. The experimental results show a factor of 12.2 speedup over the RNAalifold (ViennaPackage – 1.6.5) software for a group of aligned RNA sequences with 2981-residue running on a Personal Computer (PC) platform with Pentium 4 2.6 GHz CPU. PMID:19208138
Acceleration and torque feedback for robotic control - Experimental results
NASA Technical Reports Server (NTRS)
Mclnroy, John E.; Saridis, George N.
1990-01-01
Gross motion control of robotic manipulators typically requires significant on-line computations to compensate for nonlinear dynamics due to gravity, Coriolis, centripetal, and friction nonlinearities. One controller proposed by Luo and Saridis avoids these computations by feeding back joint acceleration and torque. This study implements the controller on a Puma 600 robotic manipulator. Joint acceleration measurement is obtained by measuring linear accelerations of each joint, and deriving a computationally efficient transformation from the linear measurements to the angular accelerations. Torque feedback is obtained by using the previous torque sent to the joints. The implementation has stability problems on the Puma 600 due to the extremely high gains inherent in the feedback structure. Since these high gains excite frequency modes in the Puma 600, the algorithm is modified to decrease the gain inherent in the feedback structure. The resulting compensator is stable and insensitive to high frequency unmodeled dynamics. Moreover, a second compensator is proposed which uses acceleration and torque feedback, but still allows nonlinear terms to be fed forward. Thus, by feeding the increment in the easily calculated gravity terms forward, improved responses are obtained. Both proposed compensators are implemented, and the real time results are compared to those obtained with the computed torque algorithm.
Scan line graphics generation on the massively parallel processor
NASA Technical Reports Server (NTRS)
Dorband, John E.
1988-01-01
Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. To perform pixel value calculations, facilitate load balancing across the processors and apply the results to the Z buffer efficiently in parallel requires special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.
Nonperturbative methods in HZE ion transport
NASA Technical Reports Server (NTRS)
Wilson, John W.; Badavi, Francis F.; Costen, Robert C.; Shinn, Judy L.
1993-01-01
A nonperturbative analytic solution of the high charge and energy (HZE) Green's function is used to implement a computer code for laboratory ion beam transport. The code is established to operate on the Langley Research Center nuclear fragmentation model used in engineering applications. Computational procedures are established to generate linear energy transfer (LET) distributions for a specified ion beam and target for comparison with experimental measurements. The code is highly efficient and compares well with the perturbation approximations.
NASA Astrophysics Data System (ADS)
Ferrando, N.; Gosálvez, M. A.; Cerdá, J.; Gadea, R.; Sato, K.
2011-02-01
The current success of the continuous cellular automata for the simulation of anisotropic wet chemical etching of silicon in microengineering applications is based on a relatively fast, approximate, constant time stepping implementation (CTS), whose accuracy against the exact algorithm—a computationally slow, variable time stepping implementation (VTS)—has not been previously analyzed in detail. In this study we show that the CTS implementation can generate moderately wrong etch rates and overall etching fronts, thus justifying the presentation of a novel, exact reformulation of the VTS implementation based on a new state variable, referred to as the predicted removal time (PRT), and the use of a self-balanced binary search tree that enables storage and efficient access to the PRT values in each time step in order to quickly remove the corresponding surface atom/s. The proposed PRT method reduces the simulation cost of the exact implementation from {O}(N^{5/3}) to {O}(N^{3/2} log N) without introducing any model simplifications. This enables more precise simulations (only limited by numerical precision errors) with affordable computational times that are similar to the less precise CTS implementation and even faster for low reactivity systems.
Accelerating Subsurface Transport Simulation on Heterogeneous Clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Gawande, Nitin A.; Tumeo, Antonino
Reactive transport numerical models simulate chemical and microbiological reactions that occur along a flowpath. These models have to compute reactions for a large number of locations. They solve the set of ordinary differential equations (ODEs) that describes the reaction for each location through the Newton-Raphson technique. This technique involves computing a Jacobian matrix and a residual vector for each set of equation, and then solving iteratively the linearized system by performing Gaussian Elimination and LU decomposition until convergence. STOMP, a well known subsurface flow simulation tool, employs matrices with sizes in the order of 100x100 elements and, for numerical accuracy,more » LU factorization with full pivoting instead of the faster partial pivoting. Modern high performance computing systems are heterogeneous machines whose nodes integrate both CPUs and GPUs, exposing unprecedented amounts of parallelism. To exploit all their computational power, applications must use both the types of processing elements. For the case of subsurface flow simulation, this mainly requires implementing efficient batched LU-based solvers and identifying efficient solutions for enabling load balancing among the different processors of the system. In this paper we discuss two approaches that allows scaling STOMP's performance on heterogeneous clusters. We initially identify the challenges in implementing batched LU-based solvers for small matrices on GPUs, and propose an implementation that fulfills STOMP's requirements. We compare this implementation to other existing solutions. Then, we combine the batched GPU solver with an OpenMP-based CPU solver, and present an adaptive load balancer that dynamically distributes the linear systems to solve between the two components inside a node. We show how these approaches, integrated into the full application, provide speed ups from 6 to 7 times on large problems, executed on up to 16 nodes of a cluster with two AMD Opteron 6272 and a Tesla M2090 per node.« less
Hi-Corrector: a fast, scalable and memory-efficient package for normalizing large-scale Hi-C data.
Li, Wenyuan; Gong, Ke; Li, Qingjiao; Alber, Frank; Zhou, Xianghong Jasmine
2015-03-15
Genome-wide proximity ligation assays, e.g. Hi-C and its variant TCC, have recently become important tools to study spatial genome organization. Removing biases from chromatin contact matrices generated by such techniques is a critical preprocessing step of subsequent analyses. The continuing decline of sequencing costs has led to an ever-improving resolution of the Hi-C data, resulting in very large matrices of chromatin contacts. Such large-size matrices, however, pose a great challenge on the memory usage and speed of its normalization. Therefore, there is an urgent need for fast and memory-efficient methods for normalization of Hi-C data. We developed Hi-Corrector, an easy-to-use, open source implementation of the Hi-C data normalization algorithm. Its salient features are (i) scalability-the software is capable of normalizing Hi-C data of any size in reasonable times; (ii) memory efficiency-the sequential version can run on any single computer with very limited memory, no matter how little; (iii) fast speed-the parallel version can run very fast on multiple computing nodes with limited local memory. The sequential version is implemented in ANSI C and can be easily compiled on any system; the parallel version is implemented in ANSI C with the MPI library (a standardized and portable parallel environment designed for solving large-scale scientific problems). The package is freely available at http://zhoulab.usc.edu/Hi-Corrector/. © The Author 2014. Published by Oxford University Press.
Simple, efficient allocation of modelling runs on heterogeneous clusters with MPI
Donato, David I.
2017-01-01
In scientific modelling and computation, the choice of an appropriate method for allocating tasks for parallel processing depends on the computational setting and on the nature of the computation. The allocation of independent but similar computational tasks, such as modelling runs or Monte Carlo trials, among the nodes of a heterogeneous computational cluster is a special case that has not been specifically evaluated previously. A simulation study shows that a method of on-demand (that is, worker-initiated) pulling from a bag of tasks in this case leads to reliably short makespans for computational jobs despite heterogeneity both within and between cluster nodes. A simple reference implementation in the C programming language with the Message Passing Interface (MPI) is provided.
A singularity free analytical solution of artificial satellite motion with drag
NASA Technical Reports Server (NTRS)
Mueller, A.
1978-01-01
An analytical satellite theory based on the regular, canonical Poincare-Similar (PS phi) elements is described along with an accurate density model which can be implemented into the drag theory. A computationally efficient manner in which to expand the equations of motion into a fourier series is discussed.
ERIC Educational Resources Information Center
Russell, Thyra K.
Morris Library at Southern Illinois University computerized its technical processes using the Library Computer System (LCS), which was implemented in the library to streamline order processing by: (1) providing up-to-date online files to track in-process items; (2) encouraging quick, efficient accessing of information; (3) reducing manual files;…
The United States Environmental Protection Agency is developing a Computer
Aided Process Engineering (CAPE) software tool for the metal finishing
industry that helps users design efficient metal finishing processes that
are less polluting to the environment. Metal finish...
Parallel computing techniques for rotorcraft aerodynamics
NASA Astrophysics Data System (ADS)
Ekici, Kivanc
The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).
TOUGH3: A new efficient version of the TOUGH suite of multiphase flow and transport simulators
NASA Astrophysics Data System (ADS)
Jung, Yoojin; Pau, George Shu Heng; Finsterle, Stefan; Pollyea, Ryan M.
2017-11-01
The TOUGH suite of nonisothermal multiphase flow and transport simulators has been updated by various developers over many years to address a vast range of challenging subsurface problems. The increasing complexity of the simulated processes as well as the growing size of model domains that need to be handled call for an improvement in the simulator's computational robustness and efficiency. Moreover, modifications have been frequently introduced independently, resulting in multiple versions of TOUGH that (1) led to inconsistencies in feature implementation and usage, (2) made code maintenance and development inefficient, and (3) caused confusion to users and developers. TOUGH3-a new base version of TOUGH-addresses these issues. It consolidates both the serial (TOUGH2 V2.1) and parallel (TOUGH2-MP V2.0) implementations, enabling simulations to be performed on desktop computers and supercomputers using a single code. New PETSc parallel linear solvers are added to the existing serial solvers of TOUGH2 and the Aztec solver used in TOUGH2-MP. The PETSc solvers generally perform better than the Aztec solvers in parallel and the internal TOUGH3 linear solver in serial. TOUGH3 also incorporates many new features, addresses bugs, and improves the flexibility of data handling. Due to the improved capabilities and usability, TOUGH3 is more robust and efficient for solving tough and computationally demanding problems in diverse scientific and practical applications related to subsurface flow modeling.
NASA Astrophysics Data System (ADS)
Zackay, Barak; Ofek, Eran O.
2017-01-01
Astronomical radio signals are subjected to phase dispersion while traveling through the interstellar medium. To optimally detect a short-duration signal within a frequency band, we have to precisely compensate for the unknown pulse dispersion, which is a computationally demanding task. We present the “fast dispersion measure transform” algorithm for optimal detection of such signals. Our algorithm has a low theoretical complexity of 2{N}f{N}t+{N}t{N}{{Δ }}{{log}}2({N}f), where Nf, Nt, and NΔ are the numbers of frequency bins, time bins, and dispersion measure bins, respectively. Unlike previously suggested fast algorithms, our algorithm conserves the sensitivity of brute-force dedispersion. Our tests indicate that this algorithm, running on a standard desktop computer and implemented in a high-level programming language, is already faster than the state-of-the-art dedispersion codes running on graphical processing units (GPUs). We also present a variant of the algorithm that can be efficiently implemented on GPUs. The latter algorithm’s computation and data-transport requirements are similar to those of a two-dimensional fast Fourier transform, indicating that incoherent dedispersion can now be considered a nonissue while planning future surveys. We further present a fast algorithm for sensitive detection of pulses shorter than the dispersive smearing limits of incoherent dedispersion. In typical cases, this algorithm is orders of magnitude faster than enumerating dispersion measures and coherently dedispersing by convolution. We analyze the computational complexity of pulsed signal searches by radio interferometers. We conclude that, using our suggested algorithms, maximally sensitive blind searches for dispersed pulses are feasible using existing facilities. We provide an implementation of these algorithms in Python and MATLAB.
Parallel mutual information estimation for inferring gene regulatory networks on GPUs
2011-01-01
Background Mutual information is a measure of similarity between two variables. It has been widely used in various application domains including computational biology, machine learning, statistics, image processing, and financial computing. Previously used simple histogram based mutual information estimators lack the precision in quality compared to kernel based methods. The recently introduced B-spline function based mutual information estimation method is competitive to the kernel based methods in terms of quality but at a lower computational complexity. Results We present a new approach to accelerate the B-spline function based mutual information estimation algorithm with commodity graphics hardware. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) programming model to design and implement a new parallel algorithm. Our implementation, called CUDA-MI, can achieve speedups of up to 82 using double precision on a single GPU compared to a multi-threaded implementation on a quad-core CPU for large microarray datasets. We have used the results obtained by CUDA-MI to infer gene regulatory networks (GRNs) from microarray data. The comparisons to existing methods including ARACNE and TINGe show that CUDA-MI produces GRNs of higher quality in less time. Conclusions CUDA-MI is publicly available open-source software, written in CUDA and C++ programming languages. It obtains significant speedup over sequential multi-threaded implementation by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs. PMID:21672264
An Efficient Pipeline Wavefront Phase Recovery for the CAFADIS Camera for Extremely Large Telescopes
Magdaleno, Eduardo; Rodríguez, Manuel; Rodríguez-Ramos, José Manuel
2010-01-01
In this paper we show a fast, specialized hardware implementation of the wavefront phase recovery algorithm using the CAFADIS camera. The CAFADIS camera is a new plenoptic sensor patented by the Universidad de La Laguna (Canary Islands, Spain): international patent PCT/ES2007/000046 (WIPO publication number WO/2007/082975). It can simultaneously measure the wavefront phase and the distance to the light source in a real-time process. The pipeline algorithm is implemented using Field Programmable Gate Arrays (FPGA). These devices present architecture capable of handling the sensor output stream using a massively parallel approach and they are efficient enough to resolve several Adaptive Optics (AO) problems in Extremely Large Telescopes (ELTs) in terms of processing time requirements. The FPGA implementation of the wavefront phase recovery algorithm using the CAFADIS camera is based on the very fast computation of two dimensional fast Fourier Transforms (FFTs). Thus we have carried out a comparison between our very novel FPGA 2D-FFTa and other implementations. PMID:22315523
Magdaleno, Eduardo; Rodríguez, Manuel; Rodríguez-Ramos, José Manuel
2010-01-01
In this paper we show a fast, specialized hardware implementation of the wavefront phase recovery algorithm using the CAFADIS camera. The CAFADIS camera is a new plenoptic sensor patented by the Universidad de La Laguna (Canary Islands, Spain): international patent PCT/ES2007/000046 (WIPO publication number WO/2007/082975). It can simultaneously measure the wavefront phase and the distance to the light source in a real-time process. The pipeline algorithm is implemented using Field Programmable Gate Arrays (FPGA). These devices present architecture capable of handling the sensor output stream using a massively parallel approach and they are efficient enough to resolve several Adaptive Optics (AO) problems in Extremely Large Telescopes (ELTs) in terms of processing time requirements. The FPGA implementation of the wavefront phase recovery algorithm using the CAFADIS camera is based on the very fast computation of two dimensional fast Fourier Transforms (FFTs). Thus we have carried out a comparison between our very novel FPGA 2D-FFTa and other implementations.
SVM classifier on chip for melanoma detection.
Afifi, Shereen; GholamHosseini, Hamid; Sinha, Roopak
2017-07-01
Support Vector Machine (SVM) is a common classifier used for efficient classification with high accuracy. SVM shows high accuracy for classifying melanoma (skin cancer) clinical images within computer-aided diagnosis systems used by skin cancer specialists to detect melanoma early and save lives. We aim to develop a medical low-cost handheld device that runs a real-time embedded SVM-based diagnosis system for use in primary care for early detection of melanoma. In this paper, an optimized SVM classifier is implemented onto a recent FPGA platform using the latest design methodology to be embedded into the proposed device for realizing online efficient melanoma detection on a single system on chip/device. The hardware implementation results demonstrate a high classification accuracy of 97.9% and a significant acceleration factor of 26 from equivalent software implementation on an embedded processor, with 34% of resources utilization and 2 watts for power consumption. Consequently, the implemented system meets crucial embedded systems constraints of high performance and low cost, resources utilization and power consumption, while achieving high classification accuracy.
Accelerating the coupled-cluster singles and doubles method using the chain-of-sphere approximation
NASA Astrophysics Data System (ADS)
Dutta, Achintya Kumar; Neese, Frank; Izsák, Róbert
2018-06-01
In this paper, we present a chain-of-sphere implementation of the external exchange term, the computational bottleneck of coupled-cluster calculations at the singles and doubles level. This implementation is compared to standard molecular orbital, atomic orbital and resolution of identity implementations of the same term within the ORCA package and turns out to be the most efficient one for larger molecules, with a better accuracy than the resolution-of-identity approximation. Furthermore, it becomes possible to perform a canonical CC calculation on a tetramer of nucleobases in 17 days, 20 hours.
Algorithms for classification of astronomical object spectra
NASA Astrophysics Data System (ADS)
Wasiewicz, P.; Szuppe, J.; Hryniewicz, K.
2015-09-01
Obtaining interesting celestial objects from tens of thousands or even millions of recorded optical-ultraviolet spectra depends not only on the data quality but also on the accuracy of spectra decomposition. Additionally rapidly growing data volumes demands higher computing power and/or more efficient algorithms implementations. In this paper we speed up the process of substracting iron transitions and fitting Gaussian functions to emission peaks utilising C++ and OpenCL methods together with the NOSQL database. In this paper we implemented typical astronomical methods of detecting peaks in comparison to our previous hybrid methods implemented with CUDA.
NASA Technical Reports Server (NTRS)
1976-01-01
The primary objective of this study was to develop an integrated approach for the development, implementation, and utilization of all software that is required to efficiently and cost-effectively support advanced technology laboratory flight and ground operations. It was recognized that certain aspects of the operations would be mandatory computerized services; computerization of other aspects would be optional. Thus, the analyses encompassed not only alternate computer utilization and implementations but trade studies of the programmatic effects of non-computerized versus computerized approaches to the operations. A general overview of the study is presented.
Arias-Vimárlund, V.; Ljunggren, M.; Timpka, T.
1996-01-01
OBJECTIVE: Exploration of the societal health economic effects occurring during the first year after implementation of Computerised Patient Records (CPRs) at Primary Health Care (PHC) centres. DESIGN: Comparative case studies of practice processes and their consequences one year after CPR implementation, using the constant comparison method. Application of transaction-cost analyses at a societal level on the results. SETTING: Two urban PHC centres under a managed care contract in Ostergötland county, Sweden. MAIN OUTCOME MEASURES: Central implementation issues. First-year societal direct normal costs, direct unexpected costs, and indirect costs. Societal benefits. RESULTS: The total societal effect of the CPR implementation was a cost of nearly 250,000 SEK (USD 37,000) per GP team. About 20% of the effect consisted of direct unexpected costs, accured from the reduction of practitioners' leisure time. The main issues in the implementation process were medical informatics knowledge and computer skills, adaptation of the human-computer interaction design to practice routines, and information access through the CPR. CONCLUSIONS: The societal costs exceed the benefits during the first year after CPR implementation at the observed PHC centres. Early investments in requirements engineering and staff training may increase the efficiency. Exploitation of the CPR for disease prevention and clinical quality improvement is necessary to defend the investment in societal terms. The exact calculation of societal costs requires further analysis of the affected groups' willingness to pay. PMID:8947717
ParCAT: A Parallel Climate Analysis Toolkit
NASA Astrophysics Data System (ADS)
Haugen, B.; Smith, B.; Steed, C.; Ricciuto, D. M.; Thornton, P. E.; Shipman, G.
2012-12-01
Climate science has employed increasingly complex models and simulations to analyze the past and predict the future of our climate. The size and dimensionality of climate simulation data has been growing with the complexity of the models. This growth in data is creating a widening gap between the data being produced and the tools necessary to analyze large, high dimensional data sets. With single run data sets increasing into 10's, 100's and even 1000's of gigabytes, parallel computing tools are becoming a necessity in order to analyze and compare climate simulation data. The Parallel Climate Analysis Toolkit (ParCAT) provides basic tools that efficiently use parallel computing techniques to narrow the gap between data set size and analysis tools. ParCAT was created as a collaborative effort between climate scientists and computer scientists in order to provide efficient parallel implementations of the computing tools that are of use to climate scientists. Some of the basic functionalities included in the toolkit are the ability to compute spatio-temporal means and variances, differences between two runs and histograms of the values in a data set. ParCAT is designed to facilitate the "heavy lifting" that is required for large, multidimensional data sets. The toolkit does not focus on performing the final visualizations and presentation of results but rather, reducing large data sets to smaller, more manageable summaries. The output from ParCAT is provided in commonly used file formats (NetCDF, CSV, ASCII) to allow for simple integration with other tools. The toolkit is currently implemented as a command line utility, but will likely also provide a C library for developers interested in tighter software integration. Elements of the toolkit are already being incorporated into projects such as UV-CDAT and CMDX. There is also an effort underway to implement portions of the CCSM Land Model Diagnostics package using ParCAT in conjunction with Python and gnuplot. ParCAT is implemented in C to provide efficient file IO. The file IO operations in the toolkit use the parallel-netcdf library; this enables the code to use the parallel IO capabilities of modern HPC systems. Analysis that currently requires an estimated 12+ hours with the traditional CCSM Land Model Diagnostics Package can now be performed in as little as 30 minutes on a single desktop workstation and a few minutes for relatively small jobs completed on modern HPC systems such as ORNL's Jaguar.
NASA Astrophysics Data System (ADS)
Dickson, Bradley M.; de Waal, Parker W.; Ramjan, Zachary H.; Xu, H. Eric; Rothbart, Scott B.
2016-10-01
In this communication we introduce an efficient implementation of adaptive biasing that greatly improves the speed of free energy computation in molecular dynamics simulations. We investigated the use of accelerated simulations to inform on compound design using a recently reported and clinically relevant inhibitor of the chromatin regulator BRD4 (bromodomain-containing protein 4). Benchmarking on our local compute cluster, our implementation achieves up to 2.5 times more force calls per day than plumed2. Results of five 1 μs-long simulations are presented, which reveal a conformational switch in the BRD4 inhibitor between a binding competent and incompetent state. Stabilization of the switch led to a -3 kcal/mol improvement of absolute binding free energy. These studies suggest an unexplored ligand design principle and offer new actionable hypotheses for medicinal chemistry efforts against this druggable epigenetic target class.
Dickson, Bradley M.; Ramjan, Zachary H.; Xu, H. Eric
2016-01-01
In this communication we introduce an efficient implementation of adaptive biasing that greatly improves the speed of free energy computation in molecular dynamics simulations. We investigated the use of accelerated simulations to inform on compound design using a recently reported and clinically relevant inhibitor of the chromatin regulator BRD4 (bromodomain-containing protein 4). Benchmarking on our local compute cluster, our implementation achieves up to 2.5 times more force calls per day than plumed2. Results of five 1 μs-long simulations are presented, which reveal a conformational switch in the BRD4 inhibitor between a binding competent and incompetent state. Stabilization of the switch led to a −3 kcal/mol improvement of absolute binding free energy. These studies suggest an unexplored ligand design principle and offer new actionable hypotheses for medicinal chemistry efforts against this druggable epigenetic target class. PMID:27782467
Adaptive DIT-Based Fringe Tracking and Prediction at IOTA
NASA Technical Reports Server (NTRS)
Wilson, Edward; Pedretti, Ettore; Bregman, Jesse; Mah, Robert W.; Traub, Wesley A.
2004-01-01
An automatic fringe tracking system has been developed and implemented at the Infrared Optical Telescope Array (IOTA). In testing during May 2002, the system successfully minimized the optical path differences (OPDs) for all three baselines at IOTA. Based on sliding window discrete Fourier transform (DFT) calculations that were optimized for computational efficiency and robustness to atmospheric disturbances, the algorithm has also been tested extensively on off-line data. Implemented in ANSI C on the 266 MHZ PowerPC processor running the VxWorks real-time operating system, the algorithm runs in approximately 2.0 milliseconds per scan (including all three interferograms), using the science camera and piezo scanners to measure and correct the OPDs. Preliminary analysis on an extension of this algorithm indicates a potential for predictive tracking, although at present, real-time implementation of this extension would require significantly more computational capacity.
High-throughput Bayesian Network Learning using Heterogeneous Multicore Computers
Linderman, Michael D.; Athalye, Vivek; Meng, Teresa H.; Asadi, Narges Bani; Bruggner, Robert; Nolan, Garry P.
2017-01-01
Aberrant intracellular signaling plays an important role in many diseases. The causal structure of signal transduction networks can be modeled as Bayesian Networks (BNs), and computationally learned from experimental data. However, learning the structure of Bayesian Networks (BNs) is an NP-hard problem that, even with fast heuristics, is too time consuming for large, clinically important networks (20–50 nodes). In this paper, we present a novel graphics processing unit (GPU)-accelerated implementation of a Monte Carlo Markov Chain-based algorithm for learning BNs that is up to 7.5-fold faster than current general-purpose processor (GPP)-based implementations. The GPU-based implementation is just one of several implementations within the larger application, each optimized for a different input or machine configuration. We describe the methodology we use to build an extensible application, assembled from these variants, that can target a broad range of heterogeneous systems, e.g., GPUs, multicore GPPs. Specifically we show how we use the Merge programming model to efficiently integrate, test and intelligently select among the different potential implementations. PMID:28819655
Production experience with the ATLAS Event Service
NASA Astrophysics Data System (ADS)
Benjamin, D.; Calafiura, P.; Childers, T.; De, K.; Guan, W.; Maeno, T.; Nilsson, P.; Tsulaia, V.; Van Gemmeren, P.; Wenaus, T.; ATLAS Collaboration
2017-10-01
The ATLAS Event Service (AES) has been designed and implemented for efficient running of ATLAS production workflows on a variety of computing platforms, ranging from conventional Grid sites to opportunistic, often short-lived resources, such as spot market commercial clouds, supercomputers and volunteer computing. The Event Service architecture allows real time delivery of fine grained workloads to running payload applications which process dispatched events or event ranges and immediately stream the outputs to highly scalable Object Stores. Thanks to its agile and flexible architecture the AES is currently being used by grid sites for assigning low priority workloads to otherwise idle computing resources; similarly harvesting HPC resources in an efficient back-fill mode; and massively scaling out to the 50-100k concurrent core level on the Amazon spot market to efficiently utilize those transient resources for peak production needs. Platform ports in development include ATLAS@Home (BOINC) and the Google Compute Engine, and a growing number of HPC platforms. After briefly reviewing the concept and the architecture of the Event Service, we will report the status and experience gained in AES commissioning and production operations on supercomputers, and our plans for extending ES application beyond Geant4 simulation to other workflows, such as reconstruction and data analysis.
Kossert, K; Cassette, Ph; Carles, A Grau; Jörg, G; Gostomski, Christroph Lierse V; Nähle, O; Wolf, Ch
2014-05-01
The triple-to-double coincidence ratio (TDCR) method is frequently used to measure the activity of radionuclides decaying by pure β emission or electron capture (EC). Some radionuclides with more complex decays have also been studied, but accurate calculations of decay branches which are accompanied by many coincident γ transitions have not yet been investigated. This paper describes recent extensions of the model to make efficiency computations for more complex decay schemes possible. In particular, the MICELLE2 program that applies a stochastic approach of the free parameter model was extended. With an improved code, efficiencies for β(-), β(+) and EC branches with up to seven coincident γ transitions can be calculated. Moreover, a new parametrization for the computation of electron stopping powers has been implemented to compute the ionization quenching function of 10 commercial scintillation cocktails. In order to demonstrate the capabilities of the TDCR method, the following radionuclides are discussed: (166m)Ho (complex β(-)/γ), (59)Fe (complex β(-)/γ), (64)Cu (β(-), β(+), EC and EC/γ) and (229)Th in equilibrium with its progenies (decay chain with many α, β and complex β(-)/γ transitions). © 2013 Published by Elsevier Ltd.
Improving robustness and computational efficiency using modern C++
DOE Office of Scientific and Technical Information (OSTI.GOV)
Paterno, M.; Kowalkowski, J.; Green, C.
2014-01-01
For nearly two decades, the C++ programming language has been the dominant programming language for experimental HEP. The publication of ISO/IEC 14882:2011, the current version of the international standard for the C++ programming language, makes available a variety of language and library facilities for improving the robustness, expressiveness, and computational efficiency of C++ code. However, much of the C++ written by the experimental HEP community does not take advantage of the features of the language to obtain these benefits, either due to lack of familiarity with these features or concern that these features must somehow be computationally inefficient. In thismore » paper, we address some of the features of modern C+-+, and show how they can be used to make programs that are both robust and computationally efficient. We compare and contrast simple yet realistic examples of some common implementation patterns in C, currently-typical C++, and modern C++, and show (when necessary, down to the level of generated assembly language code) the quality of the executable code produced by recent C++ compilers, with the aim of allowing the HEP community to make informed decisions on the costs and benefits of the use of modern C++.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krueger, Jens; Micikevicius, Paulius; Williams, Samuel
Reverse Time Migration (RTM) is one of the main approaches in the seismic processing industry for imaging the subsurface structure of the Earth. While RTM provides qualitative advantages over its predecessors, it has a high computational cost warranting implementation on HPC architectures. We focus on three progressively more complex kernels extracted from RTM: for isotropic (ISO), vertical transverse isotropic (VTI) and tilted transverse isotropic (TTI) media. In this work, we examine performance optimization of forward wave modeling, which describes the computational kernels used in RTM, on emerging multi- and manycore processors and introduce a novel common subexpression elimination optimization formore » TTI kernels. We compare attained performance and energy efficiency in both the single-node and distributed memory environments in order to satisfy industry’s demands for fidelity, performance, and energy efficiency. Moreover, we discuss the interplay between architecture (chip and system) and optimizations (both on-node computation) highlighting the importance of NUMA-aware approaches to MPI communication. Ultimately, our results show we can improve CPU energy efficiency by more than 10× on Magny Cours nodes while acceleration via multiple GPUs can surpass the energy-efficient Intel Sandy Bridge by as much as 3.6×.« less
Numerical Algorithms for Acoustic Integrals - The Devil is in the Details
NASA Technical Reports Server (NTRS)
Brentner, Kenneth S.
1996-01-01
The accurate prediction of the aeroacoustic field generated by aerospace vehicles or nonaerospace machinery is necessary for designers to control and reduce source noise. Powerful computational aeroacoustic methods, based on various acoustic analogies (primarily the Lighthill acoustic analogy) and Kirchhoff methods, have been developed for prediction of noise from complicated sources, such as rotating blades. Both methods ultimately predict the noise through a numerical evaluation of an integral formulation. In this paper, we consider three generic acoustic formulations and several numerical algorithms that have been used to compute the solutions to these formulations. Algorithms for retarded-time formulations are the most efficient and robust, but they are difficult to implement for supersonic-source motion. Collapsing-sphere and emission-surface formulations are good alternatives when supersonic-source motion is present, but the numerical implementations of these formulations are more computationally demanding. New algorithms - which utilize solution adaptation to provide a specified error level - are needed.
Solving very large, sparse linear systems on mesh-connected parallel computers
NASA Technical Reports Server (NTRS)
Opsahl, Torstein; Reif, John
1987-01-01
The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard is given. Also, a detailed discussion of data mappings and performance issues is given.
NASA Astrophysics Data System (ADS)
Jia, Weile; Wang, Jue; Chi, Xuebin; Wang, Lin-Wang
2017-02-01
LS3DF, namely linear scaling three-dimensional fragment method, is an efficient linear scaling ab initio total energy electronic structure calculation code based on a divide-and-conquer strategy. In this paper, we present our GPU implementation of the LS3DF code. Our test results show that the GPU code can calculate systems with about ten thousand atoms fully self-consistently in the order of 10 min using thousands of computing nodes. This makes the electronic structure calculations of 10,000-atom nanosystems routine work. This speed is 4.5-6 times faster than the CPU calculations using the same number of nodes on the Titan machine in the Oak Ridge leadership computing facility (OLCF). Such speedup is achieved by (a) carefully re-designing of the computationally heavy kernels; (b) redesign of the communication pattern for heterogeneous supercomputers.
Nonlinear power flow feedback control for improved stability and performance of airfoil sections
Wilson, David G.; Robinett, III, Rush D.
2013-09-03
A computer-implemented method of determining the pitch stability of an airfoil system, comprising using a computer to numerically integrate a differential equation of motion that includes terms describing PID controller action. In one model, the differential equation characterizes the time-dependent response of the airfoil's pitch angle, .alpha.. The computer model calculates limit-cycles of the model, which represent the stability boundaries of the airfoil system. Once the stability boundary is known, feedback control can be implemented, by using, for example, a PID controller to control a feedback actuator. The method allows the PID controller gain constants, K.sub.I, K.sub.p, and K.sub.d, to be optimized. This permits operation closer to the stability boundaries, while preventing the physical apparatus from unintentionally crossing the stability boundaries. Operating closer to the stability boundaries permits greater power efficiencies to be extracted from the airfoil system.
Song, Chao; Zheng, Shi-Biao; Zhang, Pengfei; Xu, Kai; Zhang, Libo; Guo, Qiujiang; Liu, Wuxin; Xu, Da; Deng, Hui; Huang, Keqiang; Zheng, Dongning; Zhu, Xiaobo; Wang, H
2017-10-20
Geometric phase, associated with holonomy transformation in quantum state space, is an important quantum-mechanical effect. Besides fundamental interest, this effect has practical applications, among which geometric quantum computation is a paradigm, where quantum logic operations are realized through geometric phase manipulation that has some intrinsic noise-resilient advantages and may enable simplified implementation of multi-qubit gates compared to the dynamical approach. Here we report observation of a continuous-variable geometric phase and demonstrate a quantum gate protocol based on this phase in a superconducting circuit, where five qubits are controllably coupled to a resonator. Our geometric approach allows for one-step implementation of n-qubit controlled-phase gates, which represents a remarkable advantage compared to gate decomposition methods, where the number of required steps dramatically increases with n. Following this approach, we realize these gates with n up to 4, verifying the high efficiency of this geometric manipulation for quantum computation.