Development and Applications of a Modular Parallel Process for Large Scale Fluid/Structures Problems
NASA Technical Reports Server (NTRS)
Guruswamy, Guru P.; Byun, Chansup; Kwak, Dochan (Technical Monitor)
2001-01-01
A modular process that can efficiently solve large scale multidisciplinary problems using massively parallel super computers is presented. The process integrates disciplines with diverse physical characteristics by retaining the efficiency of individual disciplines. Computational domain independence of individual disciplines is maintained using a meta programming approach. The process integrates disciplines without affecting the combined performance. Results are demonstrated for large scale aerospace problems on several supercomputers. The super scalability and portability of the approach is demonstrated on several parallel computers.
Jackin, Boaz Jessie; Watanabe, Shinpei; Ootsu, Kanemitsu; Ohkawa, Takeshi; Yokota, Takashi; Hayasaki, Yoshio; Yatagai, Toyohiko; Baba, Takanobu
2018-04-20
A parallel computation method for large-size Fresnel computer-generated hologram (CGH) is reported. The method was introduced by us in an earlier report as a technique for calculating Fourier CGH from 2D object data. In this paper we extend the method to compute Fresnel CGH from 3D object data. The scale of the computation problem is also expanded to 2 gigapixels, making it closer to real application requirements. The significant feature of the reported method is its ability to avoid communication overhead and thereby fully utilize the computing power of parallel devices. The method exhibits three layers of parallelism that favor small to large scale parallel computing machines. Simulation and optical experiments were conducted to demonstrate the workability and to evaluate the efficiency of the proposed technique. A two-times improvement in computation speed has been achieved compared to the conventional method, on a 16-node cluster (one GPU per node) utilizing only one layer of parallelism. A 20-times improvement in computation speed has been estimated utilizing two layers of parallelism on a very large-scale parallel machine with 16 nodes, where each node has 16 GPUs.
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O. (Editor); Housner, Jerrold M. (Editor)
1993-01-01
Computing speed is leaping forward by several orders of magnitude each decade. Engineers and scientists gathered at a NASA Langley symposium to discuss these exciting trends as they apply to parallel computational methods for large-scale structural analysis and design. Among the topics discussed were: large-scale static analysis; dynamic, transient, and thermal analysis; domain decomposition (substructuring); and nonlinear and numerical methods.
Lee, Jae H.; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T.; Seo, Youngho
2014-01-01
The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting. PMID:27081299
Lee, Jae H; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T; Seo, Youngho
2014-11-01
The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting.
Development and Applications of a Modular Parallel Process for Large Scale Fluid/Structures Problems
NASA Technical Reports Server (NTRS)
Guruswamy, Guru P.; Kwak, Dochan (Technical Monitor)
2002-01-01
A modular process that can efficiently solve large scale multidisciplinary problems using massively parallel supercomputers is presented. The process integrates disciplines with diverse physical characteristics by retaining the efficiency of individual disciplines. Computational domain independence of individual disciplines is maintained using a meta programming approach. The process integrates disciplines without affecting the combined performance. Results are demonstrated for large scale aerospace problems on several supercomputers. The super scalability and portability of the approach is demonstrated on several parallel computers.
Ma, Li; Runesha, H Birali; Dvorkin, Daniel; Garbe, John R; Da, Yang
2008-01-01
Background Genome-wide association studies (GWAS) using single nucleotide polymorphism (SNP) markers provide opportunities to detect epistatic SNPs associated with quantitative traits and to detect the exact mode of an epistasis effect. Computational difficulty is the main bottleneck for epistasis testing in large scale GWAS. Results The EPISNPmpi and EPISNP computer programs were developed for testing single-locus and epistatic SNP effects on quantitative traits in GWAS, including tests of three single-locus effects for each SNP (SNP genotypic effect, additive and dominance effects) and five epistasis effects for each pair of SNPs (two-locus interaction, additive × additive, additive × dominance, dominance × additive, and dominance × dominance) based on the extended Kempthorne model. EPISNPmpi is the parallel computing program for epistasis testing in large scale GWAS and achieved excellent scalability for large scale analysis and portability for various parallel computing platforms. EPISNP is the serial computing program based on the EPISNPmpi code for epistasis testing in small scale GWAS using commonly available operating systems and computer hardware. Three serial computing utility programs were developed for graphical viewing of test results and epistasis networks, and for estimating CPU time and disk space requirements. Conclusion The EPISNPmpi parallel computing program provides an effective computing tool for epistasis testing in large scale GWAS, and the epiSNP serial computing programs are convenient tools for epistasis analysis in small scale GWAS using commonly available computer hardware. PMID:18644146
NASA Astrophysics Data System (ADS)
Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng
2018-02-01
De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significant reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
Advances in Parallelization for Large Scale Oct-Tree Mesh Generation
NASA Technical Reports Server (NTRS)
O'Connell, Matthew; Karman, Steve L.
2015-01-01
Despite great advancements in the parallelization of numerical simulation codes over the last 20 years, it is still common to perform grid generation in serial. Generating large scale grids in serial often requires using special "grid generation" compute machines that can have more than ten times the memory of average machines. While some parallel mesh generation techniques have been proposed, generating very large meshes for LES or aeroacoustic simulations is still a challenging problem. An automated method for the parallel generation of very large scale off-body hierarchical meshes is presented here. This work enables large scale parallel generation of off-body meshes by using a novel combination of parallel grid generation techniques and a hybrid "top down" and "bottom up" oct-tree method. Meshes are generated using hardware commonly found in parallel compute clusters. The capability to generate very large meshes is demonstrated by the generation of off-body meshes surrounding complex aerospace geometries. Results are shown including a one billion cell mesh generated around a Predator Unmanned Aerial Vehicle geometry, which was generated on 64 processors in under 45 minutes.
Applications of Parallel Process HiMAP for Large Scale Multidisciplinary Problems
NASA Technical Reports Server (NTRS)
Guruswamy, Guru P.; Potsdam, Mark; Rodriguez, David; Kwak, Dochay (Technical Monitor)
2000-01-01
HiMAP is a three level parallel middleware that can be interfaced to a large scale global design environment for code independent, multidisciplinary analysis using high fidelity equations. Aerospace technology needs are rapidly changing. Computational tools compatible with the requirements of national programs such as space transportation are needed. Conventional computation tools are inadequate for modern aerospace design needs. Advanced, modular computational tools are needed, such as those that incorporate the technology of massively parallel processors (MPP).
Linear static structural and vibration analysis on high-performance computers
NASA Technical Reports Server (NTRS)
Baddourah, M. A.; Storaasli, O. O.; Bostic, S. W.
1993-01-01
Parallel computers offer the oppurtunity to significantly reduce the computation time necessary to analyze large-scale aerospace structures. This paper presents algorithms developed for and implemented on massively-parallel computers hereafter referred to as Scalable High-Performance Computers (SHPC), for the most computationally intensive tasks involved in structural analysis, namely, generation and assembly of system matrices, solution of systems of equations and calculation of the eigenvalues and eigenvectors. Results on SHPC are presented for large-scale structural problems (i.e. models for High-Speed Civil Transport). The goal of this research is to develop a new, efficient technique which extends structural analysis to SHPC and makes large-scale structural analyses tractable.
Efficient parallelization of analytic bond-order potentials for large-scale atomistic simulations
NASA Astrophysics Data System (ADS)
Teijeiro, C.; Hammerschmidt, T.; Drautz, R.; Sutmann, G.
2016-07-01
Analytic bond-order potentials (BOPs) provide a way to compute atomistic properties with controllable accuracy. For large-scale computations of heterogeneous compounds at the atomistic level, both the computational efficiency and memory demand of BOP implementations have to be optimized. Since the evaluation of BOPs is a local operation within a finite environment, the parallelization concepts known from short-range interacting particle simulations can be applied to improve the performance of these simulations. In this work, several efficient parallelization methods for BOPs that use three-dimensional domain decomposition schemes are described. The schemes are implemented into the bond-order potential code BOPfox, and their performance is measured in a series of benchmarks. Systems of up to several millions of atoms are simulated on a high performance computing system, and parallel scaling is demonstrated for up to thousands of processors.
Performance of parallel computation using CUDA for solving the one-dimensional elasticity equations
NASA Astrophysics Data System (ADS)
Darmawan, J. B. B.; Mungkasi, S.
2017-01-01
In this paper, we investigate the performance of parallel computation in solving the one-dimensional elasticity equations. Elasticity equations are usually implemented in engineering science. Solving these equations fast and efficiently is desired. Therefore, we propose the use of parallel computation. Our parallel computation uses CUDA of the NVIDIA. Our research results show that parallel computation using CUDA has a great advantage and is powerful when the computation is of large scale.
OpenMP parallelization of a gridded SWAT (SWATG)
NASA Astrophysics Data System (ADS)
Zhang, Ying; Hou, Jinliang; Cao, Yongpan; Gu, Juan; Huang, Chunlin
2017-12-01
Large-scale, long-term and high spatial resolution simulation is a common issue in environmental modeling. A Gridded Hydrologic Response Unit (HRU)-based Soil and Water Assessment Tool (SWATG) that integrates grid modeling scheme with different spatial representations also presents such problems. The time-consuming problem affects applications of very high resolution large-scale watershed modeling. The OpenMP (Open Multi-Processing) parallel application interface is integrated with SWATG (called SWATGP) to accelerate grid modeling based on the HRU level. Such parallel implementation takes better advantage of the computational power of a shared memory computer system. We conducted two experiments at multiple temporal and spatial scales of hydrological modeling using SWATG and SWATGP on a high-end server. At 500-m resolution, SWATGP was found to be up to nine times faster than SWATG in modeling over a roughly 2000 km2 watershed with 1 CPU and a 15 thread configuration. The study results demonstrate that parallel models save considerable time relative to traditional sequential simulation runs. Parallel computations of environmental models are beneficial for model applications, especially at large spatial and temporal scales and at high resolutions. The proposed SWATGP model is thus a promising tool for large-scale and high-resolution water resources research and management in addition to offering data fusion and model coupling ability.
A Metascalable Computing Framework for Large Spatiotemporal-Scale Atomistic Simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nomura, K; Seymour, R; Wang, W
2009-02-17
A metascalable (or 'design once, scale on new architectures') parallel computing framework has been developed for large spatiotemporal-scale atomistic simulations of materials based on spatiotemporal data locality principles, which is expected to scale on emerging multipetaflops architectures. The framework consists of: (1) an embedded divide-and-conquer (EDC) algorithmic framework based on spatial locality to design linear-scaling algorithms for high complexity problems; (2) a space-time-ensemble parallel (STEP) approach based on temporal locality to predict long-time dynamics, while introducing multiple parallelization axes; and (3) a tunable hierarchical cellular decomposition (HCD) parallelization framework to map these O(N) algorithms onto a multicore cluster based onmore » hybrid implementation combining message passing and critical section-free multithreading. The EDC-STEP-HCD framework exposes maximal concurrency and data locality, thereby achieving: (1) inter-node parallel efficiency well over 0.95 for 218 billion-atom molecular-dynamics and 1.68 trillion electronic-degrees-of-freedom quantum-mechanical simulations on 212,992 IBM BlueGene/L processors (superscalability); (2) high intra-node, multithreading parallel efficiency (nanoscalability); and (3) nearly perfect time/ensemble parallel efficiency (eon-scalability). The spatiotemporal scale covered by MD simulation on a sustained petaflops computer per day (i.e. petaflops {center_dot} day of computing) is estimated as NT = 2.14 (e.g. N = 2.14 million atoms for T = 1 microseconds).« less
Traffic Simulations on Parallel Computers Using Domain Decomposition Techniques
DOT National Transportation Integrated Search
1995-01-01
Large scale simulations of Intelligent Transportation Systems (ITS) can only be acheived by using the computing resources offered by parallel computing architectures. Domain decomposition techniques are proposed which allow the performance of traffic...
NASA Technical Reports Server (NTRS)
Nguyen, D. T.; Watson, Willie R. (Technical Monitor)
2005-01-01
The overall objectives of this research work are to formulate and validate efficient parallel algorithms, and to efficiently design/implement computer software for solving large-scale acoustic problems, arised from the unified frameworks of the finite element procedures. The adopted parallel Finite Element (FE) Domain Decomposition (DD) procedures should fully take advantages of multiple processing capabilities offered by most modern high performance computing platforms for efficient parallel computation. To achieve this objective. the formulation needs to integrate efficient sparse (and dense) assembly techniques, hybrid (or mixed) direct and iterative equation solvers, proper pre-conditioned strategies, unrolling strategies, and effective processors' communicating schemes. Finally, the numerical performance of the developed parallel finite element procedures will be evaluated by solving series of structural, and acoustic (symmetrical and un-symmetrical) problems (in different computing platforms). Comparisons with existing "commercialized" and/or "public domain" software are also included, whenever possible.
Parallel Simulation of Unsteady Turbulent Flames
NASA Technical Reports Server (NTRS)
Menon, Suresh
1996-01-01
Time-accurate simulation of turbulent flames in high Reynolds number flows is a challenging task since both fluid dynamics and combustion must be modeled accurately. To numerically simulate this phenomenon, very large computer resources (both time and memory) are required. Although current vector supercomputers are capable of providing adequate resources for simulations of this nature, the high cost and their limited availability, makes practical use of such machines less than satisfactory. At the same time, the explicit time integration algorithms used in unsteady flow simulations often possess a very high degree of parallelism, making them very amenable to efficient implementation on large-scale parallel computers. Under these circumstances, distributed memory parallel computers offer an excellent near-term solution for greatly increased computational speed and memory, at a cost that may render the unsteady simulations of the type discussed above more feasible and affordable.This paper discusses the study of unsteady turbulent flames using a simulation algorithm that is capable of retaining high parallel efficiency on distributed memory parallel architectures. Numerical studies are carried out using large-eddy simulation (LES). In LES, the scales larger than the grid are computed using a time- and space-accurate scheme, while the unresolved small scales are modeled using eddy viscosity based subgrid models. This is acceptable for the moment/energy closure since the small scales primarily provide a dissipative mechanism for the energy transferred from the large scales. However, for combustion to occur, the species must first undergo mixing at the small scales and then come into molecular contact. Therefore, global models cannot be used. Recently, a new model for turbulent combustion was developed, in which the combustion is modeled, within the subgrid (small-scales) using a methodology that simulates the mixing and the molecular transport and the chemical kinetics within each LES grid cell. Finite-rate kinetics can be included without any closure and this approach actually provides a means to predict the turbulent rates and the turbulent flame speed. The subgrid combustion model requires resolution of the local time scales associated with small-scale mixing, molecular diffusion and chemical kinetics and, therefore, within each grid cell, a significant amount of computations must be carried out before the large-scale (LES resolved) effects are incorporated. Therefore, this approach is uniquely suited for parallel processing and has been implemented on various systems such as: Intel Paragon, IBM SP-2, Cray T3D and SGI Power Challenge (PC) using the system independent Message Passing Interface (MPI) compiler. In this paper, timing data on these machines is reported along with some characteristic results.
Portable parallel stochastic optimization for the design of aeropropulsion components
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Rhodes, G. S.
1994-01-01
This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
Penas, David R; González, Patricia; Egea, Jose A; Doallo, Ramón; Banga, Julio R
2017-01-21
The development of large-scale kinetic models is one of the current key issues in computational systems biology and bioinformatics. Here we consider the problem of parameter estimation in nonlinear dynamic models. Global optimization methods can be used to solve this type of problems but the associated computational cost is very large. Moreover, many of these methods need the tuning of a number of adjustable search parameters, requiring a number of initial exploratory runs and therefore further increasing the computation times. Here we present a novel parallel method, self-adaptive cooperative enhanced scatter search (saCeSS), to accelerate the solution of this class of problems. The method is based on the scatter search optimization metaheuristic and incorporates several key new mechanisms: (i) asynchronous cooperation between parallel processes, (ii) coarse and fine-grained parallelism, and (iii) self-tuning strategies. The performance and robustness of saCeSS is illustrated by solving a set of challenging parameter estimation problems, including medium and large-scale kinetic models of the bacterium E. coli, bakerés yeast S. cerevisiae, the vinegar fly D. melanogaster, Chinese Hamster Ovary cells, and a generic signal transduction network. The results consistently show that saCeSS is a robust and efficient method, allowing very significant reduction of computation times with respect to several previous state of the art methods (from days to minutes, in several cases) even when only a small number of processors is used. The new parallel cooperative method presented here allows the solution of medium and large scale parameter estimation problems in reasonable computation times and with small hardware requirements. Further, the method includes self-tuning mechanisms which facilitate its use by non-experts. We believe that this new method can play a key role in the development of large-scale and even whole-cell dynamic models.
Visual analysis of inter-process communication for large-scale parallel computing.
Muelder, Chris; Gygi, Francois; Ma, Kwan-Liu
2009-01-01
In serial computation, program profiling is often helpful for optimization of key sections of code. When moving to parallel computation, not only does the code execution need to be considered but also communication between the different processes which can induce delays that are detrimental to performance. As the number of processes increases, so does the impact of the communication delays on performance. For large-scale parallel applications, it is critical to understand how the communication impacts performance in order to make the code more efficient. There are several tools available for visualizing program execution and communications on parallel systems. These tools generally provide either views which statistically summarize the entire program execution or process-centric views. However, process-centric visualizations do not scale well as the number of processes gets very large. In particular, the most common representation of parallel processes is a Gantt char t with a row for each process. As the number of processes increases, these charts can become difficult to work with and can even exceed screen resolution. We propose a new visualization approach that affords more scalability and then demonstrate it on systems running with up to 16,384 processes.
Grid-Enabled Quantitative Analysis of Breast Cancer
2010-10-01
large-scale, multi-modality computerized image analysis . The central hypothesis of this research is that large-scale image analysis for breast cancer...research, we designed a pilot study utilizing large scale parallel Grid computing harnessing nationwide infrastructure for medical image analysis . Also
NASA Astrophysics Data System (ADS)
Newman, Gregory A.
2014-01-01
Many geoscientific applications exploit electrostatic and electromagnetic fields to interrogate and map subsurface electrical resistivity—an important geophysical attribute for characterizing mineral, energy, and water resources. In complex three-dimensional geologies, where many of these resources remain to be found, resistivity mapping requires large-scale modeling and imaging capabilities, as well as the ability to treat significant data volumes, which can easily overwhelm single-core and modest multicore computing hardware. To treat such problems requires large-scale parallel computational resources, necessary for reducing the time to solution to a time frame acceptable to the exploration process. The recognition that significant parallel computing processes must be brought to bear on these problems gives rise to choices that must be made in parallel computing hardware and software. In this review, some of these choices are presented, along with the resulting trade-offs. We also discuss future trends in high-performance computing and the anticipated impact on electromagnetic (EM) geophysics. Topics discussed in this review article include a survey of parallel computing platforms, graphics processing units to multicore CPUs with a fast interconnect, along with effective parallel solvers and associated solver libraries effective for inductive EM modeling and imaging.
Wong, William W L; Feng, Zeny Z; Thein, Hla-Hla
2016-11-01
Agent-based models (ABMs) are computer simulation models that define interactions among agents and simulate emergent behaviors that arise from the ensemble of local decisions. ABMs have been increasingly used to examine trends in infectious disease epidemiology. However, the main limitation of ABMs is the high computational cost for a large-scale simulation. To improve the computational efficiency for large-scale ABM simulations, we built a parallelizable sliding region algorithm (SRA) for ABM and compared it to a nonparallelizable ABM. We developed a complex agent network and performed two simulations to model hepatitis C epidemics based on the real demographic data from Saskatchewan, Canada. The first simulation used the SRA that processed on each postal code subregion subsequently. The second simulation processed the entire population simultaneously. It was concluded that the parallelizable SRA showed computational time saving with comparable results in a province-wide simulation. Using the same method, SRA can be generalized for performing a country-wide simulation. Thus, this parallel algorithm enables the possibility of using ABM for large-scale simulation with limited computational resources.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets.
Bicer, Tekin; Gürsoy, Doğa; Andrade, Vincent De; Kettimuthu, Rajkumar; Scullin, William; Carlo, Francesco De; Foster, Ian T
2017-01-01
Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis. We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source. Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration. The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.
Constructing Neuronal Network Models in Massively Parallel Environments.
Ippen, Tammo; Eppler, Jochen M; Plesser, Hans E; Diesmann, Markus
2017-01-01
Recent advances in the development of data structures to represent spiking neuron network models enable us to exploit the complete memory of petascale computers for a single brain-scale network simulation. In this work, we investigate how well we can exploit the computing power of such supercomputers for the creation of neuronal networks. Using an established benchmark, we divide the runtime of simulation code into the phase of network construction and the phase during which the dynamical state is advanced in time. We find that on multi-core compute nodes network creation scales well with process-parallel code but exhibits a prohibitively large memory consumption. Thread-parallel network creation, in contrast, exhibits speedup only up to a small number of threads but has little overhead in terms of memory. We further observe that the algorithms creating instances of model neurons and their connections scale well for networks of ten thousand neurons, but do not show the same speedup for networks of millions of neurons. Our work uncovers that the lack of scaling of thread-parallel network creation is due to inadequate memory allocation strategies and demonstrates that thread-optimized memory allocators recover excellent scaling. An analysis of the loop order used for network construction reveals that more complex tests on the locality of operations significantly improve scaling and reduce runtime by allowing construction algorithms to step through large networks more efficiently than in existing code. The combination of these techniques increases performance by an order of magnitude and harnesses the increasingly parallel compute power of the compute nodes in high-performance clusters and supercomputers.
Constructing Neuronal Network Models in Massively Parallel Environments
Ippen, Tammo; Eppler, Jochen M.; Plesser, Hans E.; Diesmann, Markus
2017-01-01
Recent advances in the development of data structures to represent spiking neuron network models enable us to exploit the complete memory of petascale computers for a single brain-scale network simulation. In this work, we investigate how well we can exploit the computing power of such supercomputers for the creation of neuronal networks. Using an established benchmark, we divide the runtime of simulation code into the phase of network construction and the phase during which the dynamical state is advanced in time. We find that on multi-core compute nodes network creation scales well with process-parallel code but exhibits a prohibitively large memory consumption. Thread-parallel network creation, in contrast, exhibits speedup only up to a small number of threads but has little overhead in terms of memory. We further observe that the algorithms creating instances of model neurons and their connections scale well for networks of ten thousand neurons, but do not show the same speedup for networks of millions of neurons. Our work uncovers that the lack of scaling of thread-parallel network creation is due to inadequate memory allocation strategies and demonstrates that thread-optimized memory allocators recover excellent scaling. An analysis of the loop order used for network construction reveals that more complex tests on the locality of operations significantly improve scaling and reduce runtime by allowing construction algorithms to step through large networks more efficiently than in existing code. The combination of these techniques increases performance by an order of magnitude and harnesses the increasingly parallel compute power of the compute nodes in high-performance clusters and supercomputers. PMID:28559808
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems
Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.
2014-01-01
The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets
Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De; ...
2017-01-28
Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition technique at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Tracemore » provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.« less
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De
Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition technique at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Tracemore » provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.« less
NASA Astrophysics Data System (ADS)
Hartmann, Alfred; Redfield, Steve
1989-04-01
This paper discusses design of large-scale (1000x 1000) optical crossbar switching networks for use in parallel processing supercom-puters. Alternative design sketches for an optical crossbar switching network are presented using free-space optical transmission with either a beam spreading/masking model or a beam steering model for internodal communications. The performances of alternative multiple access channel communications protocol-unslotted and slotted ALOHA and carrier sense multiple access (CSMA)-are compared with the performance of the classic arbitrated bus crossbar of conventional electronic parallel computing. These comparisons indicate an almost inverse relationship between ease of implementation and speed of operation. Practical issues of optical system design are addressed, and an optically addressed, composite spatial light modulator design is presented for fabrication to arbitrarily large scale. The wide range of switch architecture, communications protocol, optical systems design, device fabrication, and system performance problems presented by these design sketches poses a serious challenge to practical exploitation of highly parallel optical interconnects in advanced computer designs.
The International Conference on Vector and Parallel Computing (2nd)
1989-01-17
Computation of the SVD of Bidiagonal Matrices" ...................................... 11 " Lattice QCD -As a Large Scale Scientific Computation...vectorizcd for the IBM 3090 Vector Facility. In addition, elapsed times " Lattice QCD -As a Large Scale Scientific have been reduced by using 3090...benchmarked Lattice QCD on a large number ofcompu- come from the wavefront solver routine. This was exten- ters: CrayX-MP and Cray 2 (vector
NASA Astrophysics Data System (ADS)
Georgiev, K.; Zlatev, Z.
2010-11-01
The Danish Eulerian Model (DEM) is an Eulerian model for studying the transport of air pollutants on large scale. Originally, the model was developed at the National Environmental Research Institute of Denmark. The model computational domain covers Europe and some neighbour parts belong to the Atlantic Ocean, Asia and Africa. If DEM model is to be applied by using fine grids, then its discretization leads to a huge computational problem. This implies that such a model as DEM must be run only on high-performance computer architectures. The implementation and tuning of such a complex large-scale model on each different computer is a non-trivial task. Here, some comparison results of running of this model on different kind of vector (CRAY C92A, Fujitsu, etc.), parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, Macintosh G4 clusters, etc.), parallel computers with shared memory (SGI Origin, SUN, etc.) and parallel computers with two levels of parallelism (IBM SMP, IBM BlueGene/P, clusters of multiprocessor nodes, etc.) will be presented. The main idea in the parallel version of DEM is domain partitioning approach. Discussions according to the effective use of the cache and hierarchical memories of the modern computers as well as the performance, speed-ups and efficiency achieved will be done. The parallel code of DEM, created by using MPI standard library, appears to be highly portable and shows good efficiency and scalability on different kind of vector and parallel computers. Some important applications of the computer model output are presented in short.
Streaming parallel GPU acceleration of large-scale filter-based spiking neural networks.
Slażyński, Leszek; Bohte, Sander
2012-01-01
The arrival of graphics processing (GPU) cards suitable for massively parallel computing promises affordable large-scale neural network simulation previously only available at supercomputing facilities. While the raw numbers suggest that GPUs may outperform CPUs by at least an order of magnitude, the challenge is to develop fine-grained parallel algorithms to fully exploit the particulars of GPUs. Computation in a neural network is inherently parallel and thus a natural match for GPU architectures: given inputs, the internal state for each neuron can be updated in parallel. We show that for filter-based spiking neurons, like the Spike Response Model, the additive nature of membrane potential dynamics enables additional update parallelism. This also reduces the accumulation of numerical errors when using single precision computation, the native precision of GPUs. We further show that optimizing simulation algorithms and data structures to the GPU's architecture has a large pay-off: for example, matching iterative neural updating to the memory architecture of the GPU speeds up this simulation step by a factor of three to five. With such optimizations, we can simulate in better-than-realtime plausible spiking neural networks of up to 50 000 neurons, processing over 35 million spiking events per second.
A transient FETI methodology for large-scale parallel implicit computations in structural mechanics
NASA Technical Reports Server (NTRS)
Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier
1992-01-01
Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.
Reversible Parallel Discrete-Event Execution of Large-scale Epidemic Outbreak Models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Perumalla, Kalyan S; Seal, Sudip K
2010-01-01
The spatial scale, runtime speed and behavioral detail of epidemic outbreak simulations together require the use of large-scale parallel processing. In this paper, an optimistic parallel discrete event execution of a reaction-diffusion simulation model of epidemic outbreaks is presented, with an implementation over themore » $$\\mu$$sik simulator. Rollback support is achieved with the development of a novel reversible model that combines reverse computation with a small amount of incremental state saving. Parallel speedup and other runtime performance metrics of the simulation are tested on a small (8,192-core) Blue Gene / P system, while scalability is demonstrated on 65,536 cores of a large Cray XT5 system. Scenarios representing large population sizes (up to several hundred million individuals in the largest case) are exercised.« less
Accelerating large-scale protein structure alignments with graphics processing units
2012-01-01
Background Large-scale protein structure alignment, an indispensable tool to structural bioinformatics, poses a tremendous challenge on computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. Findings We present ppsAlign, a parallel protein structure Alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign could take many concurrent methods, such as TM-align and Fr-TM-align, into the parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card, and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. Conclusions ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues from protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using massive parallel computing power of GPU. PMID:22357132
Implementation of highly parallel and large scale GW calculations within the OpenAtom software
NASA Astrophysics Data System (ADS)
Ismail-Beigi, Sohrab
The need to describe electronic excitations with better accuracy than provided by band structures produced by Density Functional Theory (DFT) has been a long-term enterprise for the computational condensed matter and materials theory communities. In some cases, appropriate theoretical frameworks have existed for some time but have been difficult to apply widely due to computational cost. For example, the GW approximation incorporates a great deal of important non-local and dynamical electronic interaction effects but has been too computationally expensive for routine use in large materials simulations. OpenAtom is an open source massively parallel ab initiodensity functional software package based on plane waves and pseudopotentials (http://charm.cs.uiuc.edu/OpenAtom/) that takes advantage of the Charm + + parallel framework. At present, it is developed via a three-way collaboration, funded by an NSF SI2-SSI grant (ACI-1339804), between Yale (Ismail-Beigi), IBM T. J. Watson (Glenn Martyna) and the University of Illinois at Urbana Champaign (Laxmikant Kale). We will describe the project and our current approach towards implementing large scale GW calculations with OpenAtom. Potential applications of large scale parallel GW software for problems involving electronic excitations in semiconductor and/or metal oxide systems will be also be pointed out.
NASA Astrophysics Data System (ADS)
Yan, Hui; Wang, K. G.; Jones, Jim E.
2016-06-01
A parallel algorithm for large-scale three-dimensional phase-field simulations of phase coarsening is developed and implemented on high-performance architectures. From the large-scale simulations, a new kinetics in phase coarsening in the region of ultrahigh volume fraction is found. The parallel implementation is capable of harnessing the greater computer power available from high-performance architectures. The parallelized code enables increase in three-dimensional simulation system size up to a 5123 grid cube. Through the parallelized code, practical runtime can be achieved for three-dimensional large-scale simulations, and the statistical significance of the results from these high resolution parallel simulations are greatly improved over those obtainable from serial simulations. A detailed performance analysis on speed-up and scalability is presented, showing good scalability which improves with increasing problem size. In addition, a model for prediction of runtime is developed, which shows a good agreement with actual run time from numerical tests.
Computational Issues in Damping Identification for Large Scale Problems
NASA Technical Reports Server (NTRS)
Pilkey, Deborah L.; Roe, Kevin P.; Inman, Daniel J.
1997-01-01
Two damping identification methods are tested for efficiency in large-scale applications. One is an iterative routine, and the other a least squares method. Numerical simulations have been performed on multiple degree-of-freedom models to test the effectiveness of the algorithm and the usefulness of parallel computation for the problems. High Performance Fortran is used to parallelize the algorithm. Tests were performed using the IBM-SP2 at NASA Ames Research Center. The least squares method tested incurs high communication costs, which reduces the benefit of high performance computing. This method's memory requirement grows at a very rapid rate meaning that larger problems can quickly exceed available computer memory. The iterative method's memory requirement grows at a much slower pace and is able to handle problems with 500+ degrees of freedom on a single processor. This method benefits from parallelization, and significant speedup can he seen for problems of 100+ degrees-of-freedom.
Scalable parallel distance field construction for large-scale applications
Yu, Hongfeng; Xie, Jinrong; Ma, Kwan -Liu; ...
2015-10-01
Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. Anew distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking overtime, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate itsmore » efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. In conclusion, our work greatly extends the usability of distance fields for demanding applications.« less
Scalable Parallel Distance Field Construction for Large-Scale Applications.
Yu, Hongfeng; Xie, Jinrong; Ma, Kwan-Liu; Kolla, Hemanth; Chen, Jacqueline H
2015-10-01
Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. Our work greatly extends the usability of distance fields for demanding applications.
Parallel computing for probabilistic fatigue analysis
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.
1993-01-01
This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism to keep large number of processors busy and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.
NASA Astrophysics Data System (ADS)
Shi, X.
2015-12-01
As NSF indicated - "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is the third pillar of GIScience and geosciences. With the exponential growth of geodata, the challenge of scalable and high performance computing for big data analytics become urgent because many research activities are constrained by the inability of software or tool that even could not complete the computation process. Heterogeneous geodata integration and analytics obviously magnify the complexity and operational time frame. Many large-scale geospatial problems may be not processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) Architecture and Graphics Processing Unit (GPU), and advanced computing technologies provide promising solutions to employ massive parallelism and hardware resources to achieve scalability and high performance for data intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environment to achieve the capability for scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high-performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise to achieve scalability and high performance by exploiting task and data levels of parallelism that are not supported by the conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data as proved by our prior works, while the potential of such advanced infrastructure remains unexplored in this domain. Within this presentation, our prior and on-going initiatives will be summarized to exemplify how we exploit multicore CPUs, GPUs, and MICs, and clusters of CPUs, GPUs and MICs, to accelerate geocomputation in different applications.
Suplatov, Dmitry; Popova, Nina; Zhumatiy, Sergey; Voevodin, Vladimir; Švedas, Vytas
2016-04-01
Rapid expansion of online resources providing access to genomic, structural, and functional information associated with biological macromolecules opens an opportunity to gain a deeper understanding of the mechanisms of biological processes due to systematic analysis of large datasets. This, however, requires novel strategies to optimally utilize computer processing power. Some methods in bioinformatics and molecular modeling require extensive computational resources. Other algorithms have fast implementations which take at most several hours to analyze a common input on a modern desktop station, however, due to multiple invocations for a large number of subtasks the full task requires a significant computing power. Therefore, an efficient computational solution to large-scale biological problems requires both a wise parallel implementation of resource-hungry methods as well as a smart workflow to manage multiple invocations of relatively fast algorithms. In this work, a new computer software mpiWrapper has been developed to accommodate non-parallel implementations of scientific algorithms within the parallel supercomputing environment. The Message Passing Interface has been implemented to exchange information between nodes. Two specialized threads - one for task management and communication, and another for subtask execution - are invoked on each processing unit to avoid deadlock while using blocking calls to MPI. The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure. We show that this approach can be used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer. The C++ source code and documentation are available from http://biokinet.belozersky.msu.ru/mpiWrapper .
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of manymore » computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chang, Justin; Karra, Satish; Nakshatrala, Kalyana B.
It is well-known that the standard Galerkin formulation, which is often the formulation of choice under the finite element method for solving self-adjoint diffusion equations, does not meet maximum principles and the non-negative constraint for anisotropic diffusion equations. Recently, optimization-based methodologies that satisfy maximum principles and the non-negative constraint for steady-state and transient diffusion-type equations have been proposed. To date, these methodologies have been tested only on small-scale academic problems. The purpose of this paper is to systematically study the performance of the non-negative methodology in the context of high performance computing (HPC). PETSc and TAO libraries are, respectively, usedmore » for the parallel environment and optimization solvers. For large-scale problems, it is important for computational scientists to understand the computational performance of current algorithms available in these scientific libraries. The numerical experiments are conducted on the state-of-the-art HPC systems, and a single-core performance model is used to better characterize the efficiency of the solvers. Furthermore, our studies indicate that the proposed non-negative computational framework for diffusion-type equations exhibits excellent strong scaling for real-world large-scale problems.« less
Chang, Justin; Karra, Satish; Nakshatrala, Kalyana B.
2016-07-26
It is well-known that the standard Galerkin formulation, which is often the formulation of choice under the finite element method for solving self-adjoint diffusion equations, does not meet maximum principles and the non-negative constraint for anisotropic diffusion equations. Recently, optimization-based methodologies that satisfy maximum principles and the non-negative constraint for steady-state and transient diffusion-type equations have been proposed. To date, these methodologies have been tested only on small-scale academic problems. The purpose of this paper is to systematically study the performance of the non-negative methodology in the context of high performance computing (HPC). PETSc and TAO libraries are, respectively, usedmore » for the parallel environment and optimization solvers. For large-scale problems, it is important for computational scientists to understand the computational performance of current algorithms available in these scientific libraries. The numerical experiments are conducted on the state-of-the-art HPC systems, and a single-core performance model is used to better characterize the efficiency of the solvers. Furthermore, our studies indicate that the proposed non-negative computational framework for diffusion-type equations exhibits excellent strong scaling for real-world large-scale problems.« less
Approximate kernel competitive learning.
Wu, Jian-Sheng; Zheng, Wei-Shi; Lai, Jian-Huang
2015-03-01
Kernel competitive learning has been successfully used to achieve robust clustering. However, kernel competitive learning (KCL) is not scalable for large scale data processing, because (1) it has to calculate and store the full kernel matrix that is too large to be calculated and kept in the memory and (2) it cannot be computed in parallel. In this paper we develop a framework of approximate kernel competitive learning for processing large scale dataset. The proposed framework consists of two parts. First, it derives an approximate kernel competitive learning (AKCL), which learns kernel competitive learning in a subspace via sampling. We provide solid theoretical analysis on why the proposed approximation modelling would work for kernel competitive learning, and furthermore, we show that the computational complexity of AKCL is largely reduced. Second, we propose a pseudo-parallelled approximate kernel competitive learning (PAKCL) based on a set-based kernel competitive learning strategy, which overcomes the obstacle of using parallel programming in kernel competitive learning and significantly accelerates the approximate kernel competitive learning for large scale clustering. The empirical evaluation on publicly available datasets shows that the proposed AKCL and PAKCL can perform comparably as KCL, with a large reduction on computational cost. Also, the proposed methods achieve more effective clustering performance in terms of clustering precision against related approximate clustering approaches. Copyright © 2014 Elsevier Ltd. All rights reserved.
Optimistic barrier synchronization
NASA Technical Reports Server (NTRS)
Nicol, David M.
1992-01-01
Barrier synchronization is fundamental operation in parallel computation. In many contexts, at the point a processor enters a barrier it knows that it has already processed all the work required of it prior to synchronization. The alternative case, when a processor cannot enter a barrier with the assurance that it has already performed all the necessary pre-synchronization computation, is treated. The problem arises when the number of pre-sychronization messages to be received by a processor is unkown, for example, in a parallel discrete simulation or any other computation that is largely driven by an unpredictable exchange of messages. We describe an optimistic O(log sup 2 P) barrier algorithm for such problems, study its performance on a large-scale parallel system, and consider extensions to general associative reductions as well as associative parallel prefix computations.
Eigensolver for a Sparse, Large Hermitian Matrix
NASA Technical Reports Server (NTRS)
Tisdale, E. Robert; Oyafuso, Fabiano; Klimeck, Gerhard; Brown, R. Chris
2003-01-01
A parallel-processing computer program finds a few eigenvalues in a sparse Hermitian matrix that contains as many as 100 million diagonal elements. This program finds the eigenvalues faster, using less memory, than do other, comparable eigensolver programs. This program implements a Lanczos algorithm in the American National Standards Institute/ International Organization for Standardization (ANSI/ISO) C computing language, using the Message Passing Interface (MPI) standard to complement an eigensolver in PARPACK. [PARPACK (Parallel Arnoldi Package) is an extension, to parallel-processing computer architectures, of ARPACK (Arnoldi Package), which is a collection of Fortran 77 subroutines that solve large-scale eigenvalue problems.] The eigensolver runs on Beowulf clusters of computers at the Jet Propulsion Laboratory (JPL).
A scalable PC-based parallel computer for lattice QCD
NASA Astrophysics Data System (ADS)
Fodor, Z.; Katz, S. D.; Pappa, G.
2003-05-01
A PC-based parallel computer for medium/large scale lattice QCD simulations is suggested. The Eo¨tvo¨s Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes. Gigabit Ethernet cards are used for nearest neighbor communication in a two-dimensional mesh. The sustained performance for dynamical staggered (wilson) quarks on large lattices is around 70(110) GFlops. The exceptional price/performance ratio is below $1/Mflop.
MapReduce Based Parallel Neural Networks in Enabling Large Scale Machine Learning
Yang, Jie; Huang, Yuan; Xu, Lixiong; Li, Siguang; Qi, Man
2015-01-01
Artificial neural networks (ANNs) have been widely used in pattern recognition and classification applications. However, ANNs are notably slow in computation especially when the size of data is large. Nowadays, big data has received a momentum from both industry and academia. To fulfill the potentials of ANNs for big data applications, the computation process must be speeded up. For this purpose, this paper parallelizes neural networks based on MapReduce, which has become a major computing model to facilitate data intensive applications. Three data intensive scenarios are considered in the parallelization process in terms of the volume of classification data, the size of the training data, and the number of neurons in the neural network. The performance of the parallelized neural networks is evaluated in an experimental MapReduce computer cluster from the aspects of accuracy in classification and efficiency in computation. PMID:26681933
MapReduce Based Parallel Neural Networks in Enabling Large Scale Machine Learning.
Liu, Yang; Yang, Jie; Huang, Yuan; Xu, Lixiong; Li, Siguang; Qi, Man
2015-01-01
Artificial neural networks (ANNs) have been widely used in pattern recognition and classification applications. However, ANNs are notably slow in computation especially when the size of data is large. Nowadays, big data has received a momentum from both industry and academia. To fulfill the potentials of ANNs for big data applications, the computation process must be speeded up. For this purpose, this paper parallelizes neural networks based on MapReduce, which has become a major computing model to facilitate data intensive applications. Three data intensive scenarios are considered in the parallelization process in terms of the volume of classification data, the size of the training data, and the number of neurons in the neural network. The performance of the parallelized neural networks is evaluated in an experimental MapReduce computer cluster from the aspects of accuracy in classification and efficiency in computation.
A new parallel-vector finite element analysis software on distributed-memory computers
NASA Technical Reports Server (NTRS)
Qin, Jiangning; Nguyen, Duc T.
1993-01-01
A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.
NASA Technical Reports Server (NTRS)
Bailey, David (Editor); Barton, John (Editor); Lasinski, Thomas (Editor); Simon, Horst (Editor)
1993-01-01
A new set of benchmarks was developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of a set of kernels, the 'Parallel Kernels,' and a simulated application benchmark. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification - all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.
Parallel Computing for Probabilistic Response Analysis of High Temperature Composites
NASA Technical Reports Server (NTRS)
Sues, R. H.; Lua, Y. J.; Smith, M. D.
1994-01-01
The objective of this Phase I research was to establish the required software and hardware strategies to achieve large scale parallelism in solving PCM problems. To meet this objective, several investigations were conducted. First, we identified the multiple levels of parallelism in PCM and the computational strategies to exploit these parallelisms. Next, several software and hardware efficiency investigations were conducted. These involved the use of three different parallel programming paradigms and solution of two example problems on both a shared-memory multiprocessor and a distributed-memory network of workstations.
The EMCC / DARPA Massively Parallel Electromagnetic Scattering Project
NASA Technical Reports Server (NTRS)
Woo, Alex C.; Hill, Kueichien C.
1996-01-01
The Electromagnetic Code Consortium (EMCC) was sponsored by the Advanced Research Program Agency (ARPA) to demonstrate the effectiveness of massively parallel computing in large scale radar signature predictions. The EMCC/ARPA project consisted of three parts.
Efficiency of parallel direct optimization
NASA Technical Reports Server (NTRS)
Janies, D. A.; Wheeler, W. C.
2001-01-01
Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. c2001 The Willi Hennig Society.
Improving parallel I/O autotuning with performance modeling
Behzad, Babak; Byna, Surendra; Wild, Stefan M.; ...
2014-01-01
Various layers of the parallel I/O subsystem offer tunable parameters for improving I/O performance on large-scale computers. However, searching through a large parameter space is challenging. We are working towards an autotuning framework for determining the parallel I/O parameters that can achieve good I/O performance for different data write patterns. In this paper, we characterize parallel I/O and discuss the development of predictive models for use in effectively reducing the parameter space. Furthermore, applying our technique on tuning an I/O kernel derived from a large-scale simulation code shows that the search time can be reduced from 12 hours to 2more » hours, while achieving 54X I/O performance speedup.« less
Conjugate-Gradient Algorithms For Dynamics Of Manipulators
NASA Technical Reports Server (NTRS)
Fijany, Amir; Scheid, Robert E.
1993-01-01
Algorithms for serial and parallel computation of forward dynamics of multiple-link robotic manipulators by conjugate-gradient method developed. Parallel algorithms have potential for speedup of computations on multiple linked, specialized processors implemented in very-large-scale integrated circuits. Such processors used to stimulate dynamics, possibly faster than in real time, for purposes of planning and control.
Parallel block schemes for large scale least squares computations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Golub, G.H.; Plemmons, R.J.; Sameh, A.
1986-04-01
Large scale least squares computations arise in a variety of scientific and engineering problems, including geodetic adjustments and surveys, medical image analysis, molecular structures, partial differential equations and substructuring methods in structural engineering. In each of these problems, matrices often arise which possess a block structure which reflects the local connection nature of the underlying physical problem. For example, such super-large nonlinear least squares computations arise in geodesy. Here the coordinates of positions are calculated by iteratively solving overdetermined systems of nonlinear equations by the Gauss-Newton method. The US National Geodetic Survey will complete this year (1986) the readjustment ofmore » the North American Datum, a problem which involves over 540 thousand unknowns and over 6.5 million observations (equations). The observation matrix for these least squares computations has a block angular form with 161 diagnonal blocks, each containing 3 to 4 thousand unknowns. In this paper parallel schemes are suggested for the orthogonal factorization of matrices in block angular form and for the associated backsubstitution phase of the least squares computations. In addition, a parallel scheme for the calculation of certain elements of the covariance matrix for such problems is described. It is shown that these algorithms are ideally suited for multiprocessors with three levels of parallelism such as the Cedar system at the University of Illinois. 20 refs., 7 figs.« less
Grid-Enabled Quantitative Analysis of Breast Cancer
2009-10-01
large-scale, multi-modality computerized image analysis . The central hypothesis of this research is that large-scale image analysis for breast cancer...pilot study to utilize large scale parallel Grid computing to harness the nationwide cluster infrastructure for optimization of medical image ... analysis parameters. Additionally, we investigated the use of cutting edge dataanalysis/ mining techniques as applied to Ultrasound, FFDM, and DCE-MRI Breast
A scalable parallel black oil simulator on distributed memory parallel computers
NASA Astrophysics Data System (ADS)
Wang, Kun; Liu, Hui; Chen, Zhangxin
2015-11-01
This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Wan, Shixiang; Zou, Quan
2017-01-01
Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. Extreme increase in next-generation sequencing results in shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses. Based on HAlign and Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. The experiments in the DNA and protein large scale data sets, which are more than 1GB files, showed that HAlign II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resource. THAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign.
Programming Probabilistic Structural Analysis for Parallel Processing Computer
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Chamis, Christos C.; Murthy, Pappu L. N.
1991-01-01
The ultimate goal of this research program is to make Probabilistic Structural Analysis (PSA) computationally efficient and hence practical for the design environment by achieving large scale parallelism. The paper identifies the multiple levels of parallelism in PSA, identifies methodologies for exploiting this parallelism, describes the development of a parallel stochastic finite element code, and presents results of two example applications. It is demonstrated that speeds within five percent of those theoretically possible can be achieved. A special-purpose numerical technique, the stochastic preconditioned conjugate gradient method, is also presented and demonstrated to be extremely efficient for certain classes of PSA problems.
NASA Technical Reports Server (NTRS)
Bailey, D. H.; Barszcz, E.; Barton, J. T.; Carter, R. L.; Lasinski, T. A.; Browning, D. S.; Dagum, L.; Fatoohi, R. A.; Frederickson, P. O.; Schreiber, R. S.
1991-01-01
A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers in the framework of the NASA Ames Numerical Aerodynamic Simulation (NAS) Program. These consist of five 'parallel kernel' benchmarks and three 'simulated application' benchmarks. Together they mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification-all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.
GLAD: a system for developing and deploying large-scale bioinformatics grid.
Teo, Yong-Meng; Wang, Xianbing; Ng, Yew-Kwong
2005-03-01
Grid computing is used to solve large-scale bioinformatics problems with gigabytes database by distributing the computation across multiple platforms. Until now in developing bioinformatics grid applications, it is extremely tedious to design and implement the component algorithms and parallelization techniques for different classes of problems, and to access remotely located sequence database files of varying formats across the grid. In this study, we propose a grid programming toolkit, GLAD (Grid Life sciences Applications Developer), which facilitates the development and deployment of bioinformatics applications on a grid. GLAD has been developed using ALiCE (Adaptive scaLable Internet-based Computing Engine), a Java-based grid middleware, which exploits the task-based parallelism. Two bioinformatics benchmark applications, such as distributed sequence comparison and distributed progressive multiple sequence alignment, have been developed using GLAD.
DGDFT: A massively parallel method for large scale density functional theory calculations.
Hu, Wei; Lin, Lin; Yang, Chao
2015-09-28
We describe a massively parallel implementation of the recently developed discontinuous Galerkin density functional theory (DGDFT) method, for efficient large-scale Kohn-Sham DFT based electronic structure calculations. The DGDFT method uses adaptive local basis (ALB) functions generated on-the-fly during the self-consistent field iteration to represent the solution to the Kohn-Sham equations. The use of the ALB set provides a systematic way to improve the accuracy of the approximation. By using the pole expansion and selected inversion technique to compute electron density, energy, and atomic forces, we can make the computational complexity of DGDFT scale at most quadratically with respect to the number of electrons for both insulating and metallic systems. We show that for the two-dimensional (2D) phosphorene systems studied here, using 37 basis functions per atom allows us to reach an accuracy level of 1.3 × 10(-4) Hartree/atom in terms of the error of energy and 6.2 × 10(-4) Hartree/bohr in terms of the error of atomic force, respectively. DGDFT can achieve 80% parallel efficiency on 128,000 high performance computing cores when it is used to study the electronic structure of 2D phosphorene systems with 3500-14 000 atoms. This high parallel efficiency results from a two-level parallelization scheme that we will describe in detail.
Parallel-vector out-of-core equation solver for computational mechanics
NASA Technical Reports Server (NTRS)
Qin, J.; Agarwal, T. K.; Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.
1993-01-01
A parallel/vector out-of-core equation solver is developed for shared-memory computers, such as the Cray Y-MP machine. The input/ output (I/O) time is reduced by using the a synchronous BUFFER IN and BUFFER OUT, which can be executed simultaneously with the CPU instructions. The parallel and vector capability provided by the supercomputers is also exploited to enhance the performance. Numerical applications in large-scale structural analysis are given to demonstrate the efficiency of the present out-of-core solver.
NASA Astrophysics Data System (ADS)
Ukawa, Akira
1998-05-01
The CP-PACS computer is a massively parallel computer consisting of 2048 processing units and having a peak speed of 614 GFLOPS and 128 GByte of main memory. It was developed over the four years from 1992 to 1996 at the Center for Computational Physics, University of Tsukuba, for large-scale numerical simulations in computational physics, especially those of lattice QCD. The CP-PACS computer has been in full operation for physics computations since October 1996. In this article we describe the chronology of the development, the hardware and software characteristics of the computer, and its performance for lattice QCD simulations.
Parallel-vector solution of large-scale structural analysis problems on supercomputers
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.; Nguyen, Duc T.; Agarwal, Tarun K.
1989-01-01
A direct linear equation solution method based on the Choleski factorization procedure is presented which exploits both parallel and vector features of supercomputers. The new equation solver is described, and its performance is evaluated by solving structural analysis problems on three high-performance computers. The method has been implemented using Force, a generic parallel FORTRAN language.
NASA Astrophysics Data System (ADS)
Liu, Jiping; Kang, Xiaochen; Dong, Chun; Xu, Shenghua
2017-12-01
Surface area estimation is a widely used tool for resource evaluation in the physical world. When processing large scale spatial data, the input/output (I/O) can easily become the bottleneck in parallelizing the algorithm due to the limited physical memory resources and the very slow disk transfer rate. In this paper, we proposed a stream tilling approach to surface area estimation that first decomposed a spatial data set into tiles with topological expansions. With these tiles, the one-to-one mapping relationship between the input and the computing process was broken. Then, we realized a streaming framework towards the scheduling of the I/O processes and computing units. Herein, each computing unit encapsulated a same copy of the estimation algorithm, and multiple asynchronous computing units could work individually in parallel. Finally, the performed experiment demonstrated that our stream tilling estimation can efficiently alleviate the heavy pressures from the I/O-bound work, and the measured speedup after being optimized have greatly outperformed the directly parallel versions in shared memory systems with multi-core processors.
A distributed parallel storage architecture and its potential application within EOSDIS
NASA Technical Reports Server (NTRS)
Johnston, William E.; Tierney, Brian; Feuquay, Jay; Butzer, Tony
1994-01-01
We describe the architecture, implementation, use of a scalable, high performance, distributed-parallel data storage system developed in the ARPA funded MAGIC gigabit testbed. A collection of wide area distributed disk servers operate in parallel to provide logical block level access to large data sets. Operated primarily as a network-based cache, the architecture supports cooperation among independently owned resources to provide fast, large-scale, on-demand storage to support data handling, simulation, and computation.
Strategies for Large Scale Implementation of a Multiscale, Multiprocess Integrated Hydrologic Model
NASA Astrophysics Data System (ADS)
Kumar, M.; Duffy, C.
2006-05-01
Distributed models simulate hydrologic state variables in space and time while taking into account the heterogeneities in terrain, surface, subsurface properties and meteorological forcings. Computational cost and complexity associated with these model increases with its tendency to accurately simulate the large number of interacting physical processes at fine spatio-temporal resolution in a large basin. A hydrologic model run on a coarse spatial discretization of the watershed with limited number of physical processes needs lesser computational load. But this negatively affects the accuracy of model results and restricts physical realization of the problem. So it is imperative to have an integrated modeling strategy (a) which can be universally applied at various scales in order to study the tradeoffs between computational complexity (determined by spatio- temporal resolution), accuracy and predictive uncertainty in relation to various approximations of physical processes (b) which can be applied at adaptively different spatial scales in the same domain by taking into account the local heterogeneity of topography and hydrogeologic variables c) which is flexible enough to incorporate different number and approximation of process equations depending on model purpose and computational constraint. An efficient implementation of this strategy becomes all the more important for Great Salt Lake river basin which is relatively large (~89000 sq. km) and complex in terms of hydrologic and geomorphic conditions. Also the types and the time scales of hydrologic processes which are dominant in different parts of basin are different. Part of snow melt runoff generated in the Uinta Mountains infiltrates and contributes as base flow to the Great Salt Lake over a time scale of decades to centuries. The adaptive strategy helps capture the steep topographic and climatic gradient along the Wasatch front. Here we present the aforesaid modeling strategy along with an associated hydrologic modeling framework which facilitates a seamless, computationally efficient and accurate integration of the process model with the data model. The flexibility of this framework leads to implementation of multiscale, multiresolution, adaptive refinement/de-refinement and nested modeling simulations with least computational burden. However, performing these simulations and related calibration of these models over a large basin at higher spatio- temporal resolutions is computationally intensive and requires use of increasing computing power. With the advent of parallel processing architectures, high computing performance can be achieved by parallelization of existing serial integrated-hydrologic-model code. This translates to running the same model simulation on a network of large number of processors thereby reducing the time needed to obtain solution. The paper also discusses the implementation of the integrated model on parallel processors. Also will be discussed the mapping of the problem on multi-processor environment, method to incorporate coupling between hydrologic processes using interprocessor communication models, model data structure and parallel numerical algorithms to obtain high performance.
Parallel Computation of the Regional Ocean Modeling System (ROMS)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, P; Song, Y T; Chao, Y
2005-04-05
The Regional Ocean Modeling System (ROMS) is a regional ocean general circulation modeling system solving the free surface, hydrostatic, primitive equations over varying topography. It is free software distributed world-wide for studying both complex coastal ocean problems and the basin-to-global scale ocean circulation. The original ROMS code could only be run on shared-memory systems. With the increasing need to simulate larger model domains with finer resolutions and on a variety of computer platforms, there is a need in the ocean-modeling community to have a ROMS code that can be run on any parallel computer ranging from 10 to hundreds ofmore » processors. Recently, we have explored parallelization for ROMS using the MPI programming model. In this paper, an efficient parallelization strategy for such a large-scale scientific software package, based on an existing shared-memory computing model, is presented. In addition, scientific applications and data-performance issues on a couple of SGI systems, including Columbia, the world's third-fastest supercomputer, are discussed.« less
Parallel algorithm for multiscale atomistic/continuum simulations using LAMMPS
NASA Astrophysics Data System (ADS)
Pavia, F.; Curtin, W. A.
2015-07-01
Deformation and fracture processes in engineering materials often require simultaneous descriptions over a range of length and time scales, with each scale using a different computational technique. Here we present a high-performance parallel 3D computing framework for executing large multiscale studies that couple an atomic domain, modeled using molecular dynamics and a continuum domain, modeled using explicit finite elements. We use the robust Coupled Atomistic/Discrete-Dislocation (CADD) displacement-coupling method, but without the transfer of dislocations between atoms and continuum. The main purpose of the work is to provide a multiscale implementation within an existing large-scale parallel molecular dynamics code (LAMMPS) that enables use of all the tools associated with this popular open-source code, while extending CADD-type coupling to 3D. Validation of the implementation includes the demonstration of (i) stability in finite-temperature dynamics using Langevin dynamics, (ii) elimination of wave reflections due to large dynamic events occurring in the MD region and (iii) the absence of spurious forces acting on dislocations due to the MD/FE coupling, for dislocations further than 10 Å from the coupling boundary. A first non-trivial example application of dislocation glide and bowing around obstacles is shown, for dislocation lengths of ∼50 nm using fewer than 1 000 000 atoms but reproducing results of extremely large atomistic simulations at much lower computational cost.
Using Agent Base Models to Optimize Large Scale Network for Large System Inventories
NASA Technical Reports Server (NTRS)
Shameldin, Ramez Ahmed; Bowling, Shannon R.
2010-01-01
The aim of this paper is to use Agent Base Models (ABM) to optimize large scale network handling capabilities for large system inventories and to implement strategies for the purpose of reducing capital expenses. The models used in this paper either use computational algorithms or procedure implementations developed by Matlab to simulate agent based models in a principal programming language and mathematical theory using clusters, these clusters work as a high performance computational performance to run the program in parallel computational. In both cases, a model is defined as compilation of a set of structures and processes assumed to underlie the behavior of a network system.
Regional-scale calculation of the LS factor using parallel processing
NASA Astrophysics Data System (ADS)
Liu, Kai; Tang, Guoan; Jiang, Ling; Zhu, A.-Xing; Yang, Jianyi; Song, Xiaodong
2015-05-01
With the increase of data resolution and the increasing application of USLE over large areas, the existing serial implementation of algorithms for computing the LS factor is becoming a bottleneck. In this paper, a parallel processing model based on message passing interface (MPI) is presented for the calculation of the LS factor, so that massive datasets at a regional scale can be processed efficiently. The parallel model contains algorithms for calculating flow direction, flow accumulation, drainage network, slope, slope length and the LS factor. According to the existence of data dependence, the algorithms are divided into local algorithms and global algorithms. Parallel strategy are designed according to the algorithm characters including the decomposition method for maintaining the integrity of the results, optimized workflow for reducing the time taken for exporting the unnecessary intermediate data and a buffer-communication-computation strategy for improving the communication efficiency. Experiments on a multi-node system show that the proposed parallel model allows efficient calculation of the LS factor at a regional scale with a massive dataset.
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU
Xia, Yong; Zhang, Henggui
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations. PMID:26581957
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU.
Xia, Yong; Wang, Kuanquan; Zhang, Henggui
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations.
NASA Astrophysics Data System (ADS)
Jang, W.; Engda, T. A.; Neff, J. C.; Herrick, J.
2017-12-01
Many crop models are increasingly used to evaluate crop yields at regional and global scales. However, implementation of these models across large areas using fine-scale grids is limited by computational time requirements. In order to facilitate global gridded crop modeling with various scenarios (i.e., different crop, management schedule, fertilizer, and irrigation) using the Environmental Policy Integrated Climate (EPIC) model, we developed a distributed parallel computing framework in Python. Our local desktop with 14 cores (28 threads) was used to test the distributed parallel computing framework in Iringa, Tanzania which has 406,839 grid cells. High-resolution soil data, SoilGrids (250 x 250 m), and climate data, AgMERRA (0.25 x 0.25 deg) were also used as input data for the gridded EPIC model. The framework includes a master file for parallel computing, input database, input data formatters, EPIC model execution, and output analyzers. Through the master file for parallel computing, the user-defined number of threads of CPU divides the EPIC simulation into jobs. Then, Using EPIC input data formatters, the raw database is formatted for EPIC input data and the formatted data moves into EPIC simulation jobs. Then, 28 EPIC jobs run simultaneously and only interesting results files are parsed and moved into output analyzers. We applied various scenarios with seven different slopes and twenty-four fertilizer ranges. Parallelized input generators create different scenarios as a list for distributed parallel computing. After all simulations are completed, parallelized output analyzers are used to analyze all outputs according to the different scenarios. This saves significant computing time and resources, making it possible to conduct gridded modeling at regional to global scales with high-resolution data. For example, serial processing for the Iringa test case would require 113 hours, while using the framework developed in this study requires only approximately 6 hours, a nearly 95% reduction in computing time.
Parallel Tensor Compression for Large-Scale Scientific Data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kolda, Tamara G.; Ballard, Grey; Austin, Woody Nathan
As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data. By viewing the data as a dense five way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 10000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed memorymore » parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.« less
Large-scale parallel genome assembler over cloud computing environment.
Das, Arghya Kusum; Koppa, Praveen Kumar; Goswami, Sayan; Platania, Richard; Park, Seung-Jong
2017-06-01
The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of traditional HPC cluster.
The Parallel System for Integrating Impact Models and Sectors (pSIMS)
NASA Technical Reports Server (NTRS)
Elliott, Joshua; Kelly, David; Chryssanthacopoulos, James; Glotter, Michael; Jhunjhnuwala, Kanika; Best, Neil; Wilde, Michael; Foster, Ian
2014-01-01
We present a framework for massively parallel climate impact simulations: the parallel System for Integrating Impact Models and Sectors (pSIMS). This framework comprises a) tools for ingesting and converting large amounts of data to a versatile datatype based on a common geospatial grid; b) tools for translating this datatype into custom formats for site-based models; c) a scalable parallel framework for performing large ensemble simulations, using any one of a number of different impacts models, on clusters, supercomputers, distributed grids, or clouds; d) tools and data standards for reformatting outputs to common datatypes for analysis and visualization; and e) methodologies for aggregating these datatypes to arbitrary spatial scales such as administrative and environmental demarcations. By automating many time-consuming and error-prone aspects of large-scale climate impacts studies, pSIMS accelerates computational research, encourages model intercomparison, and enhances reproducibility of simulation results. We present the pSIMS design and use example assessments to demonstrate its multi-model, multi-scale, and multi-sector versatility.
Software Engineering for Scientific Computer Simulations
NASA Astrophysics Data System (ADS)
Post, Douglass E.; Henderson, Dale B.; Kendall, Richard P.; Whitney, Earl M.
2004-11-01
Computer simulation is becoming a very powerful tool for analyzing and predicting the performance of fusion experiments. Simulation efforts are evolving from including only a few effects to many effects, from small teams with a few people to large teams, and from workstations and small processor count parallel computers to massively parallel platforms. Successfully making this transition requires attention to software engineering issues. We report on the conclusions drawn from a number of case studies of large scale scientific computing projects within DOE, academia and the DoD. The major lessons learned include attention to sound project management including setting reasonable and achievable requirements, building a good code team, enforcing customer focus, carrying out verification and validation and selecting the optimum computational mathematics approaches.
Parallel Clustering Algorithm for Large-Scale Biological Data Sets
Wang, Minchao; Zhang, Wu; Ding, Wang; Dai, Dongbo; Zhang, Huiran; Xie, Hao; Chen, Luonan; Guo, Yike; Xie, Jiang
2014-01-01
Backgrounds Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Methods Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. Result A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. PMID:24705246
NASA Astrophysics Data System (ADS)
Iwasawa, Masaki; Tanikawa, Ataru; Hosono, Natsuki; Nitadori, Keigo; Muranushi, Takayuki; Makino, Junichiro
2016-08-01
We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application-development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each computing node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes-Hut tree algorithm or the Fast Multipole Method should be used in the case of long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are necessary for efficient parallel execution of particle-based simulations as "templates," which are independent of the actual data structure of particles and the functional form of the particle-particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N2) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum calculation time per timestep is in the range of 30 ms (N = 107) to 300 ms (N = 109). These are currently limited by the time for the calculation of the domain decomposition and communication necessary for the interaction calculation. We discuss how we can overcome these bottlenecks.
Application of high-performance computing to numerical simulation of human movement
NASA Technical Reports Server (NTRS)
Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.
1995-01-01
We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sreepathi, Sarat; Kumar, Jitendra; Mills, Richard T.
A proliferation of data from vast networks of remote sensing platforms (satellites, unmanned aircraft systems (UAS), airborne etc.), observational facilities (meteorological, eddy covariance etc.), state-of-the-art sensors, and simulation models offer unprecedented opportunities for scientific discovery. Unsupervised classification is a widely applied data mining approach to derive insights from such data. However, classification of very large data sets is a complex computational problem that requires efficient numerical algorithms and implementations on high performance computing (HPC) platforms. Additionally, increasing power, space, cooling and efficiency requirements has led to the deployment of hybrid supercomputing platforms with complex architectures and memory hierarchies like themore » Titan system at Oak Ridge National Laboratory. The advent of such accelerated computing architectures offers new challenges and opportunities for big data analytics in general and specifically, large scale cluster analysis in our case. Although there is an existing body of work on parallel cluster analysis, those approaches do not fully meet the needs imposed by the nature and size of our large data sets. Moreover, they had scaling limitations and were mostly limited to traditional distributed memory computing platforms. We present a parallel Multivariate Spatio-Temporal Clustering (MSTC) technique based on k-means cluster analysis that can target hybrid supercomputers like Titan. We developed a hybrid MPI, CUDA and OpenACC implementation that can utilize both CPU and GPU resources on computational nodes. We describe performance results on Titan that demonstrate the scalability and efficacy of our approach in processing large ecological data sets.« less
The TeraShake Computational Platform for Large-Scale Earthquake Simulations
NASA Astrophysics Data System (ADS)
Cui, Yifeng; Olsen, Kim; Chourasia, Amit; Moore, Reagan; Maechling, Philip; Jordan, Thomas
Geoscientific and computer science researchers with the Southern California Earthquake Center (SCEC) are conducting a large-scale, physics-based, computationally demanding earthquake system science research program with the goal of developing predictive models of earthquake processes. The computational demands of this program continue to increase rapidly as these researchers seek to perform physics-based numerical simulations of earthquake processes for larger meet the needs of this research program, a multiple-institution team coordinated by SCEC has integrated several scientific codes into a numerical modeling-based research tool we call the TeraShake computational platform (TSCP). A central component in the TSCP is a highly scalable earthquake wave propagation simulation program called the TeraShake anelastic wave propagation (TS-AWP) code. In this chapter, we describe how we extended an existing, stand-alone, wellvalidated, finite-difference, anelastic wave propagation modeling code into the highly scalable and widely used TS-AWP and then integrated this code into the TeraShake computational platform that provides end-to-end (initialization to analysis) research capabilities. We also describe the techniques used to enhance the TS-AWP parallel performance on TeraGrid supercomputers, as well as the TeraShake simulations phases including input preparation, run time, data archive management, and visualization. As a result of our efforts to improve its parallel efficiency, the TS-AWP has now shown highly efficient strong scaling on over 40K processors on IBM’s BlueGene/L Watson computer. In addition, the TSCP has developed into a computational system that is useful to many members of the SCEC community for performing large-scale earthquake simulations.
Massively parallel sparse matrix function calculations with NTPoly
NASA Astrophysics Data System (ADS)
Dawson, William; Nakajima, Takahito
2018-04-01
We present NTPoly, a massively parallel library for computing the functions of sparse, symmetric matrices. The theory of matrix functions is a well developed framework with a wide range of applications including differential equations, graph theory, and electronic structure calculations. One particularly important application area is diagonalization free methods in quantum chemistry. When the input and output of the matrix function are sparse, methods based on polynomial expansions can be used to compute matrix functions in linear time. We present a library based on these methods that can compute a variety of matrix functions. Distributed memory parallelization is based on a communication avoiding sparse matrix multiplication algorithm. OpenMP task parallellization is utilized to implement hybrid parallelization. We describe NTPoly's interface and show how it can be integrated with programs written in many different programming languages. We demonstrate the merits of NTPoly by performing large scale calculations on the K computer.
SNAVA-A real-time multi-FPGA multi-model spiking neural network simulation architecture.
Sripad, Athul; Sanchez, Giovanny; Zapata, Mireya; Pirrone, Vito; Dorta, Taho; Cambria, Salvatore; Marti, Albert; Krishnamourthy, Karthikeyan; Madrenas, Jordi
2018-01-01
Spiking Neural Networks (SNN) for Versatile Applications (SNAVA) simulation platform is a scalable and programmable parallel architecture that supports real-time, large-scale, multi-model SNN computation. This parallel architecture is implemented in modern Field-Programmable Gate Arrays (FPGAs) devices to provide high performance execution and flexibility to support large-scale SNN models. Flexibility is defined in terms of programmability, which allows easy synapse and neuron implementation. This has been achieved by using a special-purpose Processing Elements (PEs) for computing SNNs, and analyzing and customizing the instruction set according to the processing needs to achieve maximum performance with minimum resources. The parallel architecture is interfaced with customized Graphical User Interfaces (GUIs) to configure the SNN's connectivity, to compile the neuron-synapse model and to monitor SNN's activity. Our contribution intends to provide a tool that allows to prototype SNNs faster than on CPU/GPU architectures but significantly cheaper than fabricating a customized neuromorphic chip. This could be potentially valuable to the computational neuroscience and neuromorphic engineering communities. Copyright © 2017 Elsevier Ltd. All rights reserved.
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code
NASA Astrophysics Data System (ADS)
Simunovic, S.; Zacharia, T.; Baltas, N.; Spalding, D. B.
1995-03-01
PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using Message Passing Interface (MPI) standard. Implementation of MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. The Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Simunovic, S.; Zacharia, T.; Baltas, N.
1995-04-01
PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using Message Passing Interface (MPI) standard. Implementation of MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. Themore » Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.« less
Computer Science Techniques Applied to Parallel Atomistic Simulation
NASA Astrophysics Data System (ADS)
Nakano, Aiichiro
1998-03-01
Recent developments in parallel processing technology and multiresolution numerical algorithms have established large-scale molecular dynamics (MD) simulations as a new research mode for studying materials phenomena such as fracture. However, this requires large system sizes and long simulated times. We have developed: i) Space-time multiresolution schemes; ii) fuzzy-clustering approach to hierarchical dynamics; iii) wavelet-based adaptive curvilinear-coordinate load balancing; iv) multilevel preconditioned conjugate gradient method; and v) spacefilling-curve-based data compression for parallel I/O. Using these techniques, million-atom parallel MD simulations are performed for the oxidation dynamics of nanocrystalline Al. The simulations take into account the effect of dynamic charge transfer between Al and O using the electronegativity equalization scheme. The resulting long-range Coulomb interaction is calculated efficiently with the fast multipole method. Results for temperature and charge distributions, residual stresses, bond lengths and bond angles, and diffusivities of Al and O will be presented. The oxidation of nanocrystalline Al is elucidated through immersive visualization in virtual environments. A unique dual-degree education program at Louisiana State University will also be discussed in which students can obtain a Ph.D. in Physics & Astronomy and a M.S. from the Department of Computer Science in five years. This program fosters interdisciplinary research activities for interfacing High Performance Computing and Communications with large-scale atomistic simulations of advanced materials. This work was supported by NSF (CAREER Program), ARO, PRF, and Louisiana LEQSF.
Scaling predictive modeling in drug development with cloud computing.
Moghadam, Behrooz Torabi; Alvarsson, Jonathan; Holm, Marcus; Eklund, Martin; Carlsson, Lars; Spjuth, Ola
2015-01-26
Growing data sets with increased time for analysis is hampering predictive modeling in drug discovery. Model building can be carried out on high-performance computer clusters, but these can be expensive to purchase and maintain. We have evaluated ligand-based modeling on cloud computing resources where computations are parallelized and run on the Amazon Elastic Cloud. We trained models on open data sets of varying sizes for the end points logP and Ames mutagenicity and compare with model building parallelized on a traditional high-performance computing cluster. We show that while high-performance computing results in faster model building, the use of cloud computing resources is feasible for large data sets and scales well within cloud instances. An additional advantage of cloud computing is that the costs of predictive models can be easily quantified, and a choice can be made between speed and economy. The easy access to computational resources with no up-front investments makes cloud computing an attractive alternative for scientists, especially for those without access to a supercomputer, and our study shows that it enables cost-efficient modeling of large data sets on demand within reasonable time.
2004-10-01
MONITORING AGENCY NAME(S) AND ADDRESS(ES) Defense Advanced Research Projects Agency AFRL/IFTC 3701 North Fairfax Drive...Scalable Parallel Libraries for Large-Scale Concurrent Applications," Technical Report UCRL -JC-109251, Lawrence Livermore National Laboratory
Parallel Finite Element Domain Decomposition for Structural/Acoustic Analysis
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.; Tungkahotara, Siroj; Watson, Willie R.; Rajan, Subramaniam D.
2005-01-01
A domain decomposition (DD) formulation for solving sparse linear systems of equations resulting from finite element analysis is presented. The formulation incorporates mixed direct and iterative equation solving strategics and other novel algorithmic ideas that are optimized to take advantage of sparsity and exploit modern computer architecture, such as memory and parallel computing. The most time consuming part of the formulation is identified and the critical roles of direct sparse and iterative solvers within the framework of the formulation are discussed. Experiments on several computer platforms using several complex test matrices are conducted using software based on the formulation. Small-scale structural examples are used to validate thc steps in the formulation and large-scale (l,000,000+ unknowns) duct acoustic examples are used to evaluate the ORIGIN 2000 processors, and a duster of 6 PCs (running under the Windows environment). Statistics show that the formulation is efficient in both sequential and parallel computing environmental and that the formulation is significantly faster and consumes less memory than that based on one of the best available commercialized parallel sparse solvers.
Profiling and Improving I/O Performance of a Large-Scale Climate Scientific Application
NASA Technical Reports Server (NTRS)
Liu, Zhuo; Wang, Bin; Wang, Teng; Tian, Yuan; Xu, Cong; Wang, Yandong; Yu, Weikuan; Cruz, Carlos A.; Zhou, Shujia; Clune, Tom;
2013-01-01
Exascale computing systems are soon to emerge, which will pose great challenges on the huge gap between computing and I/O performance. Many large-scale scientific applications play an important role in our daily life. The huge amounts of data generated by such applications require highly parallel and efficient I/O management policies. In this paper, we adopt a mission-critical scientific application, GEOS-5, as a case to profile and analyze the communication and I/O issues that are preventing applications from fully utilizing the underlying parallel storage systems. Through in-detail architectural and experimental characterization, we observe that current legacy I/O schemes incur significant network communication overheads and are unable to fully parallelize the data access, thus degrading applications' I/O performance and scalability. To address these inefficiencies, we redesign its I/O framework along with a set of parallel I/O techniques to achieve high scalability and performance. Evaluation results on the NASA discover cluster show that our optimization of GEOS-5 with ADIOS has led to significant performance improvements compared to the original GEOS-5 implementation.
NASA Astrophysics Data System (ADS)
Childers, J. T.; Uram, T. D.; LeCompte, T. J.; Papka, M. E.; Benjamin, D. P.
2017-01-01
As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. This paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application and the performance that was achieved.
Parallelization of Nullspace Algorithm for the computation of metabolic pathways
Jevremović, Dimitrije; Trinh, Cong T.; Srienc, Friedrich; Sosa, Carlos P.; Boley, Daniel
2011-01-01
Elementary mode analysis is a useful metabolic pathway analysis tool in understanding and analyzing cellular metabolism, since elementary modes can represent metabolic pathways with unique and minimal sets of enzyme-catalyzed reactions of a metabolic network under steady state conditions. However, computation of the elementary modes of a genome- scale metabolic network with 100–1000 reactions is very expensive and sometimes not feasible with the commonly used serial Nullspace Algorithm. In this work, we develop a distributed memory parallelization of the Nullspace Algorithm to handle efficiently the computation of the elementary modes of a large metabolic network. We give an implementation in C++ language with the support of MPI library functions for the parallel communication. Our proposed algorithm is accompanied with an analysis of the complexity and identification of major bottlenecks during computation of all possible pathways of a large metabolic network. The algorithm includes methods to achieve load balancing among the compute-nodes and specific communication patterns to reduce the communication overhead and improve efficiency. PMID:22058581
Parallel Optimization of Polynomials for Large-scale Problems in Stability and Control
NASA Astrophysics Data System (ADS)
Kamyar, Reza
In this thesis, we focus on some of the NP-hard problems in control theory. Thanks to the converse Lyapunov theory, these problems can often be modeled as optimization over polynomials. To avoid the problem of intractability, we establish a trade off between accuracy and complexity. In particular, we develop a sequence of tractable optimization problems --- in the form of Linear Programs (LPs) and/or Semi-Definite Programs (SDPs) --- whose solutions converge to the exact solution of the NP-hard problem. However, the computational and memory complexity of these LPs and SDPs grow exponentially with the progress of the sequence - meaning that improving the accuracy of the solutions requires solving SDPs with tens of thousands of decision variables and constraints. Setting up and solving such problems is a significant challenge. The existing optimization algorithms and software are only designed to use desktop computers or small cluster computers --- machines which do not have sufficient memory for solving such large SDPs. Moreover, the speed-up of these algorithms does not scale beyond dozens of processors. This in fact is the reason we seek parallel algorithms for setting-up and solving large SDPs on large cluster- and/or super-computers. We propose parallel algorithms for stability analysis of two classes of systems: 1) Linear systems with a large number of uncertain parameters; 2) Nonlinear systems defined by polynomial vector fields. First, we develop a distributed parallel algorithm which applies Polya's and/or Handelman's theorems to some variants of parameter-dependent Lyapunov inequalities with parameters defined over the standard simplex. The result is a sequence of SDPs which possess a block-diagonal structure. We then develop a parallel SDP solver which exploits this structure in order to map the computation, memory and communication to a distributed parallel environment. Numerical tests on a supercomputer demonstrate the ability of the algorithm to efficiently utilize hundreds and potentially thousands of processors, and analyze systems with 100+ dimensional state-space. Furthermore, we extend our algorithms to analyze robust stability over more complicated geometries such as hypercubes and arbitrary convex polytopes. Our algorithms can be readily extended to address a wide variety of problems in control such as Hinfinity synthesis for systems with parametric uncertainty and computing control Lyapunov functions.
NASA Astrophysics Data System (ADS)
Mosby, Matthew; Matouš, Karel
2015-12-01
Three-dimensional simulations capable of resolving the large range of spatial scales, from the failure-zone thickness up to the size of the representative unit cell, in damage mechanics problems of particle reinforced adhesives are presented. We show that resolving this wide range of scales in complex three-dimensional heterogeneous morphologies is essential in order to apprehend fracture characteristics, such as strength, fracture toughness and shape of the softening profile. Moreover, we show that computations that resolve essential physical length scales capture the particle size-effect in fracture toughness, for example. In the vein of image-based computational materials science, we construct statistically optimal unit cells containing hundreds to thousands of particles. We show that these statistically representative unit cells are capable of capturing the first- and second-order probability functions of a given data-source with better accuracy than traditional inclusion packing techniques. In order to accomplish these large computations, we use a parallel multiscale cohesive formulation and extend it to finite strains including damage mechanics. The high-performance parallel computational framework is executed on up to 1024 processing cores. A mesh convergence and a representative unit cell study are performed. Quantifying the complex damage patterns in simulations consisting of tens of millions of computational cells and millions of highly nonlinear equations requires data-mining the parallel simulations, and we propose two damage metrics to quantify the damage patterns. A detailed study of volume fraction and filler size on the macroscopic traction-separation response of heterogeneous adhesives is presented.
Chen, Qingkui; Zhao, Deyu; Wang, Jingjuan
2017-01-01
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes’ diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services. PMID:28777325
Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan
2017-08-04
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
NASA Technical Reports Server (NTRS)
Deardorff, Glenn; Djomehri, M. Jahed; Freeman, Ken; Gambrel, Dave; Green, Bryan; Henze, Chris; Hinke, Thomas; Hood, Robert; Kiris, Cetin; Moran, Patrick;
2001-01-01
A series of NASA presentations for the Supercomputing 2001 conference are summarized. The topics include: (1) Mars Surveyor Landing Sites "Collaboratory"; (2) Parallel and Distributed CFD for Unsteady Flows with Moving Overset Grids; (3) IP Multicast for Seamless Support of Remote Science; (4) Consolidated Supercomputing Management Office; (5) Growler: A Component-Based Framework for Distributed/Collaborative Scientific Visualization and Computational Steering; (6) Data Mining on the Information Power Grid (IPG); (7) Debugging on the IPG; (8) Debakey Heart Assist Device: (9) Unsteady Turbopump for Reusable Launch Vehicle; (10) Exploratory Computing Environments Component Framework; (11) OVERSET Computational Fluid Dynamics Tools; (12) Control and Observation in Distributed Environments; (13) Multi-Level Parallelism Scaling on NASA's Origin 1024 CPU System; (14) Computing, Information, & Communications Technology; (15) NAS Grid Benchmarks; (16) IPG: A Large-Scale Distributed Computing and Data Management System; and (17) ILab: Parameter Study Creation and Submission on the IPG.
NASA Astrophysics Data System (ADS)
Suryanarayana, Phanish; Pratapa, Phanisri P.; Sharma, Abhiraj; Pask, John E.
2018-03-01
We present SQDFT: a large-scale parallel implementation of the Spectral Quadrature (SQ) method for O(N) Kohn-Sham Density Functional Theory (DFT) calculations at high temperature. Specifically, we develop an efficient and scalable finite-difference implementation of the infinite-cell Clenshaw-Curtis SQ approach, in which results for the infinite crystal are obtained by expressing quantities of interest as bilinear forms or sums of bilinear forms, that are then approximated by spatially localized Clenshaw-Curtis quadrature rules. We demonstrate the accuracy of SQDFT by showing systematic convergence of energies and atomic forces with respect to SQ parameters to reference diagonalization results, and convergence with discretization to established planewave results, for both metallic and insulating systems. We further demonstrate that SQDFT achieves excellent strong and weak parallel scaling on computer systems consisting of tens of thousands of processors, with near perfect O(N) scaling with system size and wall times as low as a few seconds per self-consistent field iteration. Finally, we verify the accuracy of SQDFT in large-scale quantum molecular dynamics simulations of aluminum at high temperature.
Parallel Visualization Co-Processing of Overnight CFD Propulsion Applications
NASA Technical Reports Server (NTRS)
Edwards, David E.; Haimes, Robert
1999-01-01
An interactive visualization system pV3 is being developed for the investigation of advanced computational methodologies employing visualization and parallel processing for the extraction of information contained in large-scale transient engineering simulations. Visual techniques for extracting information from the data in terms of cutting planes, iso-surfaces, particle tracing and vector fields are included in this system. This paper discusses improvements to the pV3 system developed under NASA's Affordable High Performance Computing project.
Dynamic Load Balancing for Grid Partitioning on a SP-2 Multiprocessor: A Framework
NASA Technical Reports Server (NTRS)
Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)
1994-01-01
Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single EBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
Dynamic Load Balancing For Grid Partitioning on a SP-2 Multiprocessor: A Framework
NASA Technical Reports Server (NTRS)
Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)
1994-01-01
Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single IBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
An Implicit Solver on A Parallel Block-Structured Adaptive Mesh Grid for FLASH
NASA Astrophysics Data System (ADS)
Lee, D.; Gopal, S.; Mohapatra, P.
2012-07-01
We introduce a fully implicit solver for FLASH based on a Jacobian-Free Newton-Krylov (JFNK) approach with an appropriate preconditioner. The main goal of developing this JFNK-type implicit solver is to provide efficient high-order numerical algorithms and methodology for simulating stiff systems of differential equations on large-scale parallel computer architectures. A large number of natural problems in nonlinear physics involve a wide range of spatial and time scales of interest. A system that encompasses such a wide magnitude of scales is described as "stiff." A stiff system can arise in many different fields of physics, including fluid dynamics/aerodynamics, laboratory/space plasma physics, low Mach number flows, reactive flows, radiation hydrodynamics, and geophysical flows. One of the big challenges in solving such a stiff system using current-day computational resources lies in resolving time and length scales varying by several orders of magnitude. We introduce FLASH's preliminary implementation of a time-accurate JFNK-based implicit solver in the framework of FLASH's unsplit hydro solver.
Nishizawa, Hiroaki; Nishimura, Yoshifumi; Kobayashi, Masato; Irle, Stephan; Nakai, Hiromi
2016-08-05
The linear-scaling divide-and-conquer (DC) quantum chemical methodology is applied to the density-functional tight-binding (DFTB) theory to develop a massively parallel program that achieves on-the-fly molecular reaction dynamics simulations of huge systems from scratch. The functions to perform large scale geometry optimization and molecular dynamics with DC-DFTB potential energy surface are implemented to the program called DC-DFTB-K. A novel interpolation-based algorithm is developed for parallelizing the determination of the Fermi level in the DC method. The performance of the DC-DFTB-K program is assessed using a laboratory computer and the K computer. Numerical tests show the high efficiency of the DC-DFTB-K program, a single-point energy gradient calculation of a one-million-atom system is completed within 60 s using 7290 nodes of the K computer. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
Discrete Event Modeling and Massively Parallel Execution of Epidemic Outbreak Phenomena
DOE Office of Scientific and Technical Information (OSTI.GOV)
Perumalla, Kalyan S; Seal, Sudip K
2011-01-01
In complex phenomena such as epidemiological outbreaks, the intensity of inherent feedback effects and the significant role of transients in the dynamics make simulation the only effective method for proactive, reactive or post-facto analysis. The spatial scale, runtime speed, and behavioral detail needed in detailed simulations of epidemic outbreaks make it necessary to use large-scale parallel processing. Here, an optimistic parallel execution of a new discrete event formulation of a reaction-diffusion simulation model of epidemic propagation is presented to facilitate in dramatically increasing the fidelity and speed by which epidemiological simulations can be performed. Rollback support needed during optimistic parallelmore » execution is achieved by combining reverse computation with a small amount of incremental state saving. Parallel speedup of over 5,500 and other runtime performance metrics of the system are observed with weak-scaling execution on a small (8,192-core) Blue Gene / P system, while scalability with a weak-scaling speedup of over 10,000 is demonstrated on 65,536 cores of a large Cray XT5 system. Scenarios representing large population sizes exceeding several hundreds of millions of individuals in the largest cases are successfully exercised to verify model scalability.« less
Parallel implementation of geometrical shock dynamics for two dimensional converging shock waves
NASA Astrophysics Data System (ADS)
Qiu, Shi; Liu, Kuang; Eliasson, Veronica
2016-10-01
Geometrical shock dynamics (GSD) theory is an appealing method to predict the shock motion in the sense that it is more computationally efficient than solving the traditional Euler equations, especially for converging shock waves. However, to solve and optimize large scale configurations, the main bottleneck is the computational cost. Among the existing numerical GSD schemes, there is only one that has been implemented on parallel computers, with the purpose to analyze detonation waves. To extend the computational advantage of the GSD theory to more general applications such as converging shock waves, a numerical implementation using a spatial decomposition method has been coupled with a front tracking approach on parallel computers. In addition, an efficient tridiagonal system solver for massively parallel computers has been applied to resolve the most expensive function in this implementation, resulting in an efficiency of 0.93 while using 32 HPCC cores. Moreover, symmetric boundary conditions have been developed to further reduce the computational cost, achieving a speedup of 19.26 for a 12-sided polygonal converging shock.
NASA Astrophysics Data System (ADS)
Casu, F.; Bonano, M.; de Luca, C.; Lanari, R.; Manunta, M.; Manzo, M.; Zinno, I.
2017-12-01
Since its launch in 2014, the Sentinel-1 (S1) constellation has played a key role on SAR data availability and dissemination all over the World. Indeed, the free and open access data policy adopted by the European Copernicus program together with the global coverage acquisition strategy, make the Sentinel constellation as a game changer in the Earth Observation scenario. Being the SAR data become ubiquitous, the technological and scientific challenge is focused on maximizing the exploitation of such huge data flow. In this direction, the use of innovative processing algorithms and distributed computing infrastructures, such as the Cloud Computing platforms, can play a crucial role. In this work we present a Cloud Computing solution for the advanced interferometric (DInSAR) processing chain based on the Parallel SBAS (P-SBAS) approach, aimed at processing S1 Interferometric Wide Swath (IWS) data for the generation of large spatial scale deformation time series in efficient, automatic and systematic way. Such a DInSAR chain ingests Sentinel 1 SLC images and carries out several processing steps, to finally compute deformation time series and mean deformation velocity maps. Different parallel strategies have been designed ad hoc for each processing step of the P-SBAS S1 chain, encompassing both multi-core and multi-node programming techniques, in order to maximize the computational efficiency achieved within a Cloud Computing environment and cut down the relevant processing times. The presented P-SBAS S1 processing chain has been implemented on the Amazon Web Services platform and a thorough analysis of the attained parallel performances has been performed to identify and overcome the major bottlenecks to the scalability. The presented approach is used to perform national-scale DInSAR analyses over Italy, involving the processing of more than 3000 S1 IWS images acquired from both ascending and descending orbits. Such an experiment confirms the big advantage of exploiting large computational and storage resources of Cloud Computing platforms for large scale DInSAR analysis. The presented Cloud Computing P-SBAS processing chain can be a precious tool in the perspective of developing operational services disposable for the EO scientific community related to hazard monitoring and risk prevention and mitigation.
Experiences with hypercube operating system instrumentation
NASA Technical Reports Server (NTRS)
Reed, Daniel A.; Rudolph, David C.
1989-01-01
The difficulties in conceptualizing the interactions among a large number of processors make it difficult both to identify the sources of inefficiencies and to determine how a parallel program could be made more efficient. This paper describes an instrumentation system that can trace the execution of distributed memory parallel programs by recording the occurrence of parallel program events. The resulting event traces can be used to compile summary statistics that provide a global view of program performance. In addition, visualization tools permit the graphic display of event traces. Visual presentation of performance data is particularly useful, indeed, necessary for large-scale parallel computers; the enormous volume of performance data mandates visual display.
Probabilistic structural mechanics research for parallel processing computers
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Martin, William R.
1991-01-01
Aerospace structures and spacecraft are a complex assemblage of structural components that are subjected to a variety of complex, cyclic, and transient loading conditions. Significant modeling uncertainties are present in these structures, in addition to the inherent randomness of material properties and loads. To properly account for these uncertainties in evaluating and assessing the reliability of these components and structures, probabilistic structural mechanics (PSM) procedures must be used. Much research has focused on basic theory development and the development of approximate analytic solution methods in random vibrations and structural reliability. Practical application of PSM methods was hampered by their computationally intense nature. Solution of PSM problems requires repeated analyses of structures that are often large, and exhibit nonlinear and/or dynamic response behavior. These methods are all inherently parallel and ideally suited to implementation on parallel processing computers. New hardware architectures and innovative control software and solution methodologies are needed to make solution of large scale PSM problems practical.
Massive parallelization of serial inference algorithms for a complex generalized linear model
Suchard, Marc A.; Simpson, Shawn E.; Zorych, Ivan; Ryan, Patrick; Madigan, David
2014-01-01
Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. In this paper we show how high-performance statistical computation, including graphics processing units, relatively inexpensive highly parallel computing devices, can enable complex methods in large databases. We focus on optimization and massive parallelization of cyclic coordinate descent approaches to fit a conditioned generalized linear model involving tens of millions of observations and thousands of predictors in a Bayesian context. We find orders-of-magnitude improvement in overall run-time. Coordinate descent approaches are ubiquitous in high-dimensional statistics and the algorithms we propose open up exciting new methodological possibilities with the potential to significantly improve drug safety. PMID:25328363
Xyce parallel electronic simulator users guide, version 6.1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas; Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers; A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models; Device models that are specifically tailored to meet Sandia's needs, including some radiationaware devices (for Sandia users only); and Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase-a message passing parallel implementation-which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator users' guide, Version 6.0.1.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator users guide, version 6.0.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Hierarchical Parallelism in Finite Difference Analysis of Heat Conduction
NASA Technical Reports Server (NTRS)
Padovan, Joseph; Krishna, Lala; Gute, Douglas
1997-01-01
Based on the concept of hierarchical parallelism, this research effort resulted in highly efficient parallel solution strategies for very large scale heat conduction problems. Overall, the method of hierarchical parallelism involves the partitioning of thermal models into several substructured levels wherein an optimal balance into various associated bandwidths is achieved. The details are described in this report. Overall, the report is organized into two parts. Part 1 describes the parallel modelling methodology and associated multilevel direct, iterative and mixed solution schemes. Part 2 establishes both the formal and computational properties of the scheme.
Combining points and lines in rectifying satellite images
NASA Astrophysics Data System (ADS)
Elaksher, Ahmed F.
2017-09-01
The quick advance in remote sensing technologies established the potential to gather accurate and reliable information about the Earth surface using high resolution satellite images. Remote sensing satellite images of less than one-meter pixel size are currently used in large-scale mapping. Rigorous photogrammetric equations are usually used to describe the relationship between the image coordinates and ground coordinates. These equations require the knowledge of the exterior and interior orientation parameters of the image that might not be available. On the other hand, the parallel projection transformation could be used to represent the mathematical relationship between the image-space and objectspace coordinate systems and provides the required accuracy for large-scale mapping using fewer ground control features. This article investigates the differences between point-based and line-based parallel projection transformation models in rectifying satellite images with different resolutions. The point-based parallel projection transformation model and its extended form are presented and the corresponding line-based forms are developed. Results showed that the RMS computed using the point- or line-based transformation models are equivalent and satisfy the requirement for large-scale mapping. The differences between the transformation parameters computed using the point- and line-based transformation models are insignificant. The results showed high correlation between the differences in the ground elevation and the RMS.
NASA Astrophysics Data System (ADS)
Lee, J.; Kim, K.
A Very Large Scale Integration (VLSI) architecture for robot direct kinematic computation suitable for industrial robot manipulators was investigated. The Denavit-Hartenberg transformations are reviewed to exploit a proper processing element, namely an augmented CORDIC. Specifically, two distinct implementations are elaborated on, such as the bit-serial and parallel. Performance of each scheme is analyzed with respect to the time to compute one location of the end-effector of a 6-links manipulator, and the number of transistors required.
NASA Technical Reports Server (NTRS)
Lee, J.; Kim, K.
1991-01-01
A Very Large Scale Integration (VLSI) architecture for robot direct kinematic computation suitable for industrial robot manipulators was investigated. The Denavit-Hartenberg transformations are reviewed to exploit a proper processing element, namely an augmented CORDIC. Specifically, two distinct implementations are elaborated on, such as the bit-serial and parallel. Performance of each scheme is analyzed with respect to the time to compute one location of the end-effector of a 6-links manipulator, and the number of transistors required.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Childers, J. T.; Uram, T. D.; LeCompte, T. J.
As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the World- wide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. This paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application andmore » the performance that was achieved.« less
Childers, J. T.; Uram, T. D.; LeCompte, T. J.; ...
2016-09-29
As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. Finally, this paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application andmore » the performance that was achieved.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Childers, J. T.; Uram, T. D.; LeCompte, T. J.
As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. Finally, this paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application andmore » the performance that was achieved.« less
Large-eddy simulations of compressible convection on massively parallel computers. [stellar physics
NASA Technical Reports Server (NTRS)
Xie, Xin; Toomre, Juri
1993-01-01
We report preliminary implementation of the large-eddy simulation (LES) technique in 2D simulations of compressible convection carried out on the CM-2 massively parallel computer. The convective flow fields in our simulations possess structures similar to those found in a number of direct simulations, with roll-like flows coherent across the entire depth of the layer that spans several density scale heights. Our detailed assessment of the effects of various subgrid scale (SGS) terms reveals that they may affect the gross character of convection. Yet, somewhat surprisingly, we find that our LES solutions, and another in which the SGS terms are turned off, only show modest differences. The resulting 2D flows realized here are rather laminar in character, and achieving substantial turbulence may require stronger forcing and less dissipation.
The performance of low-cost commercial cloud computing as an alternative in computational chemistry.
Thackston, Russell; Fortenberry, Ryan C
2015-05-05
The growth of commercial cloud computing (CCC) as a viable means of computational infrastructure is largely unexplored for the purposes of quantum chemistry. In this work, the PSI4 suite of computational chemistry programs is installed on five different types of Amazon World Services CCC platforms. The performance for a set of electronically excited state single-point energies is compared between these CCC platforms and typical, "in-house" physical machines. Further considerations are made for the number of cores or virtual CPUs (vCPUs, for the CCC platforms), but no considerations are made for full parallelization of the program (even though parallelization of the BLAS library is implemented), complete high-performance computing cluster utilization, or steal time. Even with this most pessimistic view of the computations, CCC resources are shown to be more cost effective for significant numbers of typical quantum chemistry computations. Large numbers of large computations are still best utilized by more traditional means, but smaller-scale research may be more effectively undertaken through CCC services. © 2015 Wiley Periodicals, Inc.
Parallel scalability of Hartree-Fock calculations
NASA Astrophysics Data System (ADS)
Chow, Edmond; Liu, Xing; Smelyanskiy, Mikhail; Hammond, Jeff R.
2015-03-01
Quantum chemistry is increasingly performed using large cluster computers consisting of multiple interconnected nodes. For a fixed molecular problem, the efficiency of a calculation usually decreases as more nodes are used, due to the cost of communication between the nodes. This paper empirically investigates the parallel scalability of Hartree-Fock calculations. The construction of the Fock matrix and the density matrix calculation are analyzed separately. For the former, we use a parallelization of Fock matrix construction based on a static partitioning of work followed by a work stealing phase. For the latter, we use density matrix purification from the linear scaling methods literature, but without using sparsity. When using large numbers of nodes for moderately sized problems, density matrix computations are network-bandwidth bound, making purification methods potentially faster than eigendecomposition methods.
NASA Astrophysics Data System (ADS)
Timchenko, Leonid; Yarovyi, Andrii; Kokriatskaya, Nataliya; Nakonechna, Svitlana; Abramenko, Ludmila; Ławicki, Tomasz; Popiel, Piotr; Yesmakhanova, Laura
2016-09-01
The paper presents a method of parallel-hierarchical transformations for rapid recognition of dynamic images using GPU technology. Direct parallel-hierarchical transformations based on cluster CPU-and GPU-oriented hardware platform. Mathematic models of training of the parallel hierarchical (PH) network for the transformation are developed, as well as a training method of the PH network for recognition of dynamic images. This research is most topical for problems on organizing high-performance computations of super large arrays of information designed to implement multi-stage sensing and processing as well as compaction and recognition of data in the informational structures and computer devices. This method has such advantages as high performance through the use of recent advances in parallelization, possibility to work with images of ultra dimension, ease of scaling in case of changing the number of nodes in the cluster, auto scan of local network to detect compute nodes.
High performance cellular level agent-based simulation with FLAME for the GPU.
Richmond, Paul; Walker, Dawn; Coakley, Simon; Romano, Daniela
2010-05-01
Driven by the availability of experimental data and ability to simulate a biological scale which is of immediate interest, the cellular scale is fast emerging as an ideal candidate for middle-out modelling. As with 'bottom-up' simulation approaches, cellular level simulations demand a high degree of computational power, which in large-scale simulations can only be achieved through parallel computing. The flexible large-scale agent modelling environment (FLAME) is a template driven framework for agent-based modelling (ABM) on parallel architectures ideally suited to the simulation of cellular systems. It is available for both high performance computing clusters (www.flame.ac.uk) and GPU hardware (www.flamegpu.com) and uses a formal specification technique that acts as a universal modelling format. This not only creates an abstraction from the underlying hardware architectures, but avoids the steep learning curve associated with programming them. In benchmarking tests and simulations of advanced cellular systems, FLAME GPU has reported massive improvement in performance over more traditional ABM frameworks. This allows the time spent in the development and testing stages of modelling to be drastically reduced and creates the possibility of real-time visualisation for simple visual face-validation.
Large Scale Flutter Data for Design of Rotating Blades Using Navier-Stokes Equations
NASA Technical Reports Server (NTRS)
Guruswamy, Guru P.
2012-01-01
A procedure to compute flutter boundaries of rotating blades is presented; a) Navier-Stokes equations. b) Frequency domain method compatible with industry practice. Procedure is initially validated: a) Unsteady loads with flapping wing experiment. b) Flutter boundary with fixed wing experiment. Large scale flutter computation is demonstrated for rotating blade: a) Single job submission script. b) Flutter boundary in 24 hour wall clock time with 100 cores. c) Linearly scalable with number of cores. Tested with 1000 cores that produced data in 25 hrs for 10 flutter boundaries. Further wall-clock speed-up is possible by performing parallel computations within each case.
Distributed-Memory Computing With the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA)
NASA Technical Reports Server (NTRS)
Riley, Christopher J.; Cheatwood, F. McNeil
1997-01-01
The Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA), a Navier-Stokes solver, has been modified for use in a parallel, distributed-memory environment using the Message-Passing Interface (MPI) standard. A standard domain decomposition strategy is used in which the computational domain is divided into subdomains with each subdomain assigned to a processor. Performance is examined on dedicated parallel machines and a network of desktop workstations. The effect of domain decomposition and frequency of boundary updates on performance and convergence is also examined for several realistic configurations and conditions typical of large-scale computational fluid dynamic analysis.
Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Howison, Mark; Bethel, E. Wes; Childs, Hank
2012-01-01
With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large numbers of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells.more » The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.« less
Parallel Dynamics Simulation Using a Krylov-Schwarz Linear Solution Scheme
Abhyankar, Shrirang; Constantinescu, Emil M.; Smith, Barry F.; ...
2016-11-07
Fast dynamics simulation of large-scale power systems is a computational challenge because of the need to solve a large set of stiff, nonlinear differential-algebraic equations at every time step. The main bottleneck in dynamic simulations is the solution of a linear system during each nonlinear iteration of Newton’s method. In this paper, we present a parallel Krylov- Schwarz linear solution scheme that uses the Krylov subspacebased iterative linear solver GMRES with an overlapping restricted additive Schwarz preconditioner. As a result, performance tests of the proposed Krylov-Schwarz scheme for several large test cases ranging from 2,000 to 20,000 buses, including amore » real utility network, show good scalability on different computing architectures.« less
Parallel Dynamics Simulation Using a Krylov-Schwarz Linear Solution Scheme
DOE Office of Scientific and Technical Information (OSTI.GOV)
Abhyankar, Shrirang; Constantinescu, Emil M.; Smith, Barry F.
Fast dynamics simulation of large-scale power systems is a computational challenge because of the need to solve a large set of stiff, nonlinear differential-algebraic equations at every time step. The main bottleneck in dynamic simulations is the solution of a linear system during each nonlinear iteration of Newton’s method. In this paper, we present a parallel Krylov- Schwarz linear solution scheme that uses the Krylov subspacebased iterative linear solver GMRES with an overlapping restricted additive Schwarz preconditioner. As a result, performance tests of the proposed Krylov-Schwarz scheme for several large test cases ranging from 2,000 to 20,000 buses, including amore » real utility network, show good scalability on different computing architectures.« less
Ultrafast and scalable cone-beam CT reconstruction using MapReduce in a cloud computing environment.
Meng, Bowen; Pratx, Guillem; Xing, Lei
2011-12-01
Four-dimensional CT (4DCT) and cone beam CT (CBCT) are widely used in radiation therapy for accurate tumor target definition and localization. However, high-resolution and dynamic image reconstruction is computationally demanding because of the large amount of data processed. Efficient use of these imaging techniques in the clinic requires high-performance computing. The purpose of this work is to develop a novel ultrafast, scalable and reliable image reconstruction technique for 4D CBCT∕CT using a parallel computing framework called MapReduce. We show the utility of MapReduce for solving large-scale medical physics problems in a cloud computing environment. In this work, we accelerated the Feldcamp-Davis-Kress (FDK) algorithm by porting it to Hadoop, an open-source MapReduce implementation. Gated phases from a 4DCT scans were reconstructed independently. Following the MapReduce formalism, Map functions were used to filter and backproject subsets of projections, and Reduce function to aggregate those partial backprojection into the whole volume. MapReduce automatically parallelized the reconstruction process on a large cluster of computer nodes. As a validation, reconstruction of a digital phantom and an acquired CatPhan 600 phantom was performed on a commercial cloud computing environment using the proposed 4D CBCT∕CT reconstruction algorithm. Speedup of reconstruction time is found to be roughly linear with the number of nodes employed. For instance, greater than 10 times speedup was achieved using 200 nodes for all cases, compared to the same code executed on a single machine. Without modifying the code, faster reconstruction is readily achievable by allocating more nodes in the cloud computing environment. Root mean square error between the images obtained using MapReduce and a single-threaded reference implementation was on the order of 10(-7). Our study also proved that cloud computing with MapReduce is fault tolerant: the reconstruction completed successfully with identical results even when half of the nodes were manually terminated in the middle of the process. An ultrafast, reliable and scalable 4D CBCT∕CT reconstruction method was developed using the MapReduce framework. Unlike other parallel computing approaches, the parallelization and speedup required little modification of the original reconstruction code. MapReduce provides an efficient and fault tolerant means of solving large-scale computing problems in a cloud computing environment.
Ultrafast and scalable cone-beam CT reconstruction using MapReduce in a cloud computing environment
Meng, Bowen; Pratx, Guillem; Xing, Lei
2011-01-01
Purpose: Four-dimensional CT (4DCT) and cone beam CT (CBCT) are widely used in radiation therapy for accurate tumor target definition and localization. However, high-resolution and dynamic image reconstruction is computationally demanding because of the large amount of data processed. Efficient use of these imaging techniques in the clinic requires high-performance computing. The purpose of this work is to develop a novel ultrafast, scalable and reliable image reconstruction technique for 4D CBCT/CT using a parallel computing framework called MapReduce. We show the utility of MapReduce for solving large-scale medical physics problems in a cloud computing environment. Methods: In this work, we accelerated the Feldcamp–Davis–Kress (FDK) algorithm by porting it to Hadoop, an open-source MapReduce implementation. Gated phases from a 4DCT scans were reconstructed independently. Following the MapReduce formalism, Map functions were used to filter and backproject subsets of projections, and Reduce function to aggregate those partial backprojection into the whole volume. MapReduce automatically parallelized the reconstruction process on a large cluster of computer nodes. As a validation, reconstruction of a digital phantom and an acquired CatPhan 600 phantom was performed on a commercial cloud computing environment using the proposed 4D CBCT/CT reconstruction algorithm. Results: Speedup of reconstruction time is found to be roughly linear with the number of nodes employed. For instance, greater than 10 times speedup was achieved using 200 nodes for all cases, compared to the same code executed on a single machine. Without modifying the code, faster reconstruction is readily achievable by allocating more nodes in the cloud computing environment. Root mean square error between the images obtained using MapReduce and a single-threaded reference implementation was on the order of 10−7. Our study also proved that cloud computing with MapReduce is fault tolerant: the reconstruction completed successfully with identical results even when half of the nodes were manually terminated in the middle of the process. Conclusions: An ultrafast, reliable and scalable 4D CBCT/CT reconstruction method was developed using the MapReduce framework. Unlike other parallel computing approaches, the parallelization and speedup required little modification of the original reconstruction code. MapReduce provides an efficient and fault tolerant means of solving large-scale computing problems in a cloud computing environment. PMID:22149842
Simulation framework for intelligent transportation systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ewing, T.; Doss, E.; Hanebutte, U.
1996-10-01
A simulation framework has been developed for a large-scale, comprehensive, scaleable simulation of an Intelligent Transportation System (ITS). The simulator is designed for running on parallel computers and distributed (networked) computer systems, but can run on standalone workstations for smaller simulations. The simulator currently models instrumented smart vehicles with in-vehicle navigation units capable of optimal route planning and Traffic Management Centers (TMC). The TMC has probe vehicle tracking capabilities (display position and attributes of instrumented vehicles), and can provide two-way interaction with traffic to provide advisories and link times. Both the in-vehicle navigation module and the TMC feature detailed graphicalmore » user interfaces to support human-factors studies. Realistic modeling of variations of the posted driving speed are based on human factors studies that take into consideration weather, road conditions, driver personality and behavior, and vehicle type. The prototype has been developed on a distributed system of networked UNIX computers but is designed to run on parallel computers, such as ANL`s IBM SP-2, for large-scale problems. A novel feature of the approach is that vehicles are represented by autonomous computer processes which exchange messages with other processes. The vehicles have a behavior model which governs route selection and driving behavior, and can react to external traffic events much like real vehicles. With this approach, the simulation is scaleable to take advantage of emerging massively parallel processor (MPP) systems.« less
A Multiscale Parallel Computing Architecture for Automated Segmentation of the Brain Connectome
Knobe, Kathleen; Newton, Ryan R.; Schlimbach, Frank; Blower, Melanie; Reid, R. Clay
2015-01-01
Several groups in neurobiology have embarked into deciphering the brain circuitry using large-scale imaging of a mouse brain and manual tracing of the connections between neurons. Creating a graph of the brain circuitry, also called a connectome, could have a huge impact on the understanding of neurodegenerative diseases such as Alzheimer’s disease. Although considerably smaller than a human brain, a mouse brain already exhibits one billion connections and manually tracing the connectome of a mouse brain can only be achieved partially. This paper proposes to scale up the tracing by using automated image segmentation and a parallel computing approach designed for domain experts. We explain the design decisions behind our parallel approach and we present our results for the segmentation of the vasculature and the cell nuclei, which have been obtained without any manual intervention. PMID:21926011
Approximate Computing Techniques for Iterative Graph Algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Panyala, Ajay R.; Subasi, Omer; Halappanavar, Mahantesh
Approximate computing enables processing of large-scale graphs by trading off quality for performance. Approximate computing techniques have become critical not only due to the emergence of parallel architectures but also the availability of large scale datasets enabling data-driven discovery. Using two prototypical graph algorithms, PageRank and community detection, we present several approximate computing heuristics to scale the performance with minimal loss of accuracy. We present several heuristics including loop perforation, data caching, incomplete graph coloring and synchronization, and evaluate their efficiency. We demonstrate performance improvements of up to 83% for PageRank and up to 450x for community detection, with lowmore » impact of accuracy for both the algorithms. We expect the proposed approximate techniques will enable scalable graph analytics on data of importance to several applications in science and their subsequent adoption to scale similar graph algorithms.« less
Argonne Simulation Framework for Intelligent Transportation Systems
DOT National Transportation Integrated Search
1996-01-01
A simulation framework has been developed which defines a high-level architecture for a large-scale, comprehensive, scalable simulation of an Intelligent Transportation System (ITS). The simulator is designed to run on parallel computers and distribu...
Cloud Computing Boosts Business Intelligence of Telecommunication Industry
NASA Astrophysics Data System (ADS)
Xu, Meng; Gao, Dan; Deng, Chao; Luo, Zhiguo; Sun, Shaoling
Business Intelligence becomes an attracting topic in today's data intensive applications, especially in telecommunication industry. Meanwhile, Cloud Computing providing IT supporting Infrastructure with excellent scalability, large scale storage, and high performance becomes an effective way to implement parallel data processing and data mining algorithms. BC-PDM (Big Cloud based Parallel Data Miner) is a new MapReduce based parallel data mining platform developed by CMRI (China Mobile Research Institute) to fit the urgent requirements of business intelligence in telecommunication industry. In this paper, the architecture, functionality and performance of BC-PDM are presented, together with the experimental evaluation and case studies of its applications. The evaluation result demonstrates both the usability and the cost-effectiveness of Cloud Computing based Business Intelligence system in applications of telecommunication industry.
Wakefield Computations for the CLIC PETS using the Parallel Finite Element Time-Domain Code T3P
DOE Office of Scientific and Technical Information (OSTI.GOV)
Candel, A; Kabel, A.; Lee, L.
In recent years, SLAC's Advanced Computations Department (ACD) has developed the high-performance parallel 3D electromagnetic time-domain code, T3P, for simulations of wakefields and transients in complex accelerator structures. T3P is based on advanced higher-order Finite Element methods on unstructured grids with quadratic surface approximation. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with unprecedented accuracy, aiding the design of the next generation of accelerator facilities. Applications to the Compact Linear Collider (CLIC) Power Extraction and Transfer Structure (PETS) are presented.
Anandakrishnan, Ramu; Scogland, Tom R W; Fenley, Andrew T; Gordon, John C; Feng, Wu-chun; Onufriev, Alexey V
2010-06-01
Tools that compute and visualize biomolecular electrostatic surface potential have been used extensively for studying biomolecular function. However, determining the surface potential for large biomolecules on a typical desktop computer can take days or longer using currently available tools and methods. Two commonly used techniques to speed-up these types of electrostatic computations are approximations based on multi-scale coarse-graining and parallelization across multiple processors. This paper demonstrates that for the computation of electrostatic surface potential, these two techniques can be combined to deliver significantly greater speed-up than either one separately, something that is in general not always possible. Specifically, the electrostatic potential computation, using an analytical linearized Poisson-Boltzmann (ALPB) method, is approximated using the hierarchical charge partitioning (HCP) multi-scale method, and parallelized on an ATI Radeon 4870 graphical processing unit (GPU). The implementation delivers a combined 934-fold speed-up for a 476,040 atom viral capsid, compared to an equivalent non-parallel implementation on an Intel E6550 CPU without the approximation. This speed-up is significantly greater than the 42-fold speed-up for the HCP approximation alone or the 182-fold speed-up for the GPU alone. Copyright (c) 2010 Elsevier Inc. All rights reserved.
A Web-based Distributed Voluntary Computing Platform for Large Scale Hydrological Computations
NASA Astrophysics Data System (ADS)
Demir, I.; Agliamzanov, R.
2014-12-01
Distributed volunteer computing can enable researchers and scientist to form large parallel computing environments to utilize the computing power of the millions of computers on the Internet, and use them towards running large scale environmental simulations and models to serve the common good of local communities and the world. Recent developments in web technologies and standards allow client-side scripting languages to run at speeds close to native application, and utilize the power of Graphics Processing Units (GPU). Using a client-side scripting language like JavaScript, we have developed an open distributed computing framework that makes it easy for researchers to write their own hydrologic models, and run them on volunteer computers. Users will easily enable their websites for visitors to volunteer sharing their computer resources to contribute running advanced hydrological models and simulations. Using a web-based system allows users to start volunteering their computational resources within seconds without installing any software. The framework distributes the model simulation to thousands of nodes in small spatial and computational sizes. A relational database system is utilized for managing data connections and queue management for the distributed computing nodes. In this paper, we present a web-based distributed volunteer computing platform to enable large scale hydrological simulations and model runs in an open and integrated environment.
Self-Scheduling Parallel Methods for Multiple Serial Codes with Application to WOPWOP
NASA Technical Reports Server (NTRS)
Long, Lyle N.; Brentner, Kenneth S.
2000-01-01
This paper presents a scheme for efficiently running a large number of serial jobs on parallel computers. Two examples are given of computer programs that run relatively quickly, but often they must be run numerous times to obtain all the results needed. It is very common in science and engineering to have codes that are not massive computing challenges in themselves, but due to the number of instances that must be run, they do become large-scale computing problems. The two examples given here represent common problems in aerospace engineering: aerodynamic panel methods and aeroacoustic integral methods. The first example simply solves many systems of linear equations. This is representative of an aerodynamic panel code where someone would like to solve for numerous angles of attack. The complete code for this first example is included in the appendix so that it can be readily used by others as a template. The second example is an aeroacoustics code (WOPWOP) that solves the Ffowcs Williams Hawkings equation to predict the far-field sound due to rotating blades. In this example, one quite often needs to compute the sound at numerous observer locations, hence parallelization is utilized to automate the noise computation for a large number of observers.
Large-Scale, Parallel, Multi-Sensor Atmospheric Data Fusion Using Cloud Computing
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Manipon, G.; Hua, H.; Fetzer, E. J.
2013-12-01
NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the 'A-Train' platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (MERRA), stratify the comparisons using a classification of the 'cloud scenes' from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Figure 1 shows the architecture of the full computational system, with SciReduce at the core. Multi-year datasets are automatically 'sharded' by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will present the architecture of SciReduce, describe the achieved 'clock time' speedups in fusing datasets on our own compute nodes and in the public Cloud, and discuss the Cloud cost tradeoffs for storage, compute, and data transfer. We will also present a concept/prototype for staging NASA's A-Train Atmospheric datasets (Levels 2 & 3) in the Amazon Cloud so that any number of compute jobs can be executed 'near' the multi-sensor data. Given such a system, multi-sensor climate studies over 10-20 years of data could be performed in an efficient way, with the researcher paying only his own Cloud compute bill. SciReduce Architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Suryanarayana, Phanish; Pratapa, Phanisri P.; Sharma, Abhiraj
We present SQDFT: a large-scale parallel implementation of the Spectral Quadrature (SQ) method formore » $$\\mathscr{O}(N)$$ Kohn–Sham Density Functional Theory (DFT) calculations at high temperature. Specifically, we develop an efficient and scalable finite-difference implementation of the infinite-cell Clenshaw–Curtis SQ approach, in which results for the infinite crystal are obtained by expressing quantities of interest as bilinear forms or sums of bilinear forms, that are then approximated by spatially localized Clenshaw–Curtis quadrature rules. We demonstrate the accuracy of SQDFT by showing systematic convergence of energies and atomic forces with respect to SQ parameters to reference diagonalization results, and convergence with discretization to established planewave results, for both metallic and insulating systems. Here, we further demonstrate that SQDFT achieves excellent strong and weak parallel scaling on computer systems consisting of tens of thousands of processors, with near perfect $$\\mathscr{O}(N)$$ scaling with system size and wall times as low as a few seconds per self-consistent field iteration. Finally, we verify the accuracy of SQDFT in large-scale quantum molecular dynamics simulations of aluminum at high temperature.« less
Suryanarayana, Phanish; Pratapa, Phanisri P.; Sharma, Abhiraj; ...
2017-12-07
We present SQDFT: a large-scale parallel implementation of the Spectral Quadrature (SQ) method formore » $$\\mathscr{O}(N)$$ Kohn–Sham Density Functional Theory (DFT) calculations at high temperature. Specifically, we develop an efficient and scalable finite-difference implementation of the infinite-cell Clenshaw–Curtis SQ approach, in which results for the infinite crystal are obtained by expressing quantities of interest as bilinear forms or sums of bilinear forms, that are then approximated by spatially localized Clenshaw–Curtis quadrature rules. We demonstrate the accuracy of SQDFT by showing systematic convergence of energies and atomic forces with respect to SQ parameters to reference diagonalization results, and convergence with discretization to established planewave results, for both metallic and insulating systems. Here, we further demonstrate that SQDFT achieves excellent strong and weak parallel scaling on computer systems consisting of tens of thousands of processors, with near perfect $$\\mathscr{O}(N)$$ scaling with system size and wall times as low as a few seconds per self-consistent field iteration. Finally, we verify the accuracy of SQDFT in large-scale quantum molecular dynamics simulations of aluminum at high temperature.« less
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak
1999-01-01
The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2OOO, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2OOO, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
Parallelization of the FLAPW method and comparison with the PPW method
NASA Astrophysics Data System (ADS)
Canning, Andrew; Mannstadt, Wolfgang; Freeman, Arthur
2000-03-01
The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining electronic and magnetic properties of crystals and surfaces. In the past the FLAPW method has been limited to systems of about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell running on up to 512 processors on a Cray T3E parallel supercomputer. Some results will also be presented on a comparison of the plane-wave pseudopotential method and the FLAPW method on large systems.
Creating a Parallel Version of VisIt for Microsoft Windows
DOE Office of Scientific and Technical Information (OSTI.GOV)
Whitlock, B J; Biagas, K S; Rawson, P L
2011-12-07
VisIt is a popular, free interactive parallel visualization and analysis tool for scientific data. Users can quickly generate visualizations from their data, animate them through time, manipulate them, and save the resulting images or movies for presentations. VisIt was designed from the ground up to work on many scales of computers from modest desktops up to massively parallel clusters. VisIt is comprised of a set of cooperating programs. All programs can be run locally or in client/server mode in which some run locally and some run remotely on compute clusters. The VisIt program most able to harness today's computing powermore » is the VisIt compute engine. The compute engine is responsible for reading simulation data from disk, processing it, and sending results or images back to the VisIt viewer program. In a parallel environment, the compute engine runs several processes, coordinating using the Message Passing Interface (MPI) library. Each MPI process reads some subset of the scientific data and filters the data in various ways to create useful visualizations. By using MPI, VisIt has been able to scale well into the thousands of processors on large computers such as dawn and graph at LLNL. The advent of multicore CPU's has made parallelism the 'new' way to achieve increasing performance. With today's computers having at least 2 cores and in many cases up to 8 and beyond, it is more important than ever to deploy parallel software that can use that computing power not only on clusters but also on the desktop. We have created a parallel version of VisIt for Windows that uses Microsoft's MPI implementation (MSMPI) to process data in parallel on the Windows desktop as well as on a Windows HPC cluster running Microsoft Windows Server 2008. Initial desktop parallel support for Windows was deployed in VisIt 2.4.0. Windows HPC cluster support has been completed and will appear in the VisIt 2.5.0 release. We plan to continue supporting parallel VisIt on Windows so our users will be able to take full advantage of their multicore resources.« less
Settgast, Randolph R.; Fu, Pengcheng; Walsh, Stuart D. C.; ...
2016-09-18
This study describes a fully coupled finite element/finite volume approach for simulating field-scale hydraulically driven fractures in three dimensions, using massively parallel computing platforms. The proposed method is capable of capturing realistic representations of local heterogeneities, layering and natural fracture networks in a reservoir. A detailed description of the numerical implementation is provided, along with numerical studies comparing the model with both analytical solutions and experimental results. The results demonstrate the effectiveness of the proposed method for modeling large-scale problems involving hydraulically driven fractures in three dimensions.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Settgast, Randolph R.; Fu, Pengcheng; Walsh, Stuart D. C.
This study describes a fully coupled finite element/finite volume approach for simulating field-scale hydraulically driven fractures in three dimensions, using massively parallel computing platforms. The proposed method is capable of capturing realistic representations of local heterogeneities, layering and natural fracture networks in a reservoir. A detailed description of the numerical implementation is provided, along with numerical studies comparing the model with both analytical solutions and experimental results. The results demonstrate the effectiveness of the proposed method for modeling large-scale problems involving hydraulically driven fractures in three dimensions.
pcircle - A Suite of Scalable Parallel File System Tools
DOE Office of Scientific and Technical Information (OSTI.GOV)
WANG, FEIYI
2015-10-01
Most of the software related to file system are written for conventional local file system, they are serialized and can't take advantage of the benefit of a large scale parallel file system. "pcircle" software builds on top of ubiquitous MPI in cluster computing environment and "work-stealing" pattern to provide a scalable, high-performance suite of file system tools. In particular - it implemented parallel data copy and parallel data checksumming, with advanced features such as async progress report, checkpoint and restart, as well as integrity checking.
Template Interfaces for Agile Parallel Data-Intensive Science
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ramakrishnan, Lavanya; Gunter, Daniel; Pastorello, Gilerto Z.
Tigres provides a programming library to compose and execute large-scale data-intensive scientific workflows from desktops to supercomputers. DOE User Facilities and large science collaborations are increasingly generating large enough data sets that it is no longer practical to download them to a desktop to operate on them. They are instead stored at centralized compute and storage resources such as high performance computing (HPC) centers. Analysis of this data requires an ability to run on these facilities, but with current technologies, scaling an analysis to an HPC center and to a large data set is difficult even for experts. Tigres ismore » addressing the challenge of enabling collaborative analysis of DOE Science data through a new concept of reusable "templates" that enable scientists to easily compose, run and manage collaborative computational tasks. These templates define common computation patterns used in analyzing a data set.« less
PLUM: Parallel Load Balancing for Unstructured Adaptive Meshes. Degree awarded by Colorado Univ.
NASA Technical Reports Server (NTRS)
Oliker, Leonid
1998-01-01
Dynamic mesh adaption on unstructured grids is a powerful tool for computing large-scale problems that require grid modifications to efficiently resolve solution features. By locally refining and coarsening the mesh to capture physical phenomena of interest, such procedures make standard computational methods more cost effective. Unfortunately, an efficient parallel implementation of these adaptive methods is rather difficult to achieve, primarily due to the load imbalance created by the dynamically-changing nonuniform grid. This requires significant communication at runtime, leading to idle processors and adversely affecting the total execution time. Nonetheless, it is generally thought that unstructured adaptive- grid techniques will constitute a significant fraction of future high-performance supercomputing. Various dynamic load balancing methods have been reported to date; however, most of them either lack a global view of loads across processors or do not apply their techniques to realistic large-scale applications.
Afshar, Yaser; Sbalzarini, Ivo F.
2016-01-01
Modern fluorescence microscopy modalities, such as light-sheet microscopy, are capable of acquiring large three-dimensional images at high data rate. This creates a bottleneck in computational processing and analysis of the acquired images, as the rate of acquisition outpaces the speed of processing. Moreover, images can be so large that they do not fit the main memory of a single computer. We address both issues by developing a distributed parallel algorithm for segmentation of large fluorescence microscopy images. The method is based on the versatile Discrete Region Competition algorithm, which has previously proven useful in microscopy image segmentation. The present distributed implementation decomposes the input image into smaller sub-images that are distributed across multiple computers. Using network communication, the computers orchestrate the collectively solving of the global segmentation problem. This not only enables segmentation of large images (we test images of up to 1010 pixels), but also accelerates segmentation to match the time scale of image acquisition. Such acquisition-rate image segmentation is a prerequisite for the smart microscopes of the future and enables online data compression and interactive experiments. PMID:27046144
Afshar, Yaser; Sbalzarini, Ivo F
2016-01-01
Modern fluorescence microscopy modalities, such as light-sheet microscopy, are capable of acquiring large three-dimensional images at high data rate. This creates a bottleneck in computational processing and analysis of the acquired images, as the rate of acquisition outpaces the speed of processing. Moreover, images can be so large that they do not fit the main memory of a single computer. We address both issues by developing a distributed parallel algorithm for segmentation of large fluorescence microscopy images. The method is based on the versatile Discrete Region Competition algorithm, which has previously proven useful in microscopy image segmentation. The present distributed implementation decomposes the input image into smaller sub-images that are distributed across multiple computers. Using network communication, the computers orchestrate the collectively solving of the global segmentation problem. This not only enables segmentation of large images (we test images of up to 10(10) pixels), but also accelerates segmentation to match the time scale of image acquisition. Such acquisition-rate image segmentation is a prerequisite for the smart microscopes of the future and enables online data compression and interactive experiments.
Large scale cardiac modeling on the Blue Gene supercomputer.
Reumann, Matthias; Fitch, Blake G; Rayshubskiy, Aleksandr; Keller, David U; Weiss, Daniel L; Seemann, Gunnar; Dössel, Olaf; Pitman, Michael C; Rice, John J
2008-01-01
Multi-scale, multi-physical heart models have not yet been able to include a high degree of accuracy and resolution with respect to model detail and spatial resolution due to computational limitations of current systems. We propose a framework to compute large scale cardiac models. Decomposition of anatomical data in segments to be distributed on a parallel computer is carried out by optimal recursive bisection (ORB). The algorithm takes into account a computational load parameter which has to be adjusted according to the cell models used. The diffusion term is realized by the monodomain equations. The anatomical data-set was given by both ventricles of the Visible Female data-set in a 0.2 mm resolution. Heterogeneous anisotropy was included in the computation. Model weights as input for the decomposition and load balancing were set to (a) 1 for tissue and 0 for non-tissue elements; (b) 10 for tissue and 1 for non-tissue elements. Scaling results for 512, 1024, 2048, 4096 and 8192 computational nodes were obtained for 10 ms simulation time. The simulations were carried out on an IBM Blue Gene/L parallel computer. A 1 s simulation was then carried out on 2048 nodes for the optimal model load. Load balances did not differ significantly across computational nodes even if the number of data elements distributed to each node differed greatly. Since the ORB algorithm did not take into account computational load due to communication cycles, the speedup is close to optimal for the computation time but not optimal overall due to the communication overhead. However, the simulation times were reduced form 87 minutes on 512 to 11 minutes on 8192 nodes. This work demonstrates that it is possible to run simulations of the presented detailed cardiac model within hours for the simulation of a heart beat.
PETSc Users Manual Revision 3.7
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balay, Satish; Abhyankar, S.; Adams, M.
This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication.
PETSc Users Manual Revision 3.8
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balay, S.; Abhyankar, S.; Adams, M.
This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication.
Load Balancing Scientific Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pearce, Olga Tkachyshyn
2014-12-01
The largest supercomputers have millions of independent processors, and concurrency levels are rapidly increasing. For ideal efficiency, developers of the simulations that run on these machines must ensure that computational work is evenly balanced among processors. Assigning work evenly is challenging because many large modern parallel codes simulate behavior of physical systems that evolve over time, and their workloads change over time. Furthermore, the cost of imbalanced load increases with scale because most large-scale scientific simulations today use a Single Program Multiple Data (SPMD) parallel programming model, and an increasing number of processors will wait for the slowest one atmore » the synchronization points. To address load imbalance, many large-scale parallel applications use dynamic load balance algorithms to redistribute work evenly. The research objective of this dissertation is to develop methods to decide when and how to load balance the application, and to balance it effectively and affordably. We measure and evaluate the computational load of the application, and develop strategies to decide when and how to correct the imbalance. Depending on the simulation, a fast, local load balance algorithm may be suitable, or a more sophisticated and expensive algorithm may be required. We developed a model for comparison of load balance algorithms for a specific state of the simulation that enables the selection of a balancing algorithm that will minimize overall runtime.« less
Large-scale virtual screening on public cloud resources with Apache Spark.
Capuccini, Marco; Ahmed, Laeeq; Schaal, Wesley; Laure, Erwin; Spjuth, Ola
2017-01-01
Structure-based virtual screening is an in-silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive, however it constitutes a trivially parallelizable task. Most of the available parallel implementations are based on message passing interface, relying on low failure rate hardware and fast network connection. Google's MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, providing transparent scalability and fault tolerance at the software level. Open source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark. We developed a method to run existing docking-based screening software on distributed cloud resources, utilizing the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, docking a publicly available target receptor against [Formula: see text]2.2 M compounds. The performance experiments show a good parallel efficiency (87%) when running in a public cloud environment. Our method enables parallel Structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve allows for trying out our method on relatively small libraries first and then to scale to larger libraries. Our implementation is named Spark-VS and it is freely available as open source from GitHub (https://github.com/mcapuccini/spark-vs).Graphical abstract.
Use of parallel computing in mass processing of laser data
NASA Astrophysics Data System (ADS)
Będkowski, J.; Bratuś, R.; Prochaska, M.; Rzonca, A.
2015-12-01
The first part of the paper includes a description of the rules used to generate the algorithm needed for the purpose of parallel computing and also discusses the origins of the idea of research on the use of graphics processors in large scale processing of laser scanning data. The next part of the paper includes the results of an efficiency assessment performed for an array of different processing options, all of which were substantially accelerated with parallel computing. The processing options were divided into the generation of orthophotos using point clouds, coloring of point clouds, transformations, and the generation of a regular grid, as well as advanced processes such as the detection of planes and edges, point cloud classification, and the analysis of data for the purpose of quality control. Most algorithms had to be formulated from scratch in the context of the requirements of parallel computing. A few of the algorithms were based on existing technology developed by the Dephos Software Company and then adapted to parallel computing in the course of this research study. Processing time was determined for each process employed for a typical quantity of data processed, which helped confirm the high efficiency of the solutions proposed and the applicability of parallel computing to the processing of laser scanning data. The high efficiency of parallel computing yields new opportunities in the creation and organization of processing methods for laser scanning data.
LAMMPS strong scaling performance optimization on Blue Gene/Q
DOE Office of Scientific and Technical Information (OSTI.GOV)
Coffman, Paul; Jiang, Wei; Romero, Nichols A.
2014-11-12
LAMMPS "Large-scale Atomic/Molecular Massively Parallel Simulator" is an open-source molecular dynamics package from Sandia National Laboratories. Significant performance improvements in strong-scaling and time-to-solution for this application on IBM's Blue Gene/Q have been achieved through computational optimizations of the OpenMP versions of the short-range Lennard-Jones term of the CHARMM force field and the long-range Coulombic interaction implemented with the PPPM (particle-particle-particle mesh) algorithm, enhanced by runtime parameter settings controlling thread utilization. Additionally, MPI communication performance improvements were made to the PPPM calculation by re-engineering the parallel 3D FFT to use MPICH collectives instead of point-to-point. Performance testing was done using anmore » 8.4-million atom simulation scaling up to 16 racks on the Mira system at Argonne Leadership Computing Facility (ALCF). Speedups resulting from this effort were in some cases over 2x.« less
Large-Scale, Parallel, Multi-Sensor Atmospheric Data Fusion Using Cloud Computing
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Manipon, G.; Hua, H.; Fetzer, E.
2013-05-01
NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Figure 1 shows the architecture of the full computational system, with SciReduce at the core. Multi-year datasets are automatically "sharded" by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will present the architecture of SciReduce, describe the achieved "clock time" speedups in fusing datasets on our own nodes and in the Cloud, and discuss the Cloud cost tradeoffs for storage, compute, and data transfer. We will also present a concept/prototype for staging NASA's A-Train Atmospheric datasets (Levels 2 & 3) in the Amazon Cloud so that any number of compute jobs can be executed "near" the multi-sensor data. Given such a system, multi-sensor climate studies over 10-20 years of data could be performed in an efficient way, with the researcher paying only his own Cloud compute bill.; Figure 1 -- Architecture.
Yang, C L; Wei, H Y; Adler, A; Soleimani, M
2013-06-01
Electrical impedance tomography (EIT) is a fast and cost-effective technique to provide a tomographic conductivity image of a subject from boundary current-voltage data. This paper proposes a time and memory efficient method for solving a large scale 3D EIT inverse problem using a parallel conjugate gradient (CG) algorithm. The 3D EIT system with a large number of measurement data can produce a large size of Jacobian matrix; this could cause difficulties in computer storage and the inversion process. One of challenges in 3D EIT is to decrease the reconstruction time and memory usage, at the same time retaining the image quality. Firstly, a sparse matrix reduction technique is proposed using thresholding to set very small values of the Jacobian matrix to zero. By adjusting the Jacobian matrix into a sparse format, the element with zeros would be eliminated, which results in a saving of memory requirement. Secondly, a block-wise CG method for parallel reconstruction has been developed. The proposed method has been tested using simulated data as well as experimental test samples. Sparse Jacobian with a block-wise CG enables the large scale EIT problem to be solved efficiently. Image quality measures are presented to quantify the effect of sparse matrix reduction in reconstruction results.
Parallel Simulation of Three-Dimensional Free Surface Fluid Flow Problems
DOE Office of Scientific and Technical Information (OSTI.GOV)
BAER,THOMAS A.; SACKINGER,PHILIP A.; SUBIA,SAMUEL R.
1999-10-14
Simulation of viscous three-dimensional fluid flow typically involves a large number of unknowns. When free surfaces are included, the number of unknowns increases dramatically. Consequently, this class of problem is an obvious application of parallel high performance computing. We describe parallel computation of viscous, incompressible, free surface, Newtonian fluid flow problems that include dynamic contact fines. The Galerkin finite element method was used to discretize the fully-coupled governing conservation equations and a ''pseudo-solid'' mesh mapping approach was used to determine the shape of the free surface. In this approach, the finite element mesh is allowed to deform to satisfy quasi-staticmore » solid mechanics equations subject to geometric or kinematic constraints on the boundaries. As a result, nodal displacements must be included in the set of unknowns. Other issues discussed are the proper constraints appearing along the dynamic contact line in three dimensions. Issues affecting efficient parallel simulations include problem decomposition to equally distribute computational work among a SPMD computer and determination of robust, scalable preconditioners for the distributed matrix systems that must be solved. Solution continuation strategies important for serial simulations have an enhanced relevance in a parallel coquting environment due to the difficulty of solving large scale systems. Parallel computations will be demonstrated on an example taken from the coating flow industry: flow in the vicinity of a slot coater edge. This is a three dimensional free surface problem possessing a contact line that advances at the web speed in one region but transitions to static behavior in another region. As such, a significant fraction of the computational time is devoted to processing boundary data. Discussion focuses on parallel speed ups for fixed problem size, a class of problems of immediate practical importance.« less
Parallelization of fine-scale computation in Agile Multiscale Modelling Methodology
NASA Astrophysics Data System (ADS)
Macioł, Piotr; Michalik, Kazimierz
2016-10-01
Nowadays, multiscale modelling of material behavior is an extensively developed area. An important obstacle against its wide application is high computational demands. Among others, the parallelization of multiscale computations is a promising solution. Heterogeneous multiscale models are good candidates for parallelization, since communication between sub-models is limited. In this paper, the possibility of parallelization of multiscale models based on Agile Multiscale Methodology framework is discussed. A sequential, FEM based macroscopic model has been combined with concurrently computed fine-scale models, employing a MatCalc thermodynamic simulator. The main issues, being investigated in this work are: (i) the speed-up of multiscale models with special focus on fine-scale computations and (ii) on decreasing the quality of computations enforced by parallel execution. Speed-up has been evaluated on the basis of Amdahl's law equations. The problem of `delay error', rising from the parallel execution of fine scale sub-models, controlled by the sequential macroscopic sub-model is discussed. Some technical aspects of combining third-party commercial modelling software with an in-house multiscale framework and a MPI library are also discussed.
Misra, Sanchit; Pamnany, Kiran; Aluru, Srinivas
2015-01-01
Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Sohn, Andrew
1996-01-01
Dynamic mesh adaption on unstructured grids is a powerful tool for efficiently computing unsteady problems to resolve solution features of interest. Unfortunately, this causes load imbalance among processors on a parallel machine. This paper describes the parallel implementation of a tetrahedral mesh adaption scheme and a new global load balancing method. A heuristic remapping algorithm is presented that assigns partitions to processors such that the redistribution cost is minimized. Results indicate that the parallel performance of the mesh adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64 processors of an SP2 when 35% of the mesh is randomly adapted. For large-scale scientific computations, our load balancing strategy gives almost a sixfold reduction in solver execution times over non-balanced loads. Furthermore, our heuristic remapper yields processor assignments that are less than 3% off the optimal solutions but requires only 1% of the computational time.
NASA Astrophysics Data System (ADS)
Quan, Zhe; Wu, Lei
2017-09-01
This article investigates the use of parallel computing for solving the disjunctively constrained knapsack problem. The proposed parallel computing model can be viewed as a cooperative algorithm based on a multi-neighbourhood search. The cooperation system is composed of a team manager and a crowd of team members. The team members aim at applying their own search strategies to explore the solution space. The team manager collects the solutions from the members and shares the best one with them. The performance of the proposed method is evaluated on a group of benchmark data sets. The results obtained are compared to those reached by the best methods from the literature. The results show that the proposed method is able to provide the best solutions in most cases. In order to highlight the robustness of the proposed parallel computing model, a new set of large-scale instances is introduced. Encouraging results have been obtained.
NASA Astrophysics Data System (ADS)
Fonseca, R. A.; Vieira, J.; Fiuza, F.; Davidson, A.; Tsung, F. S.; Mori, W. B.; Silva, L. O.
2013-12-01
A new generation of laser wakefield accelerators (LWFA), supported by the extreme accelerating fields generated in the interaction of PW-Class lasers and underdense targets, promises the production of high quality electron beams in short distances for multiple applications. Achieving this goal will rely heavily on numerical modelling to further understand the underlying physics and identify optimal regimes, but large scale modelling of these scenarios is computationally heavy and requires the efficient use of state-of-the-art petascale supercomputing systems. We discuss the main difficulties involved in running these simulations and the new developments implemented in the OSIRIS framework to address these issues, ranging from multi-dimensional dynamic load balancing and hybrid distributed/shared memory parallelism to the vectorization of the PIC algorithm. We present the results of the OASCR Joule Metric program on the issue of large scale modelling of LWFA, demonstrating speedups of over 1 order of magnitude on the same hardware. Finally, scalability to over ˜106 cores and sustained performance over ˜2 P Flops is demonstrated, opening the way for large scale modelling of LWFA scenarios.
Processing large remote sensing image data sets on Beowulf clusters
Steinwand, Daniel R.; Maddox, Brian; Beckmann, Tim; Schmidt, Gail
2003-01-01
High-performance computing is often concerned with the speed at which floating- point calculations can be performed. The architectures of many parallel computers and/or their network topologies are based on these investigations. Often, benchmarks resulting from these investigations are compiled with little regard to how a large dataset would move about in these systems. This part of the Beowulf study addresses that concern by looking at specific applications software and system-level modifications. Applications include an implementation of a smoothing filter for time-series data, a parallel implementation of the decision tree algorithm used in the Landcover Characterization project, a parallel Kriging algorithm used to fit point data collected in the field on invasive species to a regular grid, and modifications to the Beowulf project's resampling algorithm to handle larger, higher resolution datasets at a national scale. Systems-level investigations include a feasibility study on Flat Neighborhood Networks and modifications of that concept with Parallel File Systems.
A Parallel Adaboost-Backpropagation Neural Network for Massive Image Dataset Classification
NASA Astrophysics Data System (ADS)
Cao, Jianfang; Chen, Lichao; Wang, Min; Shi, Hao; Tian, Yun
2016-12-01
Image classification uses computers to simulate human understanding and cognition of images by automatically categorizing images. This study proposes a faster image classification approach that parallelizes the traditional Adaboost-Backpropagation (BP) neural network using the MapReduce parallel programming model. First, we construct a strong classifier by assembling the outputs of 15 BP neural networks (which are individually regarded as weak classifiers) based on the Adaboost algorithm. Second, we design Map and Reduce tasks for both the parallel Adaboost-BP neural network and the feature extraction algorithm. Finally, we establish an automated classification model by building a Hadoop cluster. We use the Pascal VOC2007 and Caltech256 datasets to train and test the classification model. The results are superior to those obtained using traditional Adaboost-BP neural network or parallel BP neural network approaches. Our approach increased the average classification accuracy rate by approximately 14.5% and 26.0% compared to the traditional Adaboost-BP neural network and parallel BP neural network, respectively. Furthermore, the proposed approach requires less computation time and scales very well as evaluated by speedup, sizeup and scaleup. The proposed approach may provide a foundation for automated large-scale image classification and demonstrates practical value.
A Parallel Adaboost-Backpropagation Neural Network for Massive Image Dataset Classification.
Cao, Jianfang; Chen, Lichao; Wang, Min; Shi, Hao; Tian, Yun
2016-12-01
Image classification uses computers to simulate human understanding and cognition of images by automatically categorizing images. This study proposes a faster image classification approach that parallelizes the traditional Adaboost-Backpropagation (BP) neural network using the MapReduce parallel programming model. First, we construct a strong classifier by assembling the outputs of 15 BP neural networks (which are individually regarded as weak classifiers) based on the Adaboost algorithm. Second, we design Map and Reduce tasks for both the parallel Adaboost-BP neural network and the feature extraction algorithm. Finally, we establish an automated classification model by building a Hadoop cluster. We use the Pascal VOC2007 and Caltech256 datasets to train and test the classification model. The results are superior to those obtained using traditional Adaboost-BP neural network or parallel BP neural network approaches. Our approach increased the average classification accuracy rate by approximately 14.5% and 26.0% compared to the traditional Adaboost-BP neural network and parallel BP neural network, respectively. Furthermore, the proposed approach requires less computation time and scales very well as evaluated by speedup, sizeup and scaleup. The proposed approach may provide a foundation for automated large-scale image classification and demonstrates practical value.
A Parallel Adaboost-Backpropagation Neural Network for Massive Image Dataset Classification
Cao, Jianfang; Chen, Lichao; Wang, Min; Shi, Hao; Tian, Yun
2016-01-01
Image classification uses computers to simulate human understanding and cognition of images by automatically categorizing images. This study proposes a faster image classification approach that parallelizes the traditional Adaboost-Backpropagation (BP) neural network using the MapReduce parallel programming model. First, we construct a strong classifier by assembling the outputs of 15 BP neural networks (which are individually regarded as weak classifiers) based on the Adaboost algorithm. Second, we design Map and Reduce tasks for both the parallel Adaboost-BP neural network and the feature extraction algorithm. Finally, we establish an automated classification model by building a Hadoop cluster. We use the Pascal VOC2007 and Caltech256 datasets to train and test the classification model. The results are superior to those obtained using traditional Adaboost-BP neural network or parallel BP neural network approaches. Our approach increased the average classification accuracy rate by approximately 14.5% and 26.0% compared to the traditional Adaboost-BP neural network and parallel BP neural network, respectively. Furthermore, the proposed approach requires less computation time and scales very well as evaluated by speedup, sizeup and scaleup. The proposed approach may provide a foundation for automated large-scale image classification and demonstrates practical value. PMID:27905520
NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations
NASA Astrophysics Data System (ADS)
Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W. A.
2010-09-01
The latest release of NWChem delivers an open-source computational chemistry package with extensive capabilities for large scale simulations of chemical and biological systems. Utilizing a common computational framework, diverse theoretical descriptions can be used to provide the best solution for a given scientific problem. Scalable parallel implementations and modular software design enable efficient utilization of current computational architectures. This paper provides an overview of NWChem focusing primarily on the core theoretical modules provided by the code and their parallel performance. Program summaryProgram title: NWChem Catalogue identifier: AEGI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGI_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Open Source Educational Community License No. of lines in distributed program, including test data, etc.: 11 709 543 No. of bytes in distributed program, including test data, etc.: 680 696 106 Distribution format: tar.gz Programming language: Fortran 77, C Computer: all Linux based workstations and parallel supercomputers, Windows and Apple machines Operating system: Linux, OS X, Windows Has the code been vectorised or parallelized?: Code is parallelized Classification: 2.1, 2.2, 3, 7.3, 7.7, 16.1, 16.2, 16.3, 16.10, 16.13 Nature of problem: Large-scale atomistic simulations of chemical and biological systems require efficient and reliable methods for ground and excited solutions of many-electron Hamiltonian, analysis of the potential energy surface, and dynamics. Solution method: Ground and excited solutions of many-electron Hamiltonian are obtained utilizing density-functional theory, many-body perturbation approach, and coupled cluster expansion. These solutions or a combination thereof with classical descriptions are then used to analyze potential energy surface and perform dynamical simulations. Additional comments: Full documentation is provided in the distribution file. This includes an INSTALL file giving details of how to build the package. A set of test runs is provided in the examples directory. The distribution file for this program is over 90 Mbytes and therefore is not delivered directly when download or Email is requested. Instead a html file giving details of how the program can be obtained is sent. Running time: Running time depends on the size of the chemical system, complexity of the method, number of cpu's and the computational task. It ranges from several seconds for serial DFT energy calculations on a few atoms to several hours for parallel coupled cluster energy calculations on tens of atoms or ab-initio molecular dynamics simulation on hundreds of atoms.
An S N Algorithm for Modern Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baker, Randal Scott
2016-08-29
LANL discrete ordinates transport packages are required to perform large, computationally intensive time-dependent calculations on massively parallel architectures, where even a single such calculation may need many months to complete. While KBA methods scale out well to very large numbers of compute nodes, we are limited by practical constraints on the number of such nodes we can actually apply to any given calculation. Instead, we describe a modified KBA algorithm that allows realization of the reductions in solution time offered by both the current, and future, architectural changes within a compute node.
Large-scale Parallel Unstructured Mesh Computations for 3D High-lift Analysis
NASA Technical Reports Server (NTRS)
Mavriplis, Dimitri J.; Pirzadeh, S.
1999-01-01
A complete "geometry to drag-polar" analysis capability for the three-dimensional high-lift configurations is described. The approach is based on the use of unstructured meshes in order to enable rapid turnaround for complicated geometries that arise in high-lift configurations. Special attention is devoted to creating a capability for enabling analyses on highly resolved grids. Unstructured meshes of several million vertices are initially generated on a work-station, and subsequently refined on a supercomputer. The flow is solved on these refined meshes on large parallel computers using an unstructured agglomeration multigrid algorithm. Good prediction of lift and drag throughout the range of incidences is demonstrated on a transport take-off configuration using up to 24.7 million grid points. The feasibility of using this approach in a production environment on existing parallel machines is demonstrated, as well as the scalability of the solver on machines using up to 1450 processors.
Final Report, DE-FG01-06ER25718 Domain Decomposition and Parallel Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Widlund, Olof B.
2015-06-09
The goal of this project is to develop and improve domain decomposition algorithms for a variety of partial differential equations such as those of linear elasticity and electro-magnetics.These iterative methods are designed for massively parallel computing systems and allow the fast solution of the very large systems of algebraic equations that arise in large scale and complicated simulations. A special emphasis is placed on problems arising from Maxwell's equation. The approximate solvers, the preconditioners, are combined with the conjugate gradient method and must always include a solver of a coarse model in order to have a performance which is independentmore » of the number of processors used in the computer simulation. A recent development allows for an adaptive construction of this coarse component of the preconditioner.« less
Pesce, Lorenzo L.; Lee, Hyong C.; Hereld, Mark; ...
2013-01-01
Our limited understanding of the relationship between the behavior of individual neurons and large neuronal networks is an important limitation in current epilepsy research and may be one of the main causes of our inadequate ability to treat it. Addressing this problem directly via experiments is impossibly complex; thus, we have been developing and studying medium-large-scale simulations of detailed neuronal networks to guide us. Flexibility in the connection schemas and a complete description of the cortical tissue seem necessary for this purpose. In this paper we examine some of the basic issues encountered in these multiscale simulations. We have determinedmore » the detailed behavior of two such simulators on parallel computer systems. The observed memory and computation-time scaling behavior for a distributed memory implementation were very good over the range studied, both in terms of network sizes (2,000 to 400,000 neurons) and processor pool sizes (1 to 256 processors). Our simulations required between a few megabytes and about 150 gigabytes of RAM and lasted between a few minutes and about a week, well within the capability of most multinode clusters. Therefore, simulations of epileptic seizures on networks with millions of cells should be feasible on current supercomputers.« less
Synchronization Of Parallel Discrete Event Simulations
NASA Technical Reports Server (NTRS)
Steinman, Jeffrey S.
1992-01-01
Adaptive, parallel, discrete-event-simulation-synchronization algorithm, Breathing Time Buckets, developed in Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) operating system. Algorithm allows parallel simulations to process events optimistically in fluctuating time cycles that naturally adapt while simulation in progress. Combines best of optimistic and conservative synchronization strategies while avoiding major disadvantages. Algorithm processes events optimistically in time cycles adapting while simulation in progress. Well suited for modeling communication networks, for large-scale war games, for simulated flights of aircraft, for simulations of computer equipment, for mathematical modeling, for interactive engineering simulations, and for depictions of flows of information.
Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.« less
Grindon, Christina; Harris, Sarah; Evans, Tom; Novik, Keir; Coveney, Peter; Laughton, Charles
2004-07-15
Molecular modelling played a central role in the discovery of the structure of DNA by Watson and Crick. Today, such modelling is done on computers: the more powerful these computers are, the more detailed and extensive can be the study of the dynamics of such biological macromolecules. To fully harness the power of modern massively parallel computers, however, we need to develop and deploy algorithms which can exploit the structure of such hardware. The Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a scalable molecular dynamics code including long-range Coulomb interactions, which has been specifically designed to function efficiently on parallel platforms. Here we describe the implementation of the AMBER98 force field in LAMMPS and its validation for molecular dynamics investigations of DNA structure and flexibility against the benchmark of results obtained with the long-established code AMBER6 (Assisted Model Building with Energy Refinement, version 6). Extended molecular dynamics simulations on the hydrated DNA dodecamer d(CTTTTGCAAAAG)(2), which has previously been the subject of extensive dynamical analysis using AMBER6, show that it is possible to obtain excellent agreement in terms of static, dynamic and thermodynamic parameters between AMBER6 and LAMMPS. In comparison with AMBER6, LAMMPS shows greatly improved scalability in massively parallel environments, opening up the possibility of efficient simulations of order-of-magnitude larger systems and/or for order-of-magnitude greater simulation times.
Load Balancing Strategies for Multi-Block Overset Grid Applications
NASA Technical Reports Server (NTRS)
Djomehri, M. Jahed; Biswas, Rupak; Lopez-Benitez, Noe; Biegel, Bryan (Technical Monitor)
2002-01-01
The multi-block overset grid method is a powerful technique for high-fidelity computational fluid dynamics (CFD) simulations about complex aerospace configurations. The solution process uses a grid system that discretizes the problem domain by using separately generated but overlapping structured grids that periodically update and exchange boundary information through interpolation. For efficient high performance computations of large-scale realistic applications using this methodology, the individual grids must be properly partitioned among the parallel processors. Overall performance, therefore, largely depends on the quality of load balancing. In this paper, we present three different load balancing strategies far overset grids and analyze their effects on the parallel efficiency of a Navier-Stokes CFD application running on an SGI Origin2000 machine.
Reverse engineering and analysis of large genome-scale gene networks
Aluru, Maneesha; Zola, Jaroslaw; Nettleton, Dan; Aluru, Srinivas
2013-01-01
Reverse engineering the whole-genome networks of complex multicellular organisms continues to remain a challenge. While simpler models easily scale to large number of genes and gene expression datasets, more accurate models are compute intensive limiting their scale of applicability. To enable fast and accurate reconstruction of large networks, we developed Tool for Inferring Network of Genes (TINGe), a parallel mutual information (MI)-based program. The novel features of our approach include: (i) B-spline-based formulation for linear-time computation of MI, (ii) a novel algorithm for direct permutation testing and (iii) development of parallel algorithms to reduce run-time and facilitate construction of large networks. We assess the quality of our method by comparison with ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) and GeneNet and demonstrate its unique capability by reverse engineering the whole-genome network of Arabidopsis thaliana from 3137 Affymetrix ATH1 GeneChips in just 9 min on a 1024-core cluster. We further report on the development of a new software Gene Network Analyzer (GeNA) for extracting context-specific subnetworks from a given set of seed genes. Using TINGe and GeNA, we performed analysis of 241 Arabidopsis AraCyc 8.0 pathways, and the results are made available through the web. PMID:23042249
Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs
NASA Astrophysics Data System (ADS)
Wang, D.; Zhang, J.; Wei, Y.
2013-12-01
As the spatial and temporal resolutions of Earth observatory data and Earth system simulation outputs are getting higher, in-situ and/or post- processing such large amount of geospatial data increasingly becomes a bottleneck in scientific inquires of Earth systems and their human impacts. Existing geospatial techniques that are based on outdated computing models (e.g., serial algorithms and disk-resident systems), as have been implemented in many commercial and open source packages, are incapable of processing large-scale geospatial data and achieve desired level of performance. In this study, we have developed a set of parallel data structures and algorithms that are capable of utilizing massively data parallel computing power available on commodity Graphics Processing Units (GPUs) for a popular geospatial technique called Zonal Statistics. Given two input datasets with one representing measurements (e.g., temperature or precipitation) and the other one represent polygonal zones (e.g., ecological or administrative zones), Zonal Statistics computes major statistics (or complete distribution histograms) of the measurements in all regions. Our technique has four steps and each step can be mapped to GPU hardware by identifying its inherent data parallelisms. First, a raster is divided into blocks and per-block histograms are derived. Second, the Minimum Bounding Boxes (MBRs) of polygons are computed and are spatially matched with raster blocks; matched polygon-block pairs are tested and blocks that are either inside or intersect with polygons are identified. Third, per-block histograms are aggregated to polygons for blocks that are completely within polygons. Finally, for blocks that intersect with polygon boundaries, all the raster cells within the blocks are examined using point-in-polygon-test and cells that are within polygons are used to update corresponding histograms. As the task becomes I/O bound after applying spatial indexing and GPU hardware acceleration, we have developed a GPU-based data compression technique by reusing our previous work on Bitplane Quadtree (or BPQ-Tree) based indexing of binary bitmaps. Results have shown that our GPU-based parallel Zonal Statistic technique on 3000+ US counties over 20+ billion NASA SRTM 30 meter resolution Digital Elevation (DEM) raster cells has achieved impressive end-to-end runtimes: 101 seconds and 46 seconds a low-end workstation equipped with a Nvidia GTX Titan GPU using cold and hot cache, respectively; and, 60-70 seconds using a single OLCF TITAN computing node and 10-15 seconds using 8 nodes. Our experiment results clearly show the potentials of using high-end computing facilities for large-scale geospatial processing.
A study of the parallel algorithm for large-scale DC simulation of nonlinear systems
NASA Astrophysics Data System (ADS)
Cortés Udave, Diego Ernesto; Ogrodzki, Jan; Gutiérrez de Anda, Miguel Angel
Newton-Raphson DC analysis of large-scale nonlinear circuits may be an extremely time consuming process even if sparse matrix techniques and bypassing of nonlinear models calculation are used. A slight decrease in the time required for this task may be enabled on multi-core, multithread computers if the calculation of the mathematical models for the nonlinear elements as well as the stamp management of the sparse matrix entries are managed through concurrent processes. This numerical complexity can be further reduced via the circuit decomposition and parallel solution of blocks taking as a departure point the BBD matrix structure. This block-parallel approach may give a considerable profit though it is strongly dependent on the system topology and, of course, on the processor type. This contribution presents the easy-parallelizable decomposition-based algorithm for DC simulation and provides a detailed study of its effectiveness.
Extreme-Scale De Novo Genome Assembly
DOE Office of Scientific and Technical Information (OSTI.GOV)
Georganas, Evangelos; Hofmeyr, Steven; Egan, Rob
De novo whole genome assembly reconstructs genomic sequence from short, overlapping, and potentially erroneous DNA segments and is one of the most important computations in modern genomics. This work presents HipMER, a high-quality end-to-end de novo assembler designed for extreme scale analysis, via efficient parallelization of the Meraculous code. Genome assembly software has many components, each of which stresses different components of a computer system. This chapter explains the computational challenges involved in each step of the HipMer pipeline, the key distributed data structures, and communication costs in detail. We present performance results of assembling the human genome and themore » large hexaploid wheat genome on large supercomputers up to tens of thousands of cores.« less
Parallel/Vector Integration Methods for Dynamical Astronomy
NASA Astrophysics Data System (ADS)
Fukushima, Toshio
1999-01-01
This paper reviews three recent works on the numerical methods to integrate ordinary differential equations (ODE), which are specially designed for parallel, vector, and/or multi-processor-unit(PU) computers. The first is the Picard-Chebyshev method (Fukushima, 1997a). It obtains a global solution of ODE in the form of Chebyshev polynomial of large (> 1000) degree by applying the Picard iteration repeatedly. The iteration converges for smooth problems and/or perturbed dynamics. The method runs around 100-1000 times faster in the vector mode than in the scalar mode of a certain computer with vector processors (Fukushima, 1997b). The second is a parallelization of a symplectic integrator (Saha et al., 1997). It regards the implicit midpoint rules covering thousands of timesteps as large-scale nonlinear equations and solves them by the fixed-point iteration. The method is applicable to Hamiltonian systems and is expected to lead an acceleration factor of around 50 in parallel computers with more than 1000 PUs. The last is a parallelization of the extrapolation method (Ito and Fukushima, 1997). It performs trial integrations in parallel. Also the trial integrations are further accelerated by balancing computational load among PUs by the technique of folding. The method is all-purpose and achieves an acceleration factor of around 3.5 by using several PUs. Finally, we give a perspective on the parallelization of some implicit integrators which require multiple corrections in solving implicit formulas like the implicit Hermitian integrators (Makino and Aarseth, 1992), (Hut et al., 1995) or the implicit symmetric multistep methods (Fukushima, 1998), (Fukushima, 1999).
Integration experiences and performance studies of A COTS parallel archive systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Bary
2010-01-01
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf(COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching and lessmore » robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, ls, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petaflop/s computing system, LANL's Roadrunner, and demonstrated its capability to address requirements of future archival storage systems.« less
Integration experiments and performance studies of a COTS parallel archive system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Gary
2010-06-16
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching andmore » less robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, Is, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petafiop/s computing system, LANL's Roadrunner machine, and demonstrated its capability to address requirements of future archival storage systems.« less
Hine, N D M; Haynes, P D; Mostofi, A A; Payne, M C
2010-09-21
We present calculations of formation energies of defects in an ionic solid (Al(2)O(3)) extrapolated to the dilute limit, corresponding to a simulation cell of infinite size. The large-scale calculations required for this extrapolation are enabled by developments in the approach to parallel sparse matrix algebra operations, which are central to linear-scaling density-functional theory calculations. The computational cost of manipulating sparse matrices, whose sizes are determined by the large number of basis functions present, is greatly improved with this new approach. We present details of the sparse algebra scheme implemented in the ONETEP code using hierarchical sparsity patterns, and demonstrate its use in calculations on a wide range of systems, involving thousands of atoms on hundreds to thousands of parallel processes.
Avoiding and tolerating latency in large-scale next-generation shared-memory multiprocessors
NASA Technical Reports Server (NTRS)
Probst, David K.
1993-01-01
A scalable solution to the memory-latency problem is necessary to prevent the large latencies of synchronization and memory operations inherent in large-scale shared-memory multiprocessors from reducing high performance. We distinguish latency avoidance and latency tolerance. Latency is avoided when data is brought to nearby locales for future reference. Latency is tolerated when references are overlapped with other computation. Latency-avoiding locales include: processor registers, data caches used temporally, and nearby memory modules. Tolerating communication latency requires parallelism, allowing the overlap of communication and computation. Latency-tolerating techniques include: vector pipelining, data caches used spatially, prefetching in various forms, and multithreading in various forms. Relaxing the consistency model permits increased use of avoidance and tolerance techniques. Each model is a mapping from the program text to sets of partial orders on program operations; it is a convention about which temporal precedences among program operations are necessary. Information about temporal locality and parallelism constrains the use of avoidance and tolerance techniques. Suitable architectural primitives and compiler technology are required to exploit the increased freedom to reorder and overlap operations in relaxed models.
NASA Astrophysics Data System (ADS)
Lin, Youzuo; O'Malley, Daniel; Vesselinov, Velimir V.
2016-09-01
Inverse modeling seeks model parameters given a set of observations. However, for practical problems because the number of measurements is often large and the model parameters are also numerous, conventional methods for inverse modeling can be computationally expensive. We have developed a new, computationally efficient parallel Levenberg-Marquardt method for solving inverse modeling problems with a highly parameterized model space. Levenberg-Marquardt methods require the solution of a linear system of equations which can be prohibitively expensive to compute for moderate to large-scale problems. Our novel method projects the original linear problem down to a Krylov subspace such that the dimensionality of the problem can be significantly reduced. Furthermore, we store the Krylov subspace computed when using the first damping parameter and recycle the subspace for the subsequent damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved using these computational techniques. We apply this new inverse modeling method to invert for random transmissivity fields in 2-D and a random hydraulic conductivity field in 3-D. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) in the model domain. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). By comparing with Levenberg-Marquardt methods using standard linear inversion techniques such as QR or SVD methods, our Levenberg-Marquardt method yields a speed-up ratio on the order of ˜101 to ˜102 in a multicore computational environment. Therefore, our new inverse modeling method is a powerful tool for characterizing subsurface heterogeneity for moderate to large-scale problems.
NASA Technical Reports Server (NTRS)
Talbot, Bryan; Zhou, Shu-Jia; Higgins, Glenn; Zukor, Dorothy (Technical Monitor)
2002-01-01
One of the most significant challenges in large-scale climate modeling, as well as in high-performance computing in other scientific fields, is that of effectively integrating many software models from multiple contributors. A software framework facilitates the integration task, both in the development and runtime stages of the simulation. Effective software frameworks reduce the programming burden for the investigators, freeing them to focus more on the science and less on the parallel communication implementation. while maintaining high performance across numerous supercomputer and workstation architectures. This document surveys numerous software frameworks for potential use in Earth science modeling. Several frameworks are evaluated in depth, including Parallel Object-Oriented Methods and Applications (POOMA), Cactus (from (he relativistic physics community), Overture, Goddard Earth Modeling System (GEMS), the National Center for Atmospheric Research Flux Coupler, and UCLA/UCB Distributed Data Broker (DDB). Frameworks evaluated in less detail include ROOT, Parallel Application Workspace (PAWS), and Advanced Large-Scale Integrated Computational Environment (ALICE). A host of other frameworks and related tools are referenced in this context. The frameworks are evaluated individually and also compared with each other.
The Scalable Checkpoint/Restart Library
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moody, A.
The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to worite our and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes.more » It has been benchmarked as high as 720GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster thanthe parallel file system.« less
High performance computing and communications: Advancing the frontiers of information technology
DOE Office of Scientific and Technical Information (OSTI.GOV)
NONE
1997-12-31
This report, which supplements the President`s Fiscal Year 1997 Budget, describes the interagency High Performance Computing and Communications (HPCC) Program. The HPCC Program will celebrate its fifth anniversary in October 1996 with an impressive array of accomplishments to its credit. Over its five-year history, the HPCC Program has focused on developing high performance computing and communications technologies that can be applied to computation-intensive applications. Major highlights for FY 1996: (1) High performance computing systems enable practical solutions to complex problems with accuracies not possible five years ago; (2) HPCC-funded research in very large scale networking techniques has been instrumental inmore » the evolution of the Internet, which continues exponential growth in size, speed, and availability of information; (3) The combination of hardware capability measured in gigaflop/s, networking technology measured in gigabit/s, and new computational science techniques for modeling phenomena has demonstrated that very large scale accurate scientific calculations can be executed across heterogeneous parallel processing systems located thousands of miles apart; (4) Federal investments in HPCC software R and D support researchers who pioneered the development of parallel languages and compilers, high performance mathematical, engineering, and scientific libraries, and software tools--technologies that allow scientists to use powerful parallel systems to focus on Federal agency mission applications; and (5) HPCC support for virtual environments has enabled the development of immersive technologies, where researchers can explore and manipulate multi-dimensional scientific and engineering problems. Educational programs fostered by the HPCC Program have brought into classrooms new science and engineering curricula designed to teach computational science. This document contains a small sample of the significant HPCC Program accomplishments in FY 1996.« less
NASA Astrophysics Data System (ADS)
Yang, Sheng-Chun; Lu, Zhong-Yuan; Qian, Hu-Jun; Wang, Yong-Lei; Han, Jie-Ping
2017-11-01
In this work, we upgraded the electrostatic interaction method of CU-ENUF (Yang, et al., 2016) which first applied CUNFFT (nonequispaced Fourier transforms based on CUDA) to the reciprocal-space electrostatic computation and made the computation of electrostatic interaction done thoroughly in GPU. The upgraded edition of CU-ENUF runs concurrently in a hybrid parallel way that enables the computation parallelizing on multiple computer nodes firstly, then further on the installed GPU in each computer. By this parallel strategy, the size of simulation system will be never restricted to the throughput of a single CPU or GPU. The most critical technical problem is how to parallelize a CUNFFT in the parallel strategy, which is conquered effectively by deep-seated research of basic principles and some algorithm skills. Furthermore, the upgraded method is capable of computing electrostatic interactions for both the atomistic molecular dynamics (MD) and the dissipative particle dynamics (DPD). Finally, the benchmarks conducted for validation and performance indicate that the upgraded method is able to not only present a good precision when setting suitable parameters, but also give an efficient way to compute electrostatic interactions for huge simulation systems. Program Files doi:http://dx.doi.org/10.17632/zncf24fhpv.1 Licensing provisions: GNU General Public License 3 (GPL) Programming language: C, C++, and CUDA C Supplementary material: The program is designed for effective electrostatic interactions of large-scale simulation systems, which runs on particular computers equipped with NVIDIA GPUs. It has been tested on (a) single computer node with Intel(R) Core(TM) i7-3770@ 3.40 GHz (CPU) and GTX 980 Ti (GPU), and (b) MPI parallel computer nodes with the same configurations. Nature of problem: For molecular dynamics simulation, the electrostatic interaction is the most time-consuming computation because of its long-range feature and slow convergence in simulation space, which approximately take up most of the total simulation time. Although the parallel method CU-ENUF (Yang et al., 2016) based on GPU has achieved a qualitative leap compared with previous methods in electrostatic interactions computation, the computation capability is limited to the throughput capacity of a single GPU for super-scale simulation system. Therefore, we should look for an effective method to handle the calculation of electrostatic interactions efficiently for a simulation system with super-scale size. Solution method: We constructed a hybrid parallel architecture, in which CPU and GPU are combined to accelerate the electrostatic computation effectively. Firstly, the simulation system is divided into many subtasks via domain-decomposition method. Then MPI (Message Passing Interface) is used to implement the CPU-parallel computation with each computer node corresponding to a particular subtask, and furthermore each subtask in one computer node will be executed in GPU in parallel efficiently. In this hybrid parallel method, the most critical technical problem is how to parallelize a CUNFFT (nonequispaced fast Fourier transform based on CUDA) in the parallel strategy, which is conquered effectively by deep-seated research of basic principles and some algorithm skills. Restrictions: The HP-ENUF is mainly oriented to super-scale system simulations, in which the performance superiority is shown adequately. However, for a small simulation system containing less than 106 particles, the mode of multiple computer nodes has no apparent efficiency advantage or even lower efficiency due to the serious network delay among computer nodes, than the mode of single computer node. References: (1) S.-C. Yang, H.-J. Qian, Z.-Y. Lu, Appl. Comput. Harmon. Anal. 2016, http://dx.doi.org/10.1016/j.acha.2016.04.009. (2) S.-C. Yang, Y.-L. Wang, G.-S. Jiao, H.-J. Qian, Z.-Y. Lu, J. Comput. Chem. 37 (2016) 378. (3) S.-C. Yang, Y.-L. Zhu, H.-J. Qian, Z.-Y. Lu, Appl. Chem. Res. Chin. Univ., 2017, http://dx.doi.org/10.1007/s40242-016-6354-5. (4) Y.-L. Zhu, H. Liu, Z.-W. Li, H.-J. Qian, G. Milano, Z.-Y. Lu, J. Comput. Chem. 34 (2013) 2197.
High Performance Geostatistical Modeling of Biospheric Resources
NASA Astrophysics Data System (ADS)
Pedelty, J. A.; Morisette, J. T.; Smith, J. A.; Schnase, J. L.; Crosier, C. S.; Stohlgren, T. J.
2004-12-01
We are using parallel geostatistical codes to study spatial relationships among biospheric resources in several study areas. For example, spatial statistical models based on large- and small-scale variability have been used to predict species richness of both native and exotic plants (hot spots of diversity) and patterns of exotic plant invasion. However, broader use of geostastics in natural resource modeling, especially at regional and national scales, has been limited due to the large computing requirements of these applications. To address this problem, we implemented parallel versions of the kriging spatial interpolation algorithm. The first uses the Message Passing Interface (MPI) in a master/slave paradigm on an open source Linux Beowulf cluster, while the second is implemented with the new proprietary Xgrid distributed processing system on an Xserve G5 cluster from Apple Computer, Inc. These techniques are proving effective and provide the basis for a national decision support capability for invasive species management that is being jointly developed by NASA and the US Geological Survey.
A learnable parallel processing architecture towards unity of memory and computing
NASA Astrophysics Data System (ADS)
Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.
2015-08-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
A learnable parallel processing architecture towards unity of memory and computing.
Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J
2015-08-14
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
Performance Characterization of Global Address Space Applications: A Case Study with NWChem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hammond, Jeffrey R.; Krishnamoorthy, Sriram; Shende, Sameer
The use of global address space languages and one-sided communication for complex applications is gaining attention in the parallel computing community. However, lack of good evaluative methods to observe multiple levels of performance makes it difficult to isolate the cause of performance deficiencies and to understand the fundamental limitations of system and application design for future improvement. NWChem is a popular computational chemistry package which depends on the Global Arrays/ ARMCI suite for partitioned global address space functionality to deliver high-end molecular modeling capabilities. A workload characterization methodology was developed to support NWChem performance engineering on large-scale parallel platforms. Themore » research involved both the integration of performance instrumentation and measurement in the NWChem software, as well as the analysis of one-sided communication performance in the context of NWChem workloads. Scaling studies were conducted for NWChem on Blue Gene/P and on two large-scale clusters using different generation Infiniband interconnects and x86 processors. The performance analysis and results show how subtle changes in the runtime parameters related to the communication subsystem could have significant impact on performance behavior. The tool has successfully identified several algorithmic bottlenecks which are already being tackled by computational chemists to improve NWChem performance.« less
Accelerated Adaptive MGS Phase Retrieval
NASA Technical Reports Server (NTRS)
Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang
2011-01-01
The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm to allow parallel processing of certain applications that apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited for this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time. The exploit involves performing matrix calculations in nVidia graphic cards. The graphical processor unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model is used, called CUDA, to transparently scale multicore parallelism in hardware. This technology gives computationally intensive applications access to the processing power of the nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies, to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphic cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.
Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Sohn, Andrew
1996-01-01
Dynamic mesh adaptation on unstructured grids is a powerful tool for efficiently computing unsteady problems to resolve solution features of interest. Unfortunately, this causes load inbalances among processors on a parallel machine. This paper described the parallel implementation of a tetrahedral mesh adaption scheme and a new global load balancing method. A heuristic remapping algorithm is presented that assigns partitions to processors such that the redistribution coast is minimized. Results indicate that the parallel performance of the mesh adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64 processors of an SP2 when 35 percent of the mesh is randomly adapted. For large scale scientific computations, our load balancing strategy gives an almost sixfold reduction in solver execution times over non-balanced loads. Furthermore, our heuristic remappier yields processor assignments that are less than 3 percent of the optimal solutions, but requires only 1 percent of the computational time.
NASA Astrophysics Data System (ADS)
Lin, Mingpei; Xu, Ming; Fu, Xiaoyu
2017-05-01
Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.
NASA Astrophysics Data System (ADS)
Lin, Y.; O'Malley, D.; Vesselinov, V. V.
2015-12-01
Inverse modeling seeks model parameters given a set of observed state variables. However, for many practical problems due to the facts that the observed data sets are often large and model parameters are often numerous, conventional methods for solving the inverse modeling can be computationally expensive. We have developed a new, computationally-efficient Levenberg-Marquardt method for solving large-scale inverse modeling. Levenberg-Marquardt methods require the solution of a dense linear system of equations which can be prohibitively expensive to compute for large-scale inverse problems. Our novel method projects the original large-scale linear problem down to a Krylov subspace, such that the dimensionality of the measurements can be significantly reduced. Furthermore, instead of solving the linear system for every Levenberg-Marquardt damping parameter, we store the Krylov subspace computed when solving the first damping parameter and recycle it for all the following damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved by using these computational techniques. We apply this new inverse modeling method to invert for a random transitivity field. Our algorithm is fast enough to solve for the distributed model parameters (transitivity) at each computational node in the model domain. The inversion is also aided by the use regularization techniques. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). Julia is an advanced high-level scientific programing language that allows for efficient memory management and utilization of high-performance computational resources. By comparing with a Levenberg-Marquardt method using standard linear inversion techniques, our Levenberg-Marquardt method yields speed-up ratio of 15 in a multi-core computational environment and a speed-up ratio of 45 in a single-core computational environment. Therefore, our new inverse modeling method is a powerful tool for large-scale applications.
Approaching the exa-scale: a real-world evaluation of rendering extremely large data sets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Patchett, John M; Ahrens, James P; Lo, Li - Ta
2010-10-15
Extremely large scale analysis is becoming increasingly important as supercomputers and their simulations move from petascale to exascale. The lack of dedicated hardware acceleration for rendering on today's supercomputing platforms motivates our detailed evaluation of the possibility of interactive rendering on the supercomputer. In order to facilitate our understanding of rendering on the supercomputing platform, we focus on scalability of rendering algorithms and architecture envisioned for exascale datasets. To understand tradeoffs for dealing with extremely large datasets, we compare three different rendering algorithms for large polygonal data: software based ray tracing, software based rasterization and hardware accelerated rasterization. We presentmore » a case study of strong and weak scaling of rendering extremely large data on both GPU and CPU based parallel supercomputers using Para View, a parallel visualization tool. Wc use three different data sets: two synthetic and one from a scientific application. At an extreme scale, algorithmic rendering choices make a difference and should be considered while approaching exascale computing, visualization, and analysis. We find software based ray-tracing offers a viable approach for scalable rendering of the projected future massive data sizes.« less
On the impact of approximate computation in an analog DeSTIN architecture.
Young, Steven; Lu, Junjie; Holleman, Jeremy; Arel, Itamar
2014-05-01
Deep machine learning (DML) holds the potential to revolutionize machine learning by automating rich feature extraction, which has become the primary bottleneck of human engineering in pattern recognition systems. However, the heavy computational burden renders DML systems implemented on conventional digital processors impractical for large-scale problems. The highly parallel computations required to implement large-scale deep learning systems are well suited to custom hardware. Analog computation has demonstrated power efficiency advantages of multiple orders of magnitude relative to digital systems while performing nonideal computations. In this paper, we investigate typical error sources introduced by analog computational elements and their impact on system-level performance in DeSTIN--a compositional deep learning architecture. These inaccuracies are evaluated on a pattern classification benchmark, clearly demonstrating the robustness of the underlying algorithm to the errors introduced by analog computational elements. A clear understanding of the impacts of nonideal computations is necessary to fully exploit the efficiency of analog circuits.
PoPLAR: Portal for Petascale Lifescience Applications and Research
2013-01-01
Background We are focusing specifically on fast data analysis and retrieval in bioinformatics that will have a direct impact on the quality of human health and the environment. The exponential growth of data generated in biology research, from small atoms to big ecosystems, necessitates an increasingly large computational component to perform analyses. Novel DNA sequencing technologies and complementary high-throughput approaches--such as proteomics, genomics, metabolomics, and meta-genomics--drive data-intensive bioinformatics. While individual research centers or universities could once provide for these applications, this is no longer the case. Today, only specialized national centers can deliver the level of computing resources required to meet the challenges posed by rapid data growth and the resulting computational demand. Consequently, we are developing massively parallel applications to analyze the growing flood of biological data and contribute to the rapid discovery of novel knowledge. Methods The efforts of previous National Science Foundation (NSF) projects provided for the generation of parallel modules for widely used bioinformatics applications on the Kraken supercomputer. We have profiled and optimized the code of some of the scientific community's most widely used desktop and small-cluster-based applications, including BLAST from the National Center for Biotechnology Information (NCBI), HMMER, and MUSCLE; scaled them to tens of thousands of cores on high-performance computing (HPC) architectures; made them robust and portable to next-generation architectures; and incorporated these parallel applications in science gateways with a web-based portal. Results This paper will discuss the various developmental stages, challenges, and solutions involved in taking bioinformatics applications from the desktop to petascale with a front-end portal for very-large-scale data analysis in the life sciences. Conclusions This research will help to bridge the gap between the rate of data generation and the speed at which scientists can study this data. The ability to rapidly analyze data at such a large scale is having a significant, direct impact on science achieved by collaborators who are currently using these tools on supercomputers. PMID:23902523
Portability and Cross-Platform Performance of an MPI-Based Parallel Polygon Renderer
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1999-01-01
Visualizing the results of computations performed on large-scale parallel computers is a challenging problem, due to the size of the datasets involved. One approach is to perform the visualization and graphics operations in place, exploiting the available parallelism to obtain the necessary rendering performance. Over the past several years, we have been developing algorithms and software to support visualization applications on NASA's parallel supercomputers. Our results have been incorporated into a parallel polygon rendering system called PGL. PGL was initially developed on tightly-coupled distributed-memory message-passing systems, including Intel's iPSC/860 and Paragon, and IBM's SP2. Over the past year, we have ported it to a variety of additional platforms, including the HP Exemplar, SGI Origin2OOO, Cray T3E, and clusters of Sun workstations. In implementing PGL, we have had two primary goals: cross-platform portability and high performance. Portability is important because (1) our manpower resources are limited, making it difficult to develop and maintain multiple versions of the code, and (2) NASA's complement of parallel computing platforms is diverse and subject to frequent change. Performance is important in delivering adequate rendering rates for complex scenes and ensuring that parallel computing resources are used effectively. Unfortunately, these two goals are often at odds. In this paper we report on our experiences with portability and performance of the PGL polygon renderer across a range of parallel computing platforms.
Chen, Weiliang; De Schutter, Erik
2017-01-01
Stochastic, spatial reaction-diffusion simulations have been widely used in systems biology and computational neuroscience. However, the increasing scale and complexity of models and morphologies have exceeded the capacity of any serial implementation. This led to the development of parallel solutions that benefit from the boost in performance of modern supercomputers. In this paper, we describe an MPI-based, parallel operator-splitting implementation for stochastic spatial reaction-diffusion simulations with irregular tetrahedral meshes. The performance of our implementation is first examined and analyzed with simulations of a simple model. We then demonstrate its application to real-world research by simulating the reaction-diffusion components of a published calcium burst model in both Purkinje neuron sub-branch and full dendrite morphologies. Simulation results indicate that our implementation is capable of achieving super-linear speedup for balanced loading simulations with reasonable molecule density and mesh quality. In the best scenario, a parallel simulation with 2,000 processes runs more than 3,600 times faster than its serial SSA counterpart, and achieves more than 20-fold speedup relative to parallel simulation with 100 processes. In a more realistic scenario with dynamic calcium influx and data recording, the parallel simulation with 1,000 processes and no load balancing is still 500 times faster than the conventional serial SSA simulation. PMID:28239346
Chen, Weiliang; De Schutter, Erik
2017-01-01
Stochastic, spatial reaction-diffusion simulations have been widely used in systems biology and computational neuroscience. However, the increasing scale and complexity of models and morphologies have exceeded the capacity of any serial implementation. This led to the development of parallel solutions that benefit from the boost in performance of modern supercomputers. In this paper, we describe an MPI-based, parallel operator-splitting implementation for stochastic spatial reaction-diffusion simulations with irregular tetrahedral meshes. The performance of our implementation is first examined and analyzed with simulations of a simple model. We then demonstrate its application to real-world research by simulating the reaction-diffusion components of a published calcium burst model in both Purkinje neuron sub-branch and full dendrite morphologies. Simulation results indicate that our implementation is capable of achieving super-linear speedup for balanced loading simulations with reasonable molecule density and mesh quality. In the best scenario, a parallel simulation with 2,000 processes runs more than 3,600 times faster than its serial SSA counterpart, and achieves more than 20-fold speedup relative to parallel simulation with 100 processes. In a more realistic scenario with dynamic calcium influx and data recording, the parallel simulation with 1,000 processes and no load balancing is still 500 times faster than the conventional serial SSA simulation.
NASA Astrophysics Data System (ADS)
Jiang, Xikai; Li, Jiyuan; Zhao, Xujun; Qin, Jian; Karpeev, Dmitry; Hernandez-Ortiz, Juan; de Pablo, Juan J.; Heinonen, Olle
2016-08-01
Large classes of materials systems in physics and engineering are governed by magnetic and electrostatic interactions. Continuum or mesoscale descriptions of such systems can be cast in terms of integral equations, whose direct computational evaluation requires O(N2) operations, where N is the number of unknowns. Such a scaling, which arises from the many-body nature of the relevant Green's function, has precluded wide-spread adoption of integral methods for solution of large-scale scientific and engineering problems. In this work, a parallel computational approach is presented that relies on using scalable open source libraries and utilizes a kernel-independent Fast Multipole Method (FMM) to evaluate the integrals in O(N) operations, with O(N) memory cost, thereby substantially improving the scalability and efficiency of computational integral methods. We demonstrate the accuracy, efficiency, and scalability of our approach in the context of two examples. In the first, we solve a boundary value problem for a ferroelectric/ferromagnetic volume in free space. In the second, we solve an electrostatic problem involving polarizable dielectric bodies in an unbounded dielectric medium. The results from these test cases show that our proposed parallel approach, which is built on a kernel-independent FMM, can enable highly efficient and accurate simulations and allow for considerable flexibility in a broad range of applications.
Jiang, Xikai; Li, Jiyuan; Zhao, Xujun; ...
2016-08-10
Large classes of materials systems in physics and engineering are governed by magnetic and electrostatic interactions. Continuum or mesoscale descriptions of such systems can be cast in terms of integral equations, whose direct computational evaluation requires O( N 2) operations, where N is the number of unknowns. Such a scaling, which arises from the many-body nature of the relevant Green's function, has precluded wide-spread adoption of integral methods for solution of large-scale scientific and engineering problems. In this work, a parallel computational approach is presented that relies on using scalable open source libraries and utilizes a kernel-independent Fast Multipole Methodmore » (FMM) to evaluate the integrals in O( N) operations, with O( N) memory cost, thereby substantially improving the scalability and efficiency of computational integral methods. We demonstrate the accuracy, efficiency, and scalability of our approach in the context of two examples. In the first, we solve a boundary value problem for a ferroelectric/ferromagnetic volume in free space. In the second, we solve an electrostatic problem involving polarizable dielectric bodies in an unbounded dielectric medium. Lastly, the results from these test cases show that our proposed parallel approach, which is built on a kernel-independent FMM, can enable highly efficient and accurate simulations and allow for considerable flexibility in a broad range of applications.« less
Planetary-Scale Geospatial Data Analysis Techniques in Google's Earth Engine Platform (Invited)
NASA Astrophysics Data System (ADS)
Hancher, M.
2013-12-01
Geoscientists have more and more access to new tools for large-scale computing. With any tool, some tasks are easy and other tasks hard. It is natural to look to new computing platforms to increase the scale and efficiency of existing techniques, but there is a more exiting opportunity to discover and develop a new vocabulary of fundamental analysis idioms that are made easy and effective by these new tools. Google's Earth Engine platform is a cloud computing environment for earth data analysis that combines a public data catalog with a large-scale computational facility optimized for parallel processing of geospatial data. The data catalog includes a nearly complete archive of scenes from Landsat 4, 5, 7, and 8 that have been processed by the USGS, as well as a wide variety of other remotely-sensed and ancillary data products. Earth Engine supports a just-in-time computation model that enables real-time preview during algorithm development and debugging as well as during experimental data analysis and open-ended data exploration. Data processing operations are performed in parallel across many computers in Google's datacenters. The platform automatically handles many traditionally-onerous data management tasks, such as data format conversion, reprojection, resampling, and associating image metadata with pixel data. Early applications of Earth Engine have included the development of Google's global cloud-free fifteen-meter base map and global multi-decadal time-lapse animations, as well as numerous large and small experimental analyses by scientists from a range of academic, government, and non-governmental institutions, working in a wide variety of application areas including forestry, agriculture, urban mapping, and species habitat modeling. Patterns in the successes and failures of these early efforts have begun to emerge, sketching the outlines of a new set of simple and effective approaches to geospatial data analysis.
Quantum error correction in crossbar architectures
NASA Astrophysics Data System (ADS)
Helsen, Jonas; Steudtner, Mark; Veldhorst, Menno; Wehner, Stephanie
2018-07-01
A central challenge for the scaling of quantum computing systems is the need to control all qubits in the system without a large overhead. A solution for this problem in classical computing comes in the form of so-called crossbar architectures. Recently we made a proposal for a large-scale quantum processor (Li et al arXiv:1711.03807 (2017)) to be implemented in silicon quantum dots. This system features a crossbar control architecture which limits parallel single-qubit control, but allows the scheme to overcome control scaling issues that form a major hurdle to large-scale quantum computing systems. In this work, we develop a language that makes it possible to easily map quantum circuits to crossbar systems, taking into account their architecture and control limitations. Using this language we show how to map well known quantum error correction codes such as the planar surface and color codes in this limited control setting with only a small overhead in time. We analyze the logical error behavior of this surface code mapping for estimated experimental parameters of the crossbar system and conclude that logical error suppression to a level useful for real quantum computation is feasible.
Large-Scale Distributed Computational Fluid Dynamics on the Information Power Grid Using Globus
NASA Technical Reports Server (NTRS)
Barnard, Stephen; Biswas, Rupak; Saini, Subhash; VanderWijngaart, Robertus; Yarrow, Maurice; Zechtzer, Lou; Foster, Ian; Larsson, Olle
1999-01-01
This paper describes an experiment in which a large-scale scientific application development for tightly-coupled parallel machines is adapted to the distributed execution environment of the Information Power Grid (IPG). A brief overview of the IPG and a description of the computational fluid dynamics (CFD) algorithm are given. The Globus metacomputing toolkit is used as the enabling device for the geographically-distributed computation. Modifications related to latency hiding and Load balancing were required for an efficient implementation of the CFD application in the IPG environment. Performance results on a pair of SGI Origin 2000 machines indicate that real scientific applications can be effectively implemented on the IPG; however, a significant amount of continued effort is required to make such an environment useful and accessible to scientists and engineers.
Parallel computing in enterprise modeling.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.
2008-08-01
This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priorimore » ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.« less
Efficacy of the SU(3) scheme for ab initio large-scale calculations beyond the lightest nuclei
Dytrych, T.; Maris, P.; Launey, K. D.; ...
2016-06-22
We report on the computational characteristics of ab initio nuclear structure calculations in a symmetry-adapted no-core shell model (SA-NCSM) framework. We examine the computational complexity of the current implementation of the SA-NCSM approach, dubbed LSU3shell, by analyzing ab initio results for 6Li and 12C in large harmonic oscillator model spaces and SU3-selected subspaces. We demonstrate LSU3shell’s strong-scaling properties achieved with highly-parallel methods for computing the many-body matrix elements. Results compare favorably with complete model space calculations and significant memory savings are achieved in physically important applications. In particular, a well-chosen symmetry-adapted basis affords memory savings in calculations of states withmore » a fixed total angular momentum in large model spaces while exactly preserving translational invariance.« less
Design Aspects of the Rayleigh Convection Code
NASA Astrophysics Data System (ADS)
Featherstone, N. A.
2017-12-01
Understanding the long-term generation of planetary or stellar magnetic field requires complementary knowledge of the large-scale fluid dynamics pervading large fractions of the object's interior. Such large-scale motions are sensitive to the system's geometry which, in planets and stars, is spherical to a good approximation. As a result, computational models designed to study such systems often solve the MHD equations in spherical geometry, frequently employing a spectral approach involving spherical harmonics. We present computational and user-interface design aspects of one such modeling tool, the Rayleigh convection code, which is suitable for deployment on desktop and petascale-hpc architectures alike. In this poster, we will present an overview of this code's parallel design and its built-in diagnostics-output package. Rayleigh has been developed with NSF support through the Computational Infrastructure for Geodynamics and is expected to be released as open-source software in winter 2017/2018.
NASA Astrophysics Data System (ADS)
Takasaki, Koichi
This paper presents a program for the multidisciplinary optimization and identification problem of the nonlinear model of large aerospace vehicle structures. The program constructs the global matrix of the dynamic system in the time direction by the p-version finite element method (pFEM), and the basic matrix for each pFEM node in the time direction is described by a sparse matrix similarly to the static finite element problem. The algorithm used by the program does not require the Hessian matrix of the objective function and so has low memory requirements. It also has a relatively low computational cost, and is suited to parallel computation. The program was integrated as a solver module of the multidisciplinary analysis system CUMuLOUS (Computational Utility for Multidisciplinary Large scale Optimization of Undense System) which is under development by the Aerospace Research and Development Directorate (ARD) of the Japan Aerospace Exploration Agency (JAXA).
Efficacy of the SU(3) scheme for ab initio large-scale calculations beyond the lightest nuclei
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dytrych, T.; Maris, Pieter; Launey, K. D.
2016-06-09
We report on the computational characteristics of ab initio nuclear structure calculations in a symmetry-adapted no-core shell model (SA-NCSM) framework. We examine the computational complexity of the current implementation of the SA-NCSM approach, dubbed LSU3shell, by analyzing ab initio results for 6Li and 12C in large harmonic oscillator model spaces and SU(3)-selected subspaces. We demonstrate LSU3shell's strong-scaling properties achieved with highly-parallel methods for computing the many-body matrix elements. Results compare favorably with complete model space calculations and signi cant memory savings are achieved in physically important applications. In particular, a well-chosen symmetry-adapted basis a ords memory savings in calculations ofmore » states with a fixed total angular momentum in large model spaces while exactly preserving translational invariance.« less
Xyce Parallel Electronic Simulator Users' Guide Version 6.8
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows onemore » to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase$-$ a message passing parallel implementation $-$ which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator : users' guide.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.
2011-05-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-artmore » algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a unique electrical simulation capability, designed to meet the unique needs of the laboratory.« less
Large-Scale, Multi-Sensor Atmospheric Data Fusion Using Hybrid Cloud Computing
NASA Astrophysics Data System (ADS)
Wilson, Brian; Manipon, Gerald; Hua, Hook; Fetzer, Eric
2014-05-01
NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map-reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in a hybrid Cloud (private eucalyptus & public Amazon). Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Multi-year datasets are automatically "sharded" by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will present the architecture of SciReduce, describe the achieved "clock time" speedups in fusing datasets on our own nodes and in the Cloud, and discuss the Cloud cost tradeoffs for storage, compute, and data transfer. We will also present a concept and prototype for staging NASA's A-Train Atmospheric datasets (Levels 2 & 3) in the Amazon Cloud so that any number of compute jobs can be executed "near" the multi-sensor data. Given such a system, multi-sensor climate studies over 10-20 years of data could be perform
Wilhelm, Jan; Seewald, Patrick; Del Ben, Mauro; Hutter, Jürg
2016-12-13
We present an algorithm for computing the correlation energy in the random phase approximation (RPA) in a Gaussian basis requiring [Formula: see text] operations and [Formula: see text] memory. The method is based on the resolution of the identity (RI) with the overlap metric, a reformulation of RI-RPA in the Gaussian basis, imaginary time, and imaginary frequency integration techniques, and the use of sparse linear algebra. Additional memory reduction without extra computations can be achieved by an iterative scheme that overcomes the memory bottleneck of canonical RPA implementations. We report a massively parallel implementation that is the key for the application to large systems. Finally, cubic-scaling RPA is applied to a thousand water molecules using a correlation-consistent triple-ζ quality basis.
Azad, Ariful; Ouzounis, Christos A; Kyrpides, Nikos C; Buluç, Aydin
2018-01-01
Abstract Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. Here, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ∼70 million nodes with ∼68 billion edges in ∼2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license. PMID:29315405
Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.; ...
2018-01-05
Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times andmore » memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times andmore » memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.« less
Emerging Nanophotonic Applications Explored with Advanced Scientific Parallel Computing
NASA Astrophysics Data System (ADS)
Meng, Xiang
The domain of nanoscale optical science and technology is a combination of the classical world of electromagnetics and the quantum mechanical regime of atoms and molecules. Recent advancements in fabrication technology allows the optical structures to be scaled down to nanoscale size or even to the atomic level, which are far smaller than the wavelength they are designed for. These nanostructures can have unique, controllable, and tunable optical properties and their interactions with quantum materials can have important near-field and far-field optical response. Undoubtedly, these optical properties can have many important applications, ranging from the efficient and tunable light sources, detectors, filters, modulators, high-speed all-optical switches; to the next-generation classical and quantum computation, and biophotonic medical sensors. This emerging research of nanoscience, known as nanophotonics, is a highly interdisciplinary field requiring expertise in materials science, physics, electrical engineering, and scientific computing, modeling and simulation. It has also become an important research field for investigating the science and engineering of light-matter interactions that take place on wavelength and subwavelength scales where the nature of the nanostructured matter controls the interactions. In addition, the fast advancements in the computing capabilities, such as parallel computing, also become as a critical element for investigating advanced nanophotonic devices. This role has taken on even greater urgency with the scale-down of device dimensions, and the design for these devices require extensive memory and extremely long core hours. Thus distributed computing platforms associated with parallel computing are required for faster designs processes. Scientific parallel computing constructs mathematical models and quantitative analysis techniques, and uses the computing machines to analyze and solve otherwise intractable scientific challenges. In particular, parallel computing are forms of computation operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently. In this dissertation, we report a series of new nanophotonic developments using the advanced parallel computing techniques. The applications include the structure optimizations at the nanoscale to control both the electromagnetic response of materials, and to manipulate nanoscale structures for enhanced field concentration, which enable breakthroughs in imaging, sensing systems (chapter 3 and 4) and improve the spatial-temporal resolutions of spectroscopies (chapter 5). We also report the investigations on the confinement study of optical-matter interactions at the quantum mechanical regime, where the size-dependent novel properties enhanced a wide range of technologies from the tunable and efficient light sources, detectors, to other nanophotonic elements with enhanced functionality (chapter 6 and 7).
NASA Astrophysics Data System (ADS)
Furuichi, M.; Nishiura, D.
2015-12-01
Fully Lagrangian methods such as Smoothed Particle Hydrodynamics (SPH) and Discrete Element Method (DEM) have been widely used to solve the continuum and particles motions in the computational geodynamics field. These mesh-free methods are suitable for the problems with the complex geometry and boundary. In addition, their Lagrangian nature allows non-diffusive advection useful for tracking history dependent properties (e.g. rheology) of the material. These potential advantages over the mesh-based methods offer effective numerical applications to the geophysical flow and tectonic processes, which are for example, tsunami with free surface and floating body, magma intrusion with fracture of rock, and shear zone pattern generation of granular deformation. In order to investigate such geodynamical problems with the particle based methods, over millions to billion particles are required for the realistic simulation. Parallel computing is therefore important for handling such huge computational cost. An efficient parallel implementation of SPH and DEM methods is however known to be difficult especially for the distributed-memory architecture. Lagrangian methods inherently show workload imbalance problem for parallelization with the fixed domain in space, because particles move around and workloads change during the simulation. Therefore dynamic load balance is key technique to perform the large scale SPH and DEM simulation. In this work, we present the parallel implementation technique of SPH and DEM method utilizing dynamic load balancing algorithms toward the high resolution simulation over large domain using the massively parallel super computer system. Our method utilizes the imbalances of the executed time of each MPI process as the nonlinear term of parallel domain decomposition and minimizes them with the Newton like iteration method. In order to perform flexible domain decomposition in space, the slice-grid algorithm is used. Numerical tests show that our approach is suitable for solving the particles with different calculation costs (e.g. boundary particles) as well as the heterogeneous computer architecture. We analyze the parallel efficiency and scalability on the super computer systems (K-computer, Earth simulator 3, etc.).
Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
2016-07-19
Computing alignments between two or more sequences are common operations frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
The force on the flex: Global parallelism and portability
NASA Technical Reports Server (NTRS)
Jordan, H. F.
1986-01-01
A parallel programming methodology, called the force, supports the construction of programs to be executed in parallel by an unspecified, but potentially large, number of processes. The methodology was originally developed on a pipelined, shared memory multiprocessor, the Denelcor HEP, and embodies the primitive operations of the force in a set of macros which expand into multiprocessor Fortran code. A small set of primitives is sufficient to write large parallel programs, and the system has been used to produce 10,000 line programs in computational fluid dynamics. The level of complexity of the force primitives is intermediate. It is high enough to mask detailed architectural differences between multiprocessors but low enough to give the user control over performance. The system is being ported to a medium scale multiprocessor, the Flex/32, which is a 20 processor system with a mixture of shared and local memory. Memory organization and the type of processor synchronization supported by the hardware on the two machines lead to some differences in efficient implementations of the force primitives, but the user interface remains the same. An initial implementation was done by retargeting the macros to Flexible Computer Corporation's ConCurrent C language. Subsequently, the macros were caused to directly produce the system calls which form the basis for ConCurrent C. The implementation of the Fortran based system is in step with Flexible Computer Corporations's implementation of a Fortran system in the parallel environment.
Transport of cosmic-ray protons in intermittent heliospheric turbulence: Model and simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Alouani-Bibi, Fathallah; Le Roux, Jakobus A., E-mail: fb0006@uah.edu
The transport of charged energetic particles in the presence of strong intermittent heliospheric turbulence is computationally analyzed based on known properties of the interplanetary magnetic field and solar wind plasma at 1 astronomical unit. The turbulence is assumed to be static, composite, and quasi-three-dimensional with a varying energy distribution between a one-dimensional Alfvénic (slab) and a structured two-dimensional component. The spatial fluctuations of the turbulent magnetic field are modeled either as homogeneous with a Gaussian probability distribution function (PDF), or as intermittent on large and small scales with a q-Gaussian PDF. Simulations showed that energetic particle diffusion coefficients both parallelmore » and perpendicular to the background magnetic field are significantly affected by intermittency in the turbulence. This effect is especially strong for parallel transport where for large-scale intermittency results show an extended phase of subdiffusive parallel transport during which cross-field transport diffusion dominates. The effects of intermittency are found to depend on particle rigidity and the fraction of slab energy in the turbulence, yielding a perpendicular to parallel mean free path ratio close to 1 for large-scale intermittency. Investigation of higher order transport moments (kurtosis) indicates that non-Gaussian statistical properties of the intermittent turbulent magnetic field are present in the parallel transport, especially for low rigidity particles at all times.« less
Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver
NASA Technical Reports Server (NTRS)
Mavriplis, Dimitri J.
1998-01-01
A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.
Hi-Corrector: a fast, scalable and memory-efficient package for normalizing large-scale Hi-C data.
Li, Wenyuan; Gong, Ke; Li, Qingjiao; Alber, Frank; Zhou, Xianghong Jasmine
2015-03-15
Genome-wide proximity ligation assays, e.g. Hi-C and its variant TCC, have recently become important tools to study spatial genome organization. Removing biases from chromatin contact matrices generated by such techniques is a critical preprocessing step of subsequent analyses. The continuing decline of sequencing costs has led to an ever-improving resolution of the Hi-C data, resulting in very large matrices of chromatin contacts. Such large-size matrices, however, pose a great challenge on the memory usage and speed of its normalization. Therefore, there is an urgent need for fast and memory-efficient methods for normalization of Hi-C data. We developed Hi-Corrector, an easy-to-use, open source implementation of the Hi-C data normalization algorithm. Its salient features are (i) scalability-the software is capable of normalizing Hi-C data of any size in reasonable times; (ii) memory efficiency-the sequential version can run on any single computer with very limited memory, no matter how little; (iii) fast speed-the parallel version can run very fast on multiple computing nodes with limited local memory. The sequential version is implemented in ANSI C and can be easily compiled on any system; the parallel version is implemented in ANSI C with the MPI library (a standardized and portable parallel environment designed for solving large-scale scientific problems). The package is freely available at http://zhoulab.usc.edu/Hi-Corrector/. © The Author 2014. Published by Oxford University Press.
Scaling Optimization of the SIESTA MHD Code
NASA Astrophysics Data System (ADS)
Seal, Sudip; Hirshman, Steven; Perumalla, Kalyan
2013-10-01
SIESTA is a parallel three-dimensional plasma equilibrium code capable of resolving magnetic islands at high spatial resolutions for toroidal plasmas. Originally designed to exploit small-scale parallelism, SIESTA has now been scaled to execute efficiently over several thousands of processors P. This scaling improvement was accomplished with minimal intrusion to the execution flow of the original version. First, the efficiency of the iterative solutions was improved by integrating the parallel tridiagonal block solver code BCYCLIC. Krylov-space generation in GMRES was then accelerated using a customized parallel matrix-vector multiplication algorithm. Novel parallel Hessian generation algorithms were integrated and memory access latencies were dramatically reduced through loop nest optimizations and data layout rearrangement. These optimizations sped up equilibria calculations by factors of 30-50. It is possible to compute solutions with granularity N/P near unity on extremely fine radial meshes (N > 1024 points). Grid separation in SIESTA, which manifests itself primarily in the resonant components of the pressure far from rational surfaces, is strongly suppressed by finer meshes. Large problem sizes of up to 300 K simultaneous non-linear coupled equations have been solved on the NERSC supercomputers. Work supported by U.S. DOE under Contract DE-AC05-00OR22725 with UT-Battelle, LLC.
Vectorial finite elements for solving the radiative transfer equation
NASA Astrophysics Data System (ADS)
Badri, M. A.; Jolivet, P.; Rousseau, B.; Le Corre, S.; Digonnet, H.; Favennec, Y.
2018-06-01
The discrete ordinate method coupled with the finite element method is often used for the spatio-angular discretization of the radiative transfer equation. In this paper we attempt to improve upon such a discretization technique. Instead of using standard finite elements, we reformulate the radiative transfer equation using vectorial finite elements. In comparison to standard finite elements, this reformulation yields faster timings for the linear system assemblies, as well as for the solution phase when using scattering media. The proposed vectorial finite element discretization for solving the radiative transfer equation is cross-validated against a benchmark problem available in literature. In addition, we have used the method of manufactured solutions to verify the order of accuracy for our discretization technique within different absorbing, scattering, and emitting media. For solving large problems of radiation on parallel computers, the vectorial finite element method is parallelized using domain decomposition. The proposed domain decomposition method scales on large number of processes, and its performance is unaffected by the changes in optical thickness of the medium. Our parallel solver is used to solve a large scale radiative transfer problem of the Kelvin-cell radiation.
Parallel group independent component analysis for massive fMRI data sets.
Chen, Shaojie; Huang, Lei; Qiu, Huitong; Nebel, Mary Beth; Mostofsky, Stewart H; Pekar, James J; Lindquist, Martin A; Eloyan, Ani; Caffo, Brian S
2017-01-01
Independent component analysis (ICA) is widely used in the field of functional neuroimaging to decompose data into spatio-temporal patterns of co-activation. In particular, ICA has found wide usage in the analysis of resting state fMRI (rs-fMRI) data. Recently, a number of large-scale data sets have become publicly available that consist of rs-fMRI scans from thousands of subjects. As a result, efficient ICA algorithms that scale well to the increased number of subjects are required. To address this problem, we propose a two-stage likelihood-based algorithm for performing group ICA, which we denote Parallel Group Independent Component Analysis (PGICA). By utilizing the sequential nature of the algorithm and parallel computing techniques, we are able to efficiently analyze data sets from large numbers of subjects. We illustrate the efficacy of PGICA, which has been implemented in R and is freely available through the Comprehensive R Archive Network, through simulation studies and application to rs-fMRI data from two large multi-subject data sets, consisting of 301 and 779 subjects respectively.
Automating the selection of standard parallels for conic map projections
NASA Astrophysics Data System (ADS)
Šavriǒ, Bojan; Jenny, Bernhard
2016-05-01
Conic map projections are appropriate for mapping regions at medium and large scales with east-west extents at intermediate latitudes. Conic projections are appropriate for these cases because they show the mapped area with less distortion than other projections. In order to minimize the distortion of the mapped area, the two standard parallels of conic projections need to be selected carefully. Rules of thumb exist for placing the standard parallels based on the width-to-height ratio of the map. These rules of thumb are simple to apply, but do not result in maps with minimum distortion. There also exist more sophisticated methods that determine standard parallels such that distortion in the mapped area is minimized. These methods are computationally expensive and cannot be used for real-time web mapping and GIS applications where the projection is adjusted automatically to the displayed area. This article presents a polynomial model that quickly provides the standard parallels for the three most common conic map projections: the Albers equal-area, the Lambert conformal, and the equidistant conic projection. The model defines the standard parallels with polynomial expressions based on the spatial extent of the mapped area. The spatial extent is defined by the length of the mapped central meridian segment, the central latitude of the displayed area, and the width-to-height ratio of the map. The polynomial model was derived from 3825 maps-each with a different spatial extent and computationally determined standard parallels that minimize the mean scale distortion index. The resulting model is computationally simple and can be used for the automatic selection of the standard parallels of conic map projections in GIS software and web mapping applications.
Drawert, Brian; Trogdon, Michael; Toor, Salman; Petzold, Linda; Hellander, Andreas
2016-01-01
Computational experiments using spatial stochastic simulations have led to important new biological insights, but they require specialized tools and a complex software stack, as well as large and scalable compute and data analysis resources due to the large computational cost associated with Monte Carlo computational workflows. The complexity of setting up and managing a large-scale distributed computation environment to support productive and reproducible modeling can be prohibitive for practitioners in systems biology. This results in a barrier to the adoption of spatial stochastic simulation tools, effectively limiting the type of biological questions addressed by quantitative modeling. In this paper, we present PyURDME, a new, user-friendly spatial modeling and simulation package, and MOLNs, a cloud computing appliance for distributed simulation of stochastic reaction-diffusion models. MOLNs is based on IPython and provides an interactive programming platform for development of sharable and reproducible distributed parallel computational experiments.
NASA Technical Reports Server (NTRS)
Lyster, P. M.; Liewer, P. C.; Decyk, V. K.; Ferraro, R. D.
1995-01-01
A three-dimensional electrostatic particle-in-cell (PIC) plasma simulation code has been developed on coarse-grain distributed-memory massively parallel computers with message passing communications. Our implementation is the generalization to three-dimensions of the general concurrent particle-in-cell (GCPIC) algorithm. In the GCPIC algorithm, the particle computation is divided among the processors using a domain decomposition of the simulation domain. In a three-dimensional simulation, the domain can be partitioned into one-, two-, or three-dimensional subdomains ("slabs," "rods," or "cubes") and we investigate the efficiency of the parallel implementation of the push for all three choices. The present implementation runs on the Intel Touchstone Delta machine at Caltech; a multiple-instruction-multiple-data (MIMD) parallel computer with 512 nodes. We find that the parallel efficiency of the push is very high, with the ratio of communication to computation time in the range 0.3%-10.0%. The highest efficiency (> 99%) occurs for a large, scaled problem with 64(sup 3) particles per processing node (approximately 134 million particles of 512 nodes) which has a push time of about 250 ns per particle per time step. We have also developed expressions for the timing of the code which are a function of both code parameters (number of grid points, particles, etc.) and machine-dependent parameters (effective FLOP rate, and the effective interprocessor bandwidths for the communication of particles and grid points). These expressions can be used to estimate the performance of scaled problems--including those with inhomogeneous plasmas--to other parallel machines once the machine-dependent parameters are known.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lin, Youzuo; O'Malley, Daniel; Vesselinov, Velimir V.
Inverse modeling seeks model parameters given a set of observations. However, for practical problems because the number of measurements is often large and the model parameters are also numerous, conventional methods for inverse modeling can be computationally expensive. We have developed a new, computationally-efficient parallel Levenberg-Marquardt method for solving inverse modeling problems with a highly parameterized model space. Levenberg-Marquardt methods require the solution of a linear system of equations which can be prohibitively expensive to compute for moderate to large-scale problems. Our novel method projects the original linear problem down to a Krylov subspace, such that the dimensionality of themore » problem can be significantly reduced. Furthermore, we store the Krylov subspace computed when using the first damping parameter and recycle the subspace for the subsequent damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved using these computational techniques. We apply this new inverse modeling method to invert for random transmissivity fields in 2D and a random hydraulic conductivity field in 3D. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) in the model domain. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). By comparing with Levenberg-Marquardt methods using standard linear inversion techniques such as QR or SVD methods, our Levenberg-Marquardt method yields a speed-up ratio on the order of ~10 1 to ~10 2 in a multi-core computational environment. Furthermore, our new inverse modeling method is a powerful tool for characterizing subsurface heterogeneity for moderate- to large-scale problems.« less
Lin, Youzuo; O'Malley, Daniel; Vesselinov, Velimir V.
2016-09-01
Inverse modeling seeks model parameters given a set of observations. However, for practical problems because the number of measurements is often large and the model parameters are also numerous, conventional methods for inverse modeling can be computationally expensive. We have developed a new, computationally-efficient parallel Levenberg-Marquardt method for solving inverse modeling problems with a highly parameterized model space. Levenberg-Marquardt methods require the solution of a linear system of equations which can be prohibitively expensive to compute for moderate to large-scale problems. Our novel method projects the original linear problem down to a Krylov subspace, such that the dimensionality of themore » problem can be significantly reduced. Furthermore, we store the Krylov subspace computed when using the first damping parameter and recycle the subspace for the subsequent damping parameters. The efficiency of our new inverse modeling algorithm is significantly improved using these computational techniques. We apply this new inverse modeling method to invert for random transmissivity fields in 2D and a random hydraulic conductivity field in 3D. Our algorithm is fast enough to solve for the distributed model parameters (transmissivity) in the model domain. The algorithm is coded in Julia and implemented in the MADS computational framework (http://mads.lanl.gov). By comparing with Levenberg-Marquardt methods using standard linear inversion techniques such as QR or SVD methods, our Levenberg-Marquardt method yields a speed-up ratio on the order of ~10 1 to ~10 2 in a multi-core computational environment. Furthermore, our new inverse modeling method is a powerful tool for characterizing subsurface heterogeneity for moderate- to large-scale problems.« less
A class of hybrid finite element methods for electromagnetics: A review
NASA Technical Reports Server (NTRS)
Volakis, J. L.; Chatterjee, A.; Gong, J.
1993-01-01
Integral equation methods have generally been the workhorse for antenna and scattering computations. In the case of antennas, they continue to be the prominent computational approach, but for scattering applications the requirement for large-scale computations has turned researchers' attention to near neighbor methods such as the finite element method, which has low O(N) storage requirements and is readily adaptable in modeling complex geometrical features and material inhomogeneities. In this paper, we review three hybrid finite element methods for simulating composite scatterers, conformal microstrip antennas, and finite periodic arrays. Specifically, we discuss the finite element method and its application to electromagnetic problems when combined with the boundary integral, absorbing boundary conditions, and artificial absorbers for terminating the mesh. Particular attention is given to large-scale simulations, methods, and solvers for achieving low memory requirements and code performance on parallel computing architectures.
Advanced computational simulations of water waves interacting with wave energy converters
NASA Astrophysics Data System (ADS)
Pathak, Ashish; Freniere, Cole; Raessi, Mehdi
2017-03-01
Wave energy converter (WEC) devices harness the renewable ocean wave energy and convert it into useful forms of energy, e.g. mechanical or electrical. This paper presents an advanced 3D computational framework to study the interaction between water waves and WEC devices. The computational tool solves the full Navier-Stokes equations and considers all important effects impacting the device performance. To enable large-scale simulations in fast turnaround times, the computational solver was developed in an MPI parallel framework. A fast multigrid preconditioned solver is introduced to solve the computationally expensive pressure Poisson equation. The computational solver was applied to two surface-piercing WEC geometries: bottom-hinged cylinder and flap. Their numerically simulated response was validated against experimental data. Additional simulations were conducted to investigate the applicability of Froude scaling in predicting full-scale WEC response from the model experiments.
National Laboratory for Advanced Scientific Visualization at UNAM - Mexico
NASA Astrophysics Data System (ADS)
Manea, Marina; Constantin Manea, Vlad; Varela, Alfredo
2016-04-01
In 2015, the National Autonomous University of Mexico (UNAM) joined the family of Universities and Research Centers where advanced visualization and computing plays a key role to promote and advance missions in research, education, community outreach, as well as business-oriented consulting. This initiative provides access to a great variety of advanced hardware and software resources and offers a range of consulting services that spans a variety of areas related to scientific visualization, among which are: neuroanatomy, embryonic development, genome related studies, geosciences, geography, physics and mathematics related disciplines. The National Laboratory for Advanced Scientific Visualization delivers services through three main infrastructure environments: the 3D fully immersive display system Cave, the high resolution parallel visualization system Powerwall, the high resolution spherical displays Earth Simulator. The entire visualization infrastructure is interconnected to a high-performance-computing-cluster (HPCC) called ADA in honor to Ada Lovelace, considered to be the first computer programmer. The Cave is an extra large 3.6m wide room with projected images on the front, left and right, as well as floor walls. Specialized crystal eyes LCD-shutter glasses provide a strong stereo depth perception, and a variety of tracking devices allow software to track the position of a user's hand, head and wand. The Powerwall is designed to bring large amounts of complex data together through parallel computing for team interaction and collaboration. This system is composed by 24 (6x4) high-resolution ultra-thin (2 mm) bezel monitors connected to a high-performance GPU cluster. The Earth Simulator is a large (60") high-resolution spherical display used for global-scale data visualization like geophysical, meteorological, climate and ecology data. The HPCC-ADA, is a 1000+ computing core system, which offers parallel computing resources to applications that requires large quantity of memory as well as large and fast parallel storage systems. The entire system temperature is controlled by an energy and space efficient cooling solution, based on large rear door liquid cooled heat exchangers. This state-of-the-art infrastructure will boost research activities in the region, offer a powerful scientific tool for teaching at undergraduate and graduate levels, and enhance association and cooperation with business-oriented organizations.
Kepper, Nick; Ettig, Ramona; Dickmann, Frank; Stehr, Rene; Grosveld, Frank G; Wedemann, Gero; Knoch, Tobias A
2010-01-01
Especially in the life-science and the health-care sectors the huge IT requirements are imminent due to the large and complex systems to be analysed and simulated. Grid infrastructures play here a rapidly increasing role for research, diagnostics, and treatment, since they provide the necessary large-scale resources efficiently. Whereas grids were first used for huge number crunching of trivially parallelizable problems, increasingly parallel high-performance computing is required. Here, we show for the prime example of molecular dynamic simulations how the presence of large grid clusters including very fast network interconnects within grid infrastructures allows now parallel high-performance grid computing efficiently and thus combines the benefits of dedicated super-computing centres and grid infrastructures. The demands for this service class are the highest since the user group has very heterogeneous requirements: i) two to many thousands of CPUs, ii) different memory architectures, iii) huge storage capabilities, and iv) fast communication via network interconnects, are all needed in different combinations and must be considered in a highly dedicated manner to reach highest performance efficiency. Beyond, advanced and dedicated i) interaction with users, ii) the management of jobs, iii) accounting, and iv) billing, not only combines classic with parallel high-performance grid usage, but more importantly is also able to increase the efficiency of IT resource providers. Consequently, the mere "yes-we-can" becomes a huge opportunity like e.g. the life-science and health-care sectors as well as grid infrastructures by reaching higher level of resource efficiency.
Scalable domain decomposition solvers for stochastic PDEs in high performance computing
Desai, Ajit; Khalil, Mohammad; Pettit, Chris; ...
2017-09-21
Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Scalable domain decomposition solvers for stochastic PDEs in high performance computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Desai, Ajit; Khalil, Mohammad; Pettit, Chris
Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Yang, L. H.; Brooks III, E. D.; Belak, J.
1992-01-01
A molecular dynamics algorithm for performing large-scale simulations using the Parallel C Preprocessor (PCP) programming paradigm on the BBN TC2000, a massively parallel computer, is discussed. The algorithm uses a linked-cell data structure to obtain the near neighbors of each atom as time evoles. Each processor is assigned to a geometric domain containing many subcells and the storage for that domain is private to the processor. Within this scheme, the interdomain (i.e., interprocessor) communication is minimized.
NASA Technical Reports Server (NTRS)
Schreiber, Robert; Simon, Horst D.
1992-01-01
We are surveying current projects in the area of parallel supercomputers. The machines considered here will become commercially available in the 1990 - 1992 time frame. All are suitable for exploring the critical issues in applying parallel processors to large scale scientific computations, in particular CFD calculations. This chapter presents an overview of the surveyed machines, and a detailed analysis of the various architectural and technology approaches taken. Particular emphasis is placed on the feasibility of a Teraflops capability following the paths proposed by various developers.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bennett, Janine Camille; Thompson, David; Pebay, Philippe Pierre
Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy,more » and {chi}{sup 2} independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics (which we discussed in [1]) where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.« less
Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che
2014-01-16
To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high quality solutions can be obtained within relatively short time. This integrated approach is a promising way for inferring large networks.
2014-01-01
Background To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high quality solutions can be obtained within relatively short time. This integrated approach is a promising way for inferring large networks. PMID:24428926
Parallel computing method for simulating hydrological processesof large rivers under climate change
NASA Astrophysics Data System (ADS)
Wang, H.; Chen, Y.
2016-12-01
Climate change is one of the proverbial global environmental problems in the world.Climate change has altered the watershed hydrological processes in time and space distribution, especially in worldlarge rivers.Watershed hydrological process simulation based on physically based distributed hydrological model can could have better results compared with the lumped models.However, watershed hydrological process simulation includes large amount of calculations, especially in large rivers, thus needing huge computing resources that may not be steadily available for the researchers or at high expense, this seriously restricted the research and application. To solve this problem, the current parallel method are mostly parallel computing in space and time dimensions.They calculate the natural features orderly thatbased on distributed hydrological model by grid (unit, a basin) from upstream to downstream.This articleproposes ahigh-performancecomputing method of hydrological process simulation with high speedratio and parallel efficiency.It combinedthe runoff characteristics of time and space of distributed hydrological model withthe methods adopting distributed data storage, memory database, distributed computing, parallel computing based on computing power unit.The method has strong adaptability and extensibility,which means it canmake full use of the computing and storage resources under the condition of limited computing resources, and the computing efficiency can be improved linearly with the increase of computing resources .This method can satisfy the parallel computing requirements ofhydrological process simulation in small, medium and large rivers.
NASA Astrophysics Data System (ADS)
Vnukov, A. A.; Shershnev, M. B.
2018-01-01
The aim of this work is the software implementation of three image scaling algorithms using parallel computations, as well as the development of an application with a graphical user interface for the Windows operating system to demonstrate the operation of algorithms and to study the relationship between system performance, algorithm execution time and the degree of parallelization of computations. Three methods of interpolation were studied, formalized and adapted to scale images. The result of the work is a program for scaling images by different methods. Comparison of the quality of scaling by different methods is given.
MOOSE: A PARALLEL COMPUTATIONAL FRAMEWORK FOR COUPLED SYSTEMS OF NONLINEAR EQUATIONS.
DOE Office of Scientific and Technical Information (OSTI.GOV)
G. Hansen; C. Newman; D. Gaston
Systems of coupled, nonlinear partial di?erential equations often arise in sim- ulation of nuclear processes. MOOSE: Multiphysics Ob ject Oriented Simulation Environment, a parallel computational framework targeted at solving these systems is presented. As opposed to traditional data / ?ow oriented com- putational frameworks, MOOSE is instead founded on mathematics based on Jacobian-free Newton Krylov (JFNK). Utilizing the mathematical structure present in JFNK, physics are modularized into “Kernels” allowing for rapid production of new simulation tools. In addition, systems are solved fully cou- pled and fully implicit employing physics based preconditioning allowing for a large amount of ?exibility even withmore » large variance in time scales. Background on the mathematics, an inspection of the structure of MOOSE and several rep- resentative solutions from applications built on the framework are presented.« less
Argonne simulation framework for intelligent transportation systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ewing, T.; Doss, E.; Hanebutte, U.
1996-04-01
A simulation framework has been developed which defines a high-level architecture for a large-scale, comprehensive, scalable simulation of an Intelligent Transportation System (ITS). The simulator is designed to run on parallel computers and distributed (networked) computer systems; however, a version for a stand alone workstation is also available. The ITS simulator includes an Expert Driver Model (EDM) of instrumented ``smart`` vehicles with in-vehicle navigation units. The EDM is capable of performing optimal route planning and communicating with Traffic Management Centers (TMC). A dynamic road map data base is sued for optimum route planning, where the data is updated periodically tomore » reflect any changes in road or weather conditions. The TMC has probe vehicle tracking capabilities (display position and attributes of instrumented vehicles), and can provide 2-way interaction with traffic to provide advisories and link times. Both the in-vehicle navigation module and the TMC feature detailed graphical user interfaces that includes human-factors studies to support safety and operational research. Realistic modeling of variations of the posted driving speed are based on human factor studies that take into consideration weather, road conditions, driver`s personality and behavior and vehicle type. The simulator has been developed on a distributed system of networked UNIX computers, but is designed to run on ANL`s IBM SP-X parallel computer system for large scale problems. A novel feature of the developed simulator is that vehicles will be represented by autonomous computer processes, each with a behavior model which performs independent route selection and reacts to external traffic events much like real vehicles. Vehicle processes interact with each other and with ITS components by exchanging messages. With this approach, one will be able to take advantage of emerging massively parallel processor (MPP) systems.« less
Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments
Yim, Won Cheol; Cushman, John C.
2017-07-22
Bioinformatics is currently faced with very large-scale data sets that lead to computational jobs, especially sequence similarity searches, that can take absurdly long times to run. For example, the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST and BLAST+) suite, which is by far the most widely used tool for rapid similarity searching among nucleic acid or amino acid sequences, is highly central processing unit (CPU) intensive. While the BLAST suite of programs perform searches very rapidly, they have the potential to be accelerated. In recent years, distributed computing environments have become more widely accessible andmore » used due to the increasing availability of high-performance computing (HPC) systems. Therefore, simple solutions for data parallelization are needed to expedite BLAST and other sequence analysis tools. However, existing software for parallel sequence similarity searches often requires extensive computational experience and skill on the part of the user. In order to accelerate BLAST and other sequence analysis tools, Divide and Conquer BLAST (DCBLAST) was developed to perform NCBI BLAST searches within a cluster, grid, or HPC environment by using a query sequence distribution approach. Scaling from one (1) to 256 CPU cores resulted in significant improvements in processing speed. Thus, DCBLAST dramatically accelerates the execution of BLAST searches using a simple, accessible, robust, and parallel approach. DCBLAST works across multiple nodes automatically and it overcomes the speed limitation of single-node BLAST programs. DCBLAST can be used on any HPC system, can take advantage of hundreds of nodes, and has no output limitations. Thus, this freely available tool simplifies distributed computation pipelines to facilitate the rapid discovery of sequence similarities between very large data sets.« less
Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yim, Won Cheol; Cushman, John C.
Bioinformatics is currently faced with very large-scale data sets that lead to computational jobs, especially sequence similarity searches, that can take absurdly long times to run. For example, the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST and BLAST+) suite, which is by far the most widely used tool for rapid similarity searching among nucleic acid or amino acid sequences, is highly central processing unit (CPU) intensive. While the BLAST suite of programs perform searches very rapidly, they have the potential to be accelerated. In recent years, distributed computing environments have become more widely accessible andmore » used due to the increasing availability of high-performance computing (HPC) systems. Therefore, simple solutions for data parallelization are needed to expedite BLAST and other sequence analysis tools. However, existing software for parallel sequence similarity searches often requires extensive computational experience and skill on the part of the user. In order to accelerate BLAST and other sequence analysis tools, Divide and Conquer BLAST (DCBLAST) was developed to perform NCBI BLAST searches within a cluster, grid, or HPC environment by using a query sequence distribution approach. Scaling from one (1) to 256 CPU cores resulted in significant improvements in processing speed. Thus, DCBLAST dramatically accelerates the execution of BLAST searches using a simple, accessible, robust, and parallel approach. DCBLAST works across multiple nodes automatically and it overcomes the speed limitation of single-node BLAST programs. DCBLAST can be used on any HPC system, can take advantage of hundreds of nodes, and has no output limitations. Thus, this freely available tool simplifies distributed computation pipelines to facilitate the rapid discovery of sequence similarities between very large data sets.« less
GATECloud.net: a platform for large-scale, open-source text processing on the cloud.
Tablan, Valentin; Roberts, Ian; Cunningham, Hamish; Bontcheva, Kalina
2013-01-28
Cloud computing is increasingly being regarded as a key enabler of the 'democratization of science', because on-demand, highly scalable cloud computing facilities enable researchers anywhere to carry out data-intensive experiments. In the context of natural language processing (NLP), algorithms tend to be complex, which makes their parallelization and deployment on cloud platforms a non-trivial task. This study presents a new, unique, cloud-based platform for large-scale NLP research--GATECloud. net. It enables researchers to carry out data-intensive NLP experiments by harnessing the vast, on-demand compute power of the Amazon cloud. Important infrastructural issues are dealt with by the platform, completely transparently for the researcher: load balancing, efficient data upload and storage, deployment on the virtual machines, security and fault tolerance. We also include a cost-benefit analysis and usage evaluation.
A parallel-vector algorithm for rapid structural analysis on high-performance computers
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.; Nguyen, Duc T.; Agarwal, Tarun K.
1990-01-01
A fast, accurate Choleski method for the solution of symmetric systems of linear equations is presented. This direct method is based on a variable-band storage scheme and takes advantage of column heights to reduce the number of operations in the Choleski factorization. The method employs parallel computation in the outermost DO-loop and vector computation via the 'loop unrolling' technique in the innermost DO-loop. The method avoids computations with zeros outside the column heights, and as an option, zeros inside the band. The close relationship between Choleski and Gauss elimination methods is examined. The minor changes required to convert the Choleski code to a Gauss code to solve non-positive-definite symmetric systems of equations are identified. The results for two large-scale structural analyses performed on supercomputers, demonstrate the accuracy and speed of the method.
A parallel-vector algorithm for rapid structural analysis on high-performance computers
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.; Nguyen, Duc T.; Agarwal, Tarun K.
1990-01-01
A fast, accurate Choleski method for the solution of symmetric systems of linear equations is presented. This direct method is based on a variable-band storage scheme and takes advantage of column heights to reduce the number of operations in the Choleski factorization. The method employs parallel computation in the outermost DO-loop and vector computation via the loop unrolling technique in the innermost DO-loop. The method avoids computations with zeros outside the column heights, and as an option, zeros inside the band. The close relationship between Choleski and Gauss elimination methods is examined. The minor changes required to convert the Choleski code to a Gauss code to solve non-positive-definite symmetric systems of equations are identified. The results for two large scale structural analyses performed on supercomputers, demonstrate the accuracy and speed of the method.
Hyperswitch communication network
NASA Technical Reports Server (NTRS)
Peterson, J.; Pniel, M.; Upchurch, E.
1991-01-01
The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus based multiprocessors. The design of the HCN operating system is a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state of the art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed.
Pushing configuration-interaction to the limit: Towards massively parallel MCSCF calculations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vogiatzis, Konstantinos D.; Ma, Dongxia; Olsen, Jeppe
A new large-scale parallel multiconfigurational self-consistent field (MCSCF) implementation in the open-source NWChem computational chemistry code is presented. The generalized active space approach is used to partition large configuration interaction (CI) vectors and generate a sufficient number of batches that can be distributed to the available cores. Massively parallel CI calculations with large active spaces can be performed. The new parallel MCSCF implementation is tested for the chromium trimer and for an active space of 20 electrons in 20 orbitals, which can now routinely be performed. Unprecedented CI calculations with an active space of 22 electrons in 22 orbitals formore » the pentacene systems were performed and a single CI iteration calculation with an active space of 24 electrons in 24 orbitals for the chromium tetramer was possible. In conclusion, the chromium tetramer corresponds to a CI expansion of one trillion Slater determinants (914 058 513 424) and is the largest conventional CI calculation attempted up to date.« less
Pushing configuration-interaction to the limit: Towards massively parallel MCSCF calculations
Vogiatzis, Konstantinos D.; Ma, Dongxia; Olsen, Jeppe; ...
2017-11-14
A new large-scale parallel multiconfigurational self-consistent field (MCSCF) implementation in the open-source NWChem computational chemistry code is presented. The generalized active space approach is used to partition large configuration interaction (CI) vectors and generate a sufficient number of batches that can be distributed to the available cores. Massively parallel CI calculations with large active spaces can be performed. The new parallel MCSCF implementation is tested for the chromium trimer and for an active space of 20 electrons in 20 orbitals, which can now routinely be performed. Unprecedented CI calculations with an active space of 22 electrons in 22 orbitals formore » the pentacene systems were performed and a single CI iteration calculation with an active space of 24 electrons in 24 orbitals for the chromium tetramer was possible. In conclusion, the chromium tetramer corresponds to a CI expansion of one trillion Slater determinants (914 058 513 424) and is the largest conventional CI calculation attempted up to date.« less
Pushing configuration-interaction to the limit: Towards massively parallel MCSCF calculations
NASA Astrophysics Data System (ADS)
Vogiatzis, Konstantinos D.; Ma, Dongxia; Olsen, Jeppe; Gagliardi, Laura; de Jong, Wibe A.
2017-11-01
A new large-scale parallel multiconfigurational self-consistent field (MCSCF) implementation in the open-source NWChem computational chemistry code is presented. The generalized active space approach is used to partition large configuration interaction (CI) vectors and generate a sufficient number of batches that can be distributed to the available cores. Massively parallel CI calculations with large active spaces can be performed. The new parallel MCSCF implementation is tested for the chromium trimer and for an active space of 20 electrons in 20 orbitals, which can now routinely be performed. Unprecedented CI calculations with an active space of 22 electrons in 22 orbitals for the pentacene systems were performed and a single CI iteration calculation with an active space of 24 electrons in 24 orbitals for the chromium tetramer was possible. The chromium tetramer corresponds to a CI expansion of one trillion Slater determinants (914 058 513 424) and is the largest conventional CI calculation attempted up to date.
Airbreathing Propulsion System Analysis Using Multithreaded Parallel Processing
NASA Technical Reports Server (NTRS)
Schunk, Richard Gregory; Chung, T. J.; Rodriguez, Pete (Technical Monitor)
2000-01-01
In this paper, parallel processing is used to analyze the mixing, and combustion behavior of hypersonic flow. Preliminary work for a sonic transverse hydrogen jet injected from a slot into a Mach 4 airstream in a two-dimensional duct combustor has been completed [Moon and Chung, 1996]. Our aim is to extend this work to three-dimensional domain using multithreaded domain decomposition parallel processing based on the flowfield-dependent variation theory. Numerical simulations of chemically reacting flows are difficult because of the strong interactions between the turbulent hydrodynamic and chemical processes. The algorithm must provide an accurate representation of the flowfield, since unphysical flowfield calculations will lead to the faulty loss or creation of species mass fraction, or even premature ignition, which in turn alters the flowfield information. Another difficulty arises from the disparity in time scales between the flowfield and chemical reactions, which may require the use of finite rate chemistry. The situations are more complex when there is a disparity in length scales involved in turbulence. In order to cope with these complicated physical phenomena, it is our plan to utilize the flowfield-dependent variation theory mentioned above, facilitated by large eddy simulation. Undoubtedly, the proposed computation requires the most sophisticated computational strategies. The multithreaded domain decomposition parallel processing will be necessary in order to reduce both computational time and storage. Without special treatments involved in computer engineering, our attempt to analyze the airbreathing combustion appears to be difficult, if not impossible.
A method for data handling numerical results in parallel OpenFOAM simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Anton, Alin; Muntean, Sebastian
Parallel computational fluid dynamics simulations produce vast amount of numerical result data. This paper introduces a method for reducing the size of the data by replaying the interprocessor traffic. The results are recovered only in certain regions of interest configured by the user. A known test case is used for several mesh partitioning scenarios using the OpenFOAM toolkit{sup ®}[1]. The space savings obtained with classic algorithms remain constant for more than 60 Gb of floating point data. Our method is most efficient on large simulation meshes and is much better suited for compressing large scale simulation results than the regular algorithms.
Topical perspective on massive threading and parallelism.
Farber, Robert M
2011-09-01
Unquestionably computer architectures have undergone a recent and noteworthy paradigm shift that now delivers multi- and many-core systems with tens to many thousands of concurrent hardware processing elements per workstation or supercomputer node. GPGPU (General Purpose Graphics Processor Unit) technology in particular has attracted significant attention as new software development capabilities, namely CUDA (Compute Unified Device Architecture) and OpenCL™, have made it possible for students as well as small and large research organizations to achieve excellent speedup for many applications over more conventional computing architectures. The current scientific literature reflects this shift with numerous examples of GPGPU applications that have achieved one, two, and in some special cases, three-orders of magnitude increased computational performance through the use of massive threading to exploit parallelism. Multi-core architectures are also evolving quickly to exploit both massive-threading and massive-parallelism such as the 1.3 million threads Blue Waters supercomputer. The challenge confronting scientists in planning future experimental and theoretical research efforts--be they individual efforts with one computer or collaborative efforts proposing to use the largest supercomputers in the world is how to capitalize on these new massively threaded computational architectures--especially as not all computational problems will scale to massive parallelism. In particular, the costs associated with restructuring software (and potentially redesigning algorithms) to exploit the parallelism of these multi- and many-threaded machines must be considered along with application scalability and lifespan. This perspective is an overview of the current state of threading and parallelize with some insight into the future. Published by Elsevier Inc.
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-07-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310-323. doi: 10.1002/wcms.1220.
Concurrent electromagnetic scattering analysis
NASA Technical Reports Server (NTRS)
Patterson, Jean E.; Cwik, Tom; Ferraro, Robert D.; Jacobi, Nathan; Liewer, Paulett C.; Lockhart, Thomas G.; Lyzenga, Gregory A.; Parker, Jay
1989-01-01
The computational power of the hypercube parallel computing architecture is applied to the solution of large-scale electromagnetic scattering and radiation problems. Three analysis codes have been implemented. A Hypercube Electromagnetic Interactive Analysis Workstation was developed to aid in the design and analysis of metallic structures such as antennas and to facilitate the use of these analysis codes. The workstation provides a general user environment for specification of the structure to be analyzed and graphical representations of the results.
Modeling Large Scale Circuits Using Massively Parallel Descrete-Event Simulation
2013-06-01
exascale levels of performance, the smallest elements of a single processor can greatly affect the entire computer system (e.g. its power consumption...grow to exascale levels of performance, the smallest elements of a single processor can greatly affect the entire computer system (e.g. its power...Warp Speed 10.0. 2.0 INTRODUCTION As supercomputer systems approach exascale , the core count will exceed 1024 and number of transistors used in
NASA Astrophysics Data System (ADS)
Hobson, T.; Clarkson, V.
2012-09-01
As a result of continual space activity since the 1950s, there are now a large number of man-made Resident Space Objects (RSOs) orbiting the Earth. Because of the large number of items and their relative speeds, the possibility of destructive collisions involving important space assets is now of significant concern to users and operators of space-borne technologies. As a result, a growing number of international agencies are researching methods for improving techniques to maintain Space Situational Awareness (SSA). Computer simulation is a method commonly used by many countries to validate competing methodologies prior to full scale adoption. The use of supercomputing and/or reduced scale testing is often necessary to effectively simulate such a complex problem on todays computers. Recently the authors presented a simulation aimed at reducing the computational burden by selecting the minimum level of fidelity necessary for contrasting methodologies and by utilising multi-core CPU parallelism for increased computational efficiency. The resulting simulation runs on a single PC while maintaining the ability to effectively evaluate competing methodologies. Nonetheless, the ability to control the scale and expand upon the computational demands of the sensor management system is limited. In this paper, we examine the advantages of increasing the parallelism of the simulation by means of General Purpose computing on Graphics Processing Units (GPGPU). As many sub-processes pertaining to SSA management are independent, we demonstrate how parallelisation via GPGPU has the potential to significantly enhance not only research into techniques for maintaining SSA, but also to enhance the level of sophistication of existing space surveillance sensors and sensor management systems. Nonetheless, the use of GPGPU imposes certain limitations and adds to the implementation complexity, both of which require consideration to achieve an effective system. We discuss these challenges and how they can be overcome. We further describe an application of the parallelised system where visibility prediction is used to enhance sensor management. This facilitates significant improvement in maximum catalogue error when RSOs become temporarily unobservable. The objective is to demonstrate the enhanced scalability and increased computational capability of the system.
NASA Technical Reports Server (NTRS)
Liu, Nan-Suey
2001-01-01
A multi-disciplinary design/analysis tool for combustion systems is critical for optimizing the low-emission, high-performance combustor design process. Based on discussions between then NASA Lewis Research Center and the jet engine companies, an industry-government team was formed in early 1995 to develop the National Combustion Code (NCC), which is an integrated system of computer codes for the design and analysis of combustion systems. NCC has advanced features that address the need to meet designer's requirements such as "assured accuracy", "fast turnaround", and "acceptable cost". The NCC development team is comprised of Allison Engine Company (Allison), CFD Research Corporation (CFDRC), GE Aircraft Engines (GEAE), NASA Glenn Research Center (LeRC), and Pratt & Whitney (P&W). The "unstructured mesh" capability and "parallel computing" are fundamental features of NCC from its inception. The NCC system is composed of a set of "elements" which includes grid generator, main flow solver, turbulence module, turbulence and chemistry interaction module, chemistry module, spray module, radiation heat transfer module, data visualization module, and a post-processor for evaluating engine performance parameters. Each element may have contributions from several team members. Such a multi-source multi-element system needs to be integrated in a way that facilitates inter-module data communication, flexibility in module selection, and ease of integration. The development of the NCC beta version was essentially completed in June 1998. Technical details of the NCC elements are given in the Reference List. Elements such as the baseline flow solver, turbulence module, and the chemistry module, have been extensively validated; and their parallel performance on large-scale parallel systems has been evaluated and optimized. However the scalar PDF module and the Spray module, as well as their coupling with the baseline flow solver, were developed in a small-scale distributed computing environment. As a result, the validation of the NCC beta version as a whole was quite limited. Current effort has been focused on the validation of the integrated code and the evaluation/optimization of its overall performance on large-scale parallel systems.
Towards a large-scale scalable adaptive heart model using shallow tree meshes
NASA Astrophysics Data System (ADS)
Krause, Dorian; Dickopf, Thomas; Potse, Mark; Krause, Rolf
2015-10-01
Electrophysiological heart models are sophisticated computational tools that place high demands on the computing hardware due to the high spatial resolution required to capture the steep depolarization front. To address this challenge, we present a novel adaptive scheme for resolving the deporalization front accurately using adaptivity in space. Our adaptive scheme is based on locally structured meshes. These tensor meshes in space are organized in a parallel forest of trees, which allows us to resolve complicated geometries and to realize high variations in the local mesh sizes with a minimal memory footprint in the adaptive scheme. We discuss both a non-conforming mortar element approximation and a conforming finite element space and present an efficient technique for the assembly of the respective stiffness matrices using matrix representations of the inclusion operators into the product space on the so-called shallow tree meshes. We analyzed the parallel performance and scalability for a two-dimensional ventricle slice as well as for a full large-scale heart model. Our results demonstrate that the method has good performance and high accuracy.
Barreiros, Willian; Teodoro, George; Kurc, Tahsin; Kong, Jun; Melo, Alba C. M. A.; Saltz, Joel
2017-01-01
We investigate efficient sensitivity analysis (SA) of algorithms that segment and classify image features in a large dataset of high-resolution images. Algorithm SA is the process of evaluating variations of methods and parameter values to quantify differences in the output. A SA can be very compute demanding because it requires re-processing the input dataset several times with different parameters to assess variations in output. In this work, we introduce strategies to efficiently speed up SA via runtime optimizations targeting distributed hybrid systems and reuse of computations from runs with different parameters. We evaluate our approach using a cancer image analysis workflow on a hybrid cluster with 256 nodes, each with an Intel Phi and a dual socket CPU. The SA attained a parallel efficiency of over 90% on 256 nodes. The cooperative execution using the CPUs and the Phi available in each node with smart task assignment strategies resulted in an additional speedup of about 2×. Finally, multi-level computation reuse lead to an additional speedup of up to 2.46× on the parallel version. The level of performance attained with the proposed optimizations will allow the use of SA in large-scale studies. PMID:29081725
Parallel simulation of tsunami inundation on a large-scale supercomputer
NASA Astrophysics Data System (ADS)
Oishi, Y.; Imamura, F.; Sugawara, D.
2013-12-01
An accurate prediction of tsunami inundation is important for disaster mitigation purposes. One approach is to approximate the tsunami wave source through an instant inversion analysis using real-time observation data (e.g., Tsushima et al., 2009) and then use the resulting wave source data in an instant tsunami inundation simulation. However, a bottleneck of this approach is the large computational cost of the non-linear inundation simulation and the computational power of recent massively parallel supercomputers is helpful to enable faster than real-time execution of a tsunami inundation simulation. Parallel computers have become approximately 1000 times faster in 10 years (www.top500.org), and so it is expected that very fast parallel computers will be more and more prevalent in the near future. Therefore, it is important to investigate how to efficiently conduct a tsunami simulation on parallel computers. In this study, we are targeting very fast tsunami inundation simulations on the K computer, currently the fastest Japanese supercomputer, which has a theoretical peak performance of 11.2 PFLOPS. One computing node of the K computer consists of 1 CPU with 8 cores that share memory, and the nodes are connected through a high-performance torus-mesh network. The K computer is designed for distributed-memory parallel computation, so we have developed a parallel tsunami model. Our model is based on TUNAMI-N2 model of Tohoku University, which is based on a leap-frog finite difference method. A grid nesting scheme is employed to apply high-resolution grids only at the coastal regions. To balance the computation load of each CPU in the parallelization, CPUs are first allocated to each nested layer in proportion to the number of grid points of the nested layer. Using CPUs allocated to each layer, 1-D domain decomposition is performed on each layer. In the parallel computation, three types of communication are necessary: (1) communication to adjacent neighbours for the finite difference calculation, (2) communication between adjacent layers for the calculations to connect each layer, and (3) global communication to obtain the time step which satisfies the CFL condition in the whole domain. A preliminary test on the K computer showed the parallel efficiency on 1024 cores was 57% relative to 64 cores. We estimate that the parallel efficiency will be considerably improved by applying a 2-D domain decomposition instead of the present 1-D domain decomposition in future work. The present parallel tsunami model was applied to the 2011 Great Tohoku tsunami. The coarsest resolution layer covers a 758 km × 1155 km region with a 405 m grid spacing. A nesting of five layers was used with the resolution ratio of 1/3 between nested layers. The finest resolution region has 5 m resolution and covers most of the coastal region of Sendai city. To complete 2 hours of simulation time, the serial (non-parallel) computation took approximately 4 days on a workstation. To complete the same simulation on 1024 cores of the K computer, it took 45 minutes which is more than two times faster than real-time. This presentation discusses the updated parallel computational performance and the efficient use of the K computer when considering the characteristics of the tsunami inundation simulation model in relation to the characteristics and capabilities of the K computer.
Extreme-Scale Bayesian Inference for Uncertainty Quantification of Complex Simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Biros, George
Uncertainty quantification (UQ)—that is, quantifying uncertainties in complex mathematical models and their large-scale computational implementations—is widely viewed as one of the outstanding challenges facing the field of CS&E over the coming decade. The EUREKA project set to address the most difficult class of UQ problems: those for which both the underlying PDE model as well as the uncertain parameters are of extreme scale. In the project we worked on these extreme-scale challenges in the following four areas: 1. Scalable parallel algorithms for sampling and characterizing the posterior distribution that exploit the structure of the underlying PDEs and parameter-to-observable map. Thesemore » include structure-exploiting versions of the randomized maximum likelihood method, which aims to overcome the intractability of employing conventional MCMC methods for solving extreme-scale Bayesian inversion problems by appealing to and adapting ideas from large-scale PDE-constrained optimization, which have been very successful at exploring high-dimensional spaces. 2. Scalable parallel algorithms for construction of prior and likelihood functions based on learning methods and non-parametric density estimation. Constructing problem-specific priors remains a critical challenge in Bayesian inference, and more so in high dimensions. Another challenge is construction of likelihood functions that capture unmodeled couplings between observations and parameters. We will create parallel algorithms for non-parametric density estimation using high dimensional N-body methods and combine them with supervised learning techniques for the construction of priors and likelihood functions. 3. Bayesian inadequacy models, which augment physics models with stochastic models that represent their imperfections. The success of the Bayesian inference framework depends on the ability to represent the uncertainty due to imperfections of the mathematical model of the phenomena of interest. This is a central challenge in UQ, especially for large-scale models. We propose to develop the mathematical tools to address these challenges in the context of extreme-scale problems. 4. Parallel scalable algorithms for Bayesian optimal experimental design (OED). Bayesian inversion yields quantified uncertainties in the model parameters, which can be propagated forward through the model to yield uncertainty in outputs of interest. This opens the way for designing new experiments to reduce the uncertainties in the model parameters and model predictions. Such experimental design problems have been intractable for large-scale problems using conventional methods; we will create OED algorithms that exploit the structure of the PDE model and the parameter-to-output map to overcome these challenges. Parallel algorithms for these four problems were created, analyzed, prototyped, implemented, tuned, and scaled up for leading-edge supercomputers, including UT-Austin’s own 10 petaflops Stampede system, ANL’s Mira system, and ORNL’s Titan system. While our focus is on fundamental mathematical/computational methods and algorithms, we will assess our methods on model problems derived from several DOE mission applications, including multiscale mechanics and ice sheet dynamics.« less
New Parallel Algorithms for Landscape Evolution Model
NASA Astrophysics Data System (ADS)
Jin, Y.; Zhang, H.; Shi, Y.
2017-12-01
Most landscape evolution models (LEM) developed in the last two decades solve the diffusion equation to simulate the transportation of surface sediments. This numerical approach is difficult to parallelize due to the computation of drainage area for each node, which needs huge amount of communication if run in parallel. In order to overcome this difficulty, we developed two parallel algorithms for LEM with a stream net. One algorithm handles the partition of grid with traditional methods and applies an efficient global reduction algorithm to do the computation of drainage areas and transport rates for the stream net; the other algorithm is based on a new partition algorithm, which partitions the nodes in catchments between processes first, and then partitions the cells according to the partition of nodes. Both methods focus on decreasing communication between processes and take the advantage of massive computing techniques, and numerical experiments show that they are both adequate to handle large scale problems with millions of cells. We implemented the two algorithms in our program based on the widely used finite element library deal.II, so that it can be easily coupled with ASPECT.
Efficient multitasking of Choleski matrix factorization on CRAY supercomputers
NASA Technical Reports Server (NTRS)
Overman, Andrea L.; Poole, Eugene L.
1991-01-01
A Choleski method is described and used to solve linear systems of equations that arise in large scale structural analysis. The method uses a novel variable-band storage scheme and is structured to exploit fast local memory caches while minimizing data access delays between main memory and vector registers. Several parallel implementations of this method are described for the CRAY-2 and CRAY Y-MP computers demonstrating the use of microtasking and autotasking directives. A portable parallel language, FORCE, is used for comparison with the microtasked and autotasked implementations. Results are presented comparing the matrix factorization times for three representative structural analysis problems from runs made in both dedicated and multi-user modes on both computers. CPU and wall clock timings are given for the parallel implementations and are compared to single processor timings of the same algorithm.
Using parallel banded linear system solvers in generalized eigenvalue problems
NASA Technical Reports Server (NTRS)
Zhang, Hong; Moss, William F.
1993-01-01
Subspace iteration is a reliable and cost effective method for solving positive definite banded symmetric generalized eigenproblems, especially in the case of large scale problems. This paper discusses an algorithm that makes use of two parallel banded solvers in subspace iteration. A shift is introduced to decompose the banded linear systems into relatively independent subsystems and to accelerate the iterations. With this shift, an eigenproblem is mapped efficiently into the memories of a multiprocessor and a high speed-up is obtained for parallel implementations. An optimal shift is a shift that balances total computation and communication costs. Under certain conditions, we show how to estimate an optimal shift analytically using the decay rate for the inverse of a banded matrix, and how to improve this estimate. Computational results on iPSC/2 and iPSC/860 multiprocessors are presented.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chrisochoides, N.; Sukup, F.
In this paper we present a parallel implementation of the Bowyer-Watson (BW) algorithm using the task-parallel programming model. The BW algorithm constitutes an ideal mesh refinement strategy for implementing a large class of unstructured mesh generation techniques on both sequential and parallel computers, by preventing the need for global mesh refinement. Its implementation on distributed memory multicomputes using the traditional data-parallel model has been proven very inefficient due to excessive synchronization needed among processors. In this paper we demonstrate that with the task-parallel model we can tolerate synchronization costs inherent to data-parallel methods by exploring concurrency in the processor level.more » Our preliminary performance data indicate that the task- parallel approach: (i) is almost four times faster than the existing data-parallel methods, (ii) scales linearly, and (iii) introduces minimum overheads compared to the {open_quotes}best{close_quotes} sequential implementation of the BW algorithm.« less
Load Balancing Unstructured Adaptive Grids for CFD Problems
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid
1996-01-01
Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. A dynamic load balancing method is presented that balances the workload across all processors with a global view. After each parallel tetrahedral mesh adaption, the method first determines if the new mesh is sufficiently unbalanced to warrant a repartitioning. If so, the adapted mesh is repartitioned, with new partitions assigned to processors so that the redistribution cost is minimized. The new partitions are accepted only if the remapping cost is compensated by the improved load balance. Results indicate that this strategy is effective for large-scale scientific computations on distributed-memory multiprocessors.
Parallel multiscale simulations of a brain aneurysm
Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em
2012-01-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multi-scale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier-Stokes solver εκ αr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers ( εκ αr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future work. PMID:23734066
Parallel multiscale simulations of a brain aneurysm.
Grinberg, Leopold; Fedosov, Dmitry A; Karniadakis, George Em
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multi-scale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier-Stokes solver εκ αr . The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers ( εκ αr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future work.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hasenkamp, Daren; Sim, Alexander; Wehner, Michael
Extensive computing power has been used to tackle issues such as climate changes, fusion energy, and other pressing scientific challenges. These computations produce a tremendous amount of data; however, many of the data analysis programs currently only run a single processor. In this work, we explore the possibility of using the emerging cloud computing platform to parallelize such sequential data analysis tasks. As a proof of concept, we wrap a program for analyzing trends of tropical cyclones in a set of virtual machines (VMs). This approach allows the user to keep their familiar data analysis environment in the VMs, whilemore » we provide the coordination and data transfer services to ensure the necessary input and output are directed to the desired locations. This work extensively exercises the networking capability of the cloud computing systems and has revealed a number of weaknesses in the current cloud system software. In our tests, we are able to scale the parallel data analysis job to a modest number of VMs and achieve a speedup that is comparable to running the same analysis task using MPI. However, compared to MPI based parallelization, the cloud-based approach has a number of advantages. The cloud-based approach is more flexible because the VMs can capture arbitrary software dependencies without requiring the user to rewrite their programs. The cloud-based approach is also more resilient to failure; as long as a single VM is running, it can make progress while as soon as one MPI node fails the whole analysis job fails. In short, this initial work demonstrates that a cloud computing system is a viable platform for distributed scientific data analyses traditionally conducted on dedicated supercomputing systems.« less
Banerjee, Amartya S.; Lin, Lin; Hu, Wei; ...
2016-10-21
The Discontinuous Galerkin (DG) electronic structure method employs an adaptive local basis (ALB) set to solve the Kohn-Sham equations of density functional theory in a discontinuous Galerkin framework. The adaptive local basis is generated on-the-fly to capture the local material physics and can systematically attain chemical accuracy with only a few tens of degrees of freedom per atom. A central issue for large-scale calculations, however, is the computation of the electron density (and subsequently, ground state properties) from the discretized Hamiltonian in an efficient and scalable manner. We show in this work how Chebyshev polynomial filtered subspace iteration (CheFSI) canmore » be used to address this issue and push the envelope in large-scale materials simulations in a discontinuous Galerkin framework. We describe how the subspace filtering steps can be performed in an efficient and scalable manner using a two-dimensional parallelization scheme, thanks to the orthogonality of the DG basis set and block-sparse structure of the DG Hamiltonian matrix. The on-the-fly nature of the ALB functions requires additional care in carrying out the subspace iterations. We demonstrate the parallel scalability of the DG-CheFSI approach in calculations of large-scale twodimensional graphene sheets and bulk three-dimensional lithium-ion electrolyte systems. In conclusion, employing 55 296 computational cores, the time per self-consistent field iteration for a sample of the bulk 3D electrolyte containing 8586 atoms is 90 s, and the time for a graphene sheet containing 11 520 atoms is 75 s.« less
NASA Astrophysics Data System (ADS)
Lanari, Riccardo; Bonano, Manuela; Buonanno, Sabatino; Casu, Francesco; De Luca, Claudio; Fusco, Adele; Manunta, Michele; Manzo, Mariarosaria; Pepe, Antonio; Zinno, Ivana
2017-04-01
The SENTINEL-1 (S1) mission is designed to provide operational capability for continuous mapping of the Earth thanks to its two polar-orbiting satellites (SENTINEL-1A and B) performing C-band synthetic aperture radar (SAR) imaging. It is, indeed, characterized by enhanced revisit frequency, coverage and reliability for operational services and applications requiring long SAR data time series. Moreover, SENTINEL-1 is specifically oriented to interferometry applications with stringent requirements based on attitude and orbit accuracy and it is intrinsically characterized by small spatial and temporal baselines. Consequently, SENTINEL-1 data are particularly suitable to be exploited through advanced interferometric techniques such as the well-known DInSAR algorithm referred to as Small BAseline Subset (SBAS), which allows the generation of deformation time series and displacement velocity maps. In this work we present an advanced interferometric processing chain, based on the Parallel SBAS (P-SBAS) approach, for the massive processing of S1 Interferometric Wide Swath (IWS) data aimed at generating deformation time series in efficient, automatic and systematic way. Such a DInSAR chain is designed to exploit distributed computing infrastructures, and more specifically Cloud Computing environments, to properly deal with the storage and the processing of huge S1 datasets. In particular, since S1 IWS data are acquired with the innovative Terrain Observation with Progressive Scans (TOPS) mode, we could benefit from the structure of S1 data, which are composed by bursts that can be considered as separate acquisitions. Indeed, the processing is intrinsically parallelizable with respect to such independent input data and therefore we basically exploited this coarse granularity parallelization strategy in the majority of the steps of the SBAS processing chain. Moreover, we also implemented more sophisticated parallelization approaches, exploiting both multi-node and multi-core programming techniques. Currently, Cloud Computing environments make available large collections of computing resources and storage that can be effectively exploited through the presented S1 P-SBAS processing chain to carry out interferometric analyses at a very large scale, in reduced time. This allows us to deal also with the problems connected to the use of S1 P-SBAS chain in operational contexts, related to hazard monitoring and risk prevention and mitigation, where handling large amounts of data represents a challenging task. As a significant experimental result we performed a large spatial scale SBAS analysis relevant to the Central and Southern Italy by exploiting the Amazon Web Services Cloud Computing platform. In particular, we processed in parallel 300 S1 acquisitions covering the Italian peninsula from Lazio to Sicily through the presented S1 P-SBAS processing chain, generating 710 interferograms, thus finally obtaining the displacement time series of the whole processed area. This work has been partially supported by the CNR-DPC agreement, the H2020 EPOS-IP project (GA 676564) and the ESA GEP project.
A compositional reservoir simulator on distributed memory parallel computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rame, M.; Delshad, M.
1995-12-31
This paper presents the application of distributed memory parallel computes to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/960 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. Amore » portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes the porting to new parallel platforms straight forward. Results of the distributed memory computing performance of Parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for same problems on a vector supercomputer is also presented.« less
Sharma, Parichit; Mantri, Shrikant S
2014-01-01
The function of a newly sequenced gene can be discovered by determining its sequence homology with known proteins. BLAST is the most extensively used sequence analysis program for sequence similarity search in large databases of sequences. With the advent of next generation sequencing technologies it has now become possible to study genes and their expression at a genome-wide scale through RNA-seq and metagenome sequencing experiments. Functional annotation of all the genes is done by sequence similarity search against multiple protein databases. This annotation task is computationally very intensive and can take days to obtain complete results. The program mpiBLAST, an open-source parallelization of BLAST that achieves superlinear speedup, can be used to accelerate large-scale annotation by using supercomputers and high performance computing (HPC) clusters. Although many parallel bioinformatics applications using the Message Passing Interface (MPI) are available in the public domain, researchers are reluctant to use them due to lack of expertise in the Linux command line and relevant programming experience. With these limitations, it becomes difficult for biologists to use mpiBLAST for accelerating annotation. No web interface is available in the open-source domain for mpiBLAST. We have developed WImpiBLAST, a user-friendly open-source web interface for parallel BLAST searches. It is implemented in Struts 1.3 using a Java backbone and runs atop the open-source Apache Tomcat Server. WImpiBLAST supports script creation and job submission features and also provides a robust job management interface for system administrators. It combines script creation and modification features with job monitoring and management through the Torque resource manager on a Linux-based HPC cluster. Use case information highlights the acceleration of annotation analysis achieved by using WImpiBLAST. Here, we describe the WImpiBLAST web interface features and architecture, explain design decisions, describe workflows and provide a detailed analysis.
Sharma, Parichit; Mantri, Shrikant S.
2014-01-01
The function of a newly sequenced gene can be discovered by determining its sequence homology with known proteins. BLAST is the most extensively used sequence analysis program for sequence similarity search in large databases of sequences. With the advent of next generation sequencing technologies it has now become possible to study genes and their expression at a genome-wide scale through RNA-seq and metagenome sequencing experiments. Functional annotation of all the genes is done by sequence similarity search against multiple protein databases. This annotation task is computationally very intensive and can take days to obtain complete results. The program mpiBLAST, an open-source parallelization of BLAST that achieves superlinear speedup, can be used to accelerate large-scale annotation by using supercomputers and high performance computing (HPC) clusters. Although many parallel bioinformatics applications using the Message Passing Interface (MPI) are available in the public domain, researchers are reluctant to use them due to lack of expertise in the Linux command line and relevant programming experience. With these limitations, it becomes difficult for biologists to use mpiBLAST for accelerating annotation. No web interface is available in the open-source domain for mpiBLAST. We have developed WImpiBLAST, a user-friendly open-source web interface for parallel BLAST searches. It is implemented in Struts 1.3 using a Java backbone and runs atop the open-source Apache Tomcat Server. WImpiBLAST supports script creation and job submission features and also provides a robust job management interface for system administrators. It combines script creation and modification features with job monitoring and management through the Torque resource manager on a Linux-based HPC cluster. Use case information highlights the acceleration of annotation analysis achieved by using WImpiBLAST. Here, we describe the WImpiBLAST web interface features and architecture, explain design decisions, describe workflows and provide a detailed analysis. PMID:24979410
NASA Astrophysics Data System (ADS)
Newman, Gregory A.; Commer, Michael
2009-07-01
Three-dimensional (3D) geophysical imaging is now receiving considerable attention for electrical conductivity mapping of potential offshore oil and gas reservoirs. The imaging technology employs controlled source electromagnetic (CSEM) and magnetotelluric (MT) fields and treats geological media exhibiting transverse anisotropy. Moreover when combined with established seismic methods, direct imaging of reservoir fluids is possible. Because of the size of the 3D conductivity imaging problem, strategies are required exploiting computational parallelism and optimal meshing. The algorithm thus developed has been shown to scale to tens of thousands of processors. In one imaging experiment, 32,768 tasks/processors on the IBM Watson Research Blue Gene/L supercomputer were successfully utilized. Over a 24 hour period we were able to image a large scale field data set that previously required over four months of processing time on distributed clusters based on Intel or AMD processors utilizing 1024 tasks on an InfiniBand fabric. Electrical conductivity imaging using massively parallel computational resources produces results that cannot be obtained otherwise and are consistent with timeframes required for practical exploration problems.
High performance computing applications in neurobiological research
NASA Technical Reports Server (NTRS)
Ross, Muriel D.; Cheng, Rei; Doshay, David G.; Linton, Samuel W.; Montgomery, Kevin; Parnas, Bruce R.
1994-01-01
The human nervous system is a massively parallel processor of information. The vast numbers of neurons, synapses and circuits is daunting to those seeking to understand the neural basis of consciousness and intellect. Pervading obstacles are lack of knowledge of the detailed, three-dimensional (3-D) organization of even a simple neural system and the paucity of large scale, biologically relevant computer simulations. We use high performance graphics workstations and supercomputers to study the 3-D organization of gravity sensors as a prototype architecture foreshadowing more complex systems. Scaled-down simulations run on a Silicon Graphics workstation and scale-up, three-dimensional versions run on the Cray Y-MP and CM5 supercomputers.
Neural Parallel Engine: A toolbox for massively parallel neural signal processing.
Tam, Wing-Kin; Yang, Zhi
2018-05-01
Large-scale neural recordings provide detailed information on neuronal activities and can help elicit the underlying neural mechanisms of the brain. However, the computational burden is also formidable when we try to process the huge data stream generated by such recordings. In this study, we report the development of Neural Parallel Engine (NPE), a toolbox for massively parallel neural signal processing on graphical processing units (GPUs). It offers a selection of the most commonly used routines in neural signal processing such as spike detection and spike sorting, including advanced algorithms such as exponential-component-power-component (EC-PC) spike detection and binary pursuit spike sorting. We also propose a new method for detecting peaks in parallel through a parallel compact operation. Our toolbox is able to offer a 5× to 110× speedup compared with its CPU counterparts depending on the algorithms. A user-friendly MATLAB interface is provided to allow easy integration of the toolbox into existing workflows. Previous efforts on GPU neural signal processing only focus on a few rudimentary algorithms, are not well-optimized and often do not provide a user-friendly programming interface to fit into existing workflows. There is a strong need for a comprehensive toolbox for massively parallel neural signal processing. A new toolbox for massively parallel neural signal processing has been created. It can offer significant speedup in processing signals from large-scale recordings up to thousands of channels. Copyright © 2018 Elsevier B.V. All rights reserved.
García-Grajales, Julián A.; Rucabado, Gabriel; García-Dopico, Antonio; Peña, José-María; Jérusalem, Antoine
2015-01-01
With the growing body of research on traumatic brain injury and spinal cord injury, computational neuroscience has recently focused its modeling efforts on neuronal functional deficits following mechanical loading. However, in most of these efforts, cell damage is generally only characterized by purely mechanistic criteria, functions of quantities such as stress, strain or their corresponding rates. The modeling of functional deficits in neurites as a consequence of macroscopic mechanical insults has been rarely explored. In particular, a quantitative mechanically based model of electrophysiological impairment in neuronal cells, Neurite, has only very recently been proposed. In this paper, we present the implementation details of this model: a finite difference parallel program for simulating electrical signal propagation along neurites under mechanical loading. Following the application of a macroscopic strain at a given strain rate produced by a mechanical insult, Neurite is able to simulate the resulting neuronal electrical signal propagation, and thus the corresponding functional deficits. The simulation of the coupled mechanical and electrophysiological behaviors requires computational expensive calculations that increase in complexity as the network of the simulated cells grows. The solvers implemented in Neurite—explicit and implicit—were therefore parallelized using graphics processing units in order to reduce the burden of the simulation costs of large scale scenarios. Cable Theory and Hodgkin-Huxley models were implemented to account for the electrophysiological passive and active regions of a neurite, respectively, whereas a coupled mechanical model accounting for the neurite mechanical behavior within its surrounding medium was adopted as a link between electrophysiology and mechanics. This paper provides the details of the parallel implementation of Neurite, along with three different application examples: a long myelinated axon, a segmented dendritic tree, and a damaged axon. The capabilities of the program to deal with large scale scenarios, segmented neuronal structures, and functional deficits under mechanical loading are specifically highlighted. PMID:25680098
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rubel, Oliver; Loring, Burlen; Vay, Jean -Luc
The generation of short pulses of ion beams through the interaction of an intense laser with a plasma sheath offers the possibility of compact and cheaper ion sources for many applications--from fast ignition and radiography of dense targets to hadron therapy and injection into conventional accelerators. To enable the efficient analysis of large-scale, high-fidelity particle accelerator simulations using the Warp simulation suite, the authors introduce the Warp In situ Visualization Toolkit (WarpIV). WarpIV integrates state-of-the-art in situ visualization and analysis using VisIt with Warp, supports management and control of complex in situ visualization and analysis workflows, and implements integrated analyticsmore » to facilitate query- and feature-based data analytics and efficient large-scale data analysis. WarpIV enables for the first time distributed parallel, in situ visualization of the full simulation data using high-performance compute resources as the data is being generated by Warp. The authors describe the application of WarpIV to study and compare large 2D and 3D ion accelerator simulations, demonstrating significant differences in the acceleration process in 2D and 3D simulations. WarpIV is available to the public via https://bitbucket.org/berkeleylab/warpiv. The Warp In situ Visualization Toolkit (WarpIV) supports large-scale, parallel, in situ visualization and analysis and facilitates query- and feature-based analytics, enabling for the first time high-performance analysis of large-scale, high-fidelity particle accelerator simulations while the data is being generated by the Warp simulation suite. Furthermore, this supplemental material https://extras.computer.org/extra/mcg2016030022s1.pdf provides more details regarding the memory profiling and optimization and the Yee grid recentering optimization results discussed in the main article.« less
WarpIV: In situ visualization and analysis of ion accelerator simulations
Rubel, Oliver; Loring, Burlen; Vay, Jean -Luc; ...
2016-05-09
The generation of short pulses of ion beams through the interaction of an intense laser with a plasma sheath offers the possibility of compact and cheaper ion sources for many applications--from fast ignition and radiography of dense targets to hadron therapy and injection into conventional accelerators. To enable the efficient analysis of large-scale, high-fidelity particle accelerator simulations using the Warp simulation suite, the authors introduce the Warp In situ Visualization Toolkit (WarpIV). WarpIV integrates state-of-the-art in situ visualization and analysis using VisIt with Warp, supports management and control of complex in situ visualization and analysis workflows, and implements integrated analyticsmore » to facilitate query- and feature-based data analytics and efficient large-scale data analysis. WarpIV enables for the first time distributed parallel, in situ visualization of the full simulation data using high-performance compute resources as the data is being generated by Warp. The authors describe the application of WarpIV to study and compare large 2D and 3D ion accelerator simulations, demonstrating significant differences in the acceleration process in 2D and 3D simulations. WarpIV is available to the public via https://bitbucket.org/berkeleylab/warpiv. The Warp In situ Visualization Toolkit (WarpIV) supports large-scale, parallel, in situ visualization and analysis and facilitates query- and feature-based analytics, enabling for the first time high-performance analysis of large-scale, high-fidelity particle accelerator simulations while the data is being generated by the Warp simulation suite. Furthermore, this supplemental material https://extras.computer.org/extra/mcg2016030022s1.pdf provides more details regarding the memory profiling and optimization and the Yee grid recentering optimization results discussed in the main article.« less
Leveraging human oversight and intervention in large-scale parallel processing of open-source data
NASA Astrophysics Data System (ADS)
Casini, Enrico; Suri, Niranjan; Bradshaw, Jeffrey M.
2015-05-01
The popularity of cloud computing along with the increased availability of cheap storage have led to the necessity of elaboration and transformation of large volumes of open-source data, all in parallel. One way to handle such extensive volumes of information properly is to take advantage of distributed computing frameworks like Map-Reduce. Unfortunately, an entirely automated approach that excludes human intervention is often unpredictable and error prone. Highly accurate data processing and decision-making can be achieved by supporting an automatic process through human collaboration, in a variety of environments such as warfare, cyber security and threat monitoring. Although this mutual participation seems easily exploitable, human-machine collaboration in the field of data analysis presents several challenges. First, due to the asynchronous nature of human intervention, it is necessary to verify that once a correction is made, all the necessary reprocessing is done in chain. Second, it is often needed to minimize the amount of reprocessing in order to optimize the usage of resources due to limited availability. In order to improve on these strict requirements, this paper introduces improvements to an innovative approach for human-machine collaboration in the processing of large amounts of open-source data in parallel.
Genetic Parallel Programming: design and implementation.
Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong
2006-01-01
This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.
Parallel-In-Time For Moving Meshes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Falgout, R. D.; Manteuffel, T. A.; Southworth, B.
2016-02-04
With steadily growing computational resources available, scientists must develop e ective ways to utilize the increased resources. High performance, highly parallel software has be- come a standard. However until recent years parallelism has focused primarily on the spatial domain. When solving a space-time partial di erential equation (PDE), this leads to a sequential bottleneck in the temporal dimension, particularly when taking a large number of time steps. The XBraid parallel-in-time library was developed as a practical way to add temporal parallelism to existing se- quential codes with only minor modi cations. In this work, a rezoning-type moving mesh is appliedmore » to a di usion problem and formulated in a parallel-in-time framework. Tests and scaling studies are run using XBraid and demonstrate excellent results for the simple model problem considered herein.« less
Developing eThread pipeline using SAGA-pilot abstraction for large-scale structural bioinformatics.
Ragothaman, Anjani; Boddu, Sairam Chowdary; Kim, Nayong; Feinstein, Wei; Brylinski, Michal; Jha, Shantenu; Kim, Joohyun
2014-01-01
While most of computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because of predicted structural information that could uncover the underlying function. However, threading tools are generally compute-intensive and the number of protein sequences from even small genomes such as prokaryotes is large typically containing many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread--a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize computational complexity of eThread and EC2 infrastructure. Based on results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly, amenable for small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.
Developing eThread Pipeline Using SAGA-Pilot Abstraction for Large-Scale Structural Bioinformatics
Ragothaman, Anjani; Feinstein, Wei; Jha, Shantenu; Kim, Joohyun
2014-01-01
While most of computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because of predicted structural information that could uncover the underlying function. However, threading tools are generally compute-intensive and the number of protein sequences from even small genomes such as prokaryotes is large typically containing many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread—a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize computational complexity of eThread and EC2 infrastructure. Based on results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly, amenable for small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure. PMID:24995285
NASA Astrophysics Data System (ADS)
Loring, B.; Karimabadi, H.; Rortershteyn, V.
2015-10-01
The surface line integral convolution(LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.
Design and Verification of Remote Sensing Image Data Center Storage Architecture Based on Hadoop
NASA Astrophysics Data System (ADS)
Tang, D.; Zhou, X.; Jing, Y.; Cong, W.; Li, C.
2018-04-01
The data center is a new concept of data processing and application proposed in recent years. It is a new method of processing technologies based on data, parallel computing, and compatibility with different hardware clusters. While optimizing the data storage management structure, it fully utilizes cluster resource computing nodes and improves the efficiency of data parallel application. This paper used mature Hadoop technology to build a large-scale distributed image management architecture for remote sensing imagery. Using MapReduce parallel processing technology, it called many computing nodes to process image storage blocks and pyramids in the background to improve the efficiency of image reading and application and sovled the need for concurrent multi-user high-speed access to remotely sensed data. It verified the rationality, reliability and superiority of the system design by testing the storage efficiency of different image data and multi-users and analyzing the distributed storage architecture to improve the application efficiency of remote sensing images through building an actual Hadoop service system.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Loring, Burlen; Karimabadi, Homa; Rortershteyn, Vadim
2014-07-01
The surface line integral convolution(LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not.more » We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.« less
Womack, James C; Anton, Lucian; Dziedzic, Jacek; Hasnip, Phil J; Probert, Matt I J; Skylaris, Chris-Kriton
2018-03-13
The solution of the Poisson equation is a crucial step in electronic structure calculations, yielding the electrostatic potential-a key component of the quantum mechanical Hamiltonian. In recent decades, theoretical advances and increases in computer performance have made it possible to simulate the electronic structure of extended systems in complex environments. This requires the solution of more complicated variants of the Poisson equation, featuring nonhomogeneous dielectric permittivities, ionic concentrations with nonlinear dependencies, and diverse boundary conditions. The analytic solutions generally used to solve the Poisson equation in vacuum (or with homogeneous permittivity) are not applicable in these circumstances, and numerical methods must be used. In this work, we present DL_MG, a flexible, scalable, and accurate solver library, developed specifically to tackle the challenges of solving the Poisson equation in modern large-scale electronic structure calculations on parallel computers. Our solver is based on the multigrid approach and uses an iterative high-order defect correction method to improve the accuracy of solutions. Using two chemically relevant model systems, we tested the accuracy and computational performance of DL_MG when solving the generalized Poisson and Poisson-Boltzmann equations, demonstrating excellent agreement with analytic solutions and efficient scaling to ∼10 9 unknowns and 100s of CPU cores. We also applied DL_MG in actual large-scale electronic structure calculations, using the ONETEP linear-scaling electronic structure package to study a 2615 atom protein-ligand complex with routinely available computational resources. In these calculations, the overall execution time with DL_MG was not significantly greater than the time required for calculations using a conventional FFT-based solver.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chow, Edmond
Solving sparse problems is at the core of many DOE computational science applications. We focus on the challenge of developing sparse algorithms that can fully exploit the parallelism in extreme-scale computing systems, in particular systems with massive numbers of cores per node. Our approach is to express a sparse matrix factorization as a large number of bilinear constraint equations, and then solving these equations via an asynchronous iterative method. The unknowns in these equations are the matrix entries of the factorization that is desired.
Predictive wind turbine simulation with an adaptive lattice Boltzmann method for moving boundaries
NASA Astrophysics Data System (ADS)
Deiterding, Ralf; Wood, Stephen L.
2016-09-01
Operating horizontal axis wind turbines create large-scale turbulent wake structures that affect the power output of downwind turbines considerably. The computational prediction of this phenomenon is challenging as efficient low dissipation schemes are necessary that represent the vorticity production by the moving structures accurately and that are able to transport wakes without significant artificial decay over distances of several rotor diameters. We have developed a parallel adaptive lattice Boltzmann method for large eddy simulation of turbulent weakly compressible flows with embedded moving structures that considers these requirements rather naturally and enables first principle simulations of wake-turbine interaction phenomena at reasonable computational costs. The paper describes the employed computational techniques and presents validation simulations for the Mexnext benchmark experiments as well as simulations of the wake propagation in the Scaled Wind Farm Technology (SWIFT) array consisting of three Vestas V27 turbines in triangular arrangement.
TDat: An Efficient Platform for Processing Petabyte-Scale Whole-Brain Volumetric Images.
Li, Yuxin; Gong, Hui; Yang, Xiaoquan; Yuan, Jing; Jiang, Tao; Li, Xiangning; Sun, Qingtao; Zhu, Dan; Wang, Zhenyu; Luo, Qingming; Li, Anan
2017-01-01
Three-dimensional imaging of whole mammalian brains at single-neuron resolution has generated terabyte (TB)- and even petabyte (PB)-sized datasets. Due to their size, processing these massive image datasets can be hindered by the computer hardware and software typically found in biological laboratories. To fill this gap, we have developed an efficient platform named TDat, which adopts a novel data reformatting strategy by reading cuboid data and employing parallel computing. In data reformatting, TDat is more efficient than any other software. In data accessing, we adopted parallelization to fully explore the capability for data transmission in computers. We applied TDat in large-volume data rigid registration and neuron tracing in whole-brain data with single-neuron resolution, which has never been demonstrated in other studies. We also showed its compatibility with various computing platforms, image processing software and imaging systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Chao; Pouransari, Hadi; Rajamanickam, Sivasankaran
We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by everymore » processor. We also provide various numerical results to demonstrate the versatility and scalability of the parallel algorithm.« less
Large Scale Document Inversion using a Multi-threaded Computing System
Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won
2018-01-01
Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. CCS Concepts •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations. PMID:29861701
Large Scale Document Inversion using a Multi-threaded Computing System.
Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won
2017-06-01
Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations.
Operation of the Institute for Computer Applications in Science and Engineering
NASA Technical Reports Server (NTRS)
1975-01-01
The ICASE research program is described in detail; it consists of four major categories: (1) efficient use of vector and parallel computers, with particular emphasis on the CDC STAR-100; (2) numerical analysis, with particular emphasis on the development and analysis of basic numerical algorithms; (3) analysis and planning of large-scale software systems; and (4) computational research in engineering and the natural sciences, with particular emphasis on fluid dynamics. The work in each of these areas is described in detail; other activities are discussed, a prognosis of future activities are included.
Aquilante, Francesco; Autschbach, Jochen; Carlson, Rebecca K; Chibotaru, Liviu F; Delcey, Mickaël G; De Vico, Luca; Fdez Galván, Ignacio; Ferré, Nicolas; Frutos, Luis Manuel; Gagliardi, Laura; Garavelli, Marco; Giussani, Angelo; Hoyer, Chad E; Li Manni, Giovanni; Lischka, Hans; Ma, Dongxia; Malmqvist, Per Åke; Müller, Thomas; Nenov, Artur; Olivucci, Massimo; Pedersen, Thomas Bondo; Peng, Daoling; Plasser, Felix; Pritchard, Ben; Reiher, Markus; Rivalta, Ivan; Schapiro, Igor; Segarra-Martí, Javier; Stenrup, Michael; Truhlar, Donald G; Ungur, Liviu; Valentini, Alessio; Vancoillie, Steven; Veryazov, Valera; Vysotskiy, Victor P; Weingart, Oliver; Zapata, Felipe; Lindh, Roland
2016-02-15
In this report, we summarize and describe the recent unique updates and additions to the Molcas quantum chemistry program suite as contained in release version 8. These updates include natural and spin orbitals for studies of magnetic properties, local and linear scaling methods for the Douglas-Kroll-Hess transformation, the generalized active space concept in MCSCF methods, a combination of multiconfigurational wave functions with density functional theory in the MC-PDFT method, additional methods for computation of magnetic properties, methods for diabatization, analytical gradients of state average complete active space SCF in association with density fitting, methods for constrained fragment optimization, large-scale parallel multireference configuration interaction including analytic gradients via the interface to the Columbus package, and approximations of the CASPT2 method to be used for computations of large systems. In addition, the report includes the description of a computational machinery for nonlinear optical spectroscopy through an interface to the QM/MM package Cobramm. Further, a module to run molecular dynamics simulations is added, two surface hopping algorithms are included to enable nonadiabatic calculations, and the DQ method for diabatization is added. Finally, we report on the subject of improvements with respects to alternative file options and parallelization. © 2015 Wiley Periodicals, Inc.
Large-Scale, Parallel, Multi-Sensor Data Fusion in the Cloud
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Manipon, G.; Hua, H.
2012-12-01
NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time "matchups" between instruments swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, assemble merged datasets, and compute fused products for further scientific and statistical analysis. To efficiently assemble such decade-scale datasets in a timely manner, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. "SciReduce" is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, in which simple tuples (keys & values) are passed between the map and reduce functions, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Thus, SciReduce uses the native datatypes (geolocated grids, swaths, and points) that geo-scientists are familiar with. We are deploying within SciReduce a versatile set of python operators for data lookup, access, subsetting, co-registration, mining, fusion, and statistical analysis. All operators take in sets of geo-located arrays and generate more arrays. Large, multi-year satellite and model datasets are automatically "sharded" by time and space across a cluster of nodes so that years of data (millions of granules) can be compared or fused in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP or webification URLs, thereby minimizing the size of the stored input and intermediate datasets. A typical map function might assemble and quality control AIRS Level-2 water vapor profiles for a year of data in parallel, then a reduce function would average the profiles in lat/lon bins (again, in parallel), and a final reduce would aggregate the climatology and write it to output files. We are using SciReduce to automate the production of multiple versions of a multi-year water vapor climatology (AIRS & MODIS), stratified by Cloudsat cloud classification, and compare it to models (ECMWF & MERRA reanalysis). We will present the architecture of SciReduce, describe the achieved "clock time" speedups in fusing huge datasets on our own nodes and in the Amazon Cloud, and discuss the Cloud cost tradeoffs for storage, compute, and data transfer.
Large-Scale, Parallel, Multi-Sensor Data Fusion in the Cloud
NASA Astrophysics Data System (ADS)
Wilson, B.; Manipon, G.; Hua, H.
2012-04-01
NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time "matchups" between instruments swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, assemble merged datasets, and compute fused products for further scientific and statistical analysis. To efficiently assemble such decade-scale datasets in a timely manner, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. "SciReduce" is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, in which simple tuples (keys & values) are passed between the map and reduce functions, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Thus, SciReduce uses the native datatypes (geolocated grids, swaths, and points) that geo-scientists are familiar with. We are deploying within SciReduce a versatile set of python operators for data lookup, access, subsetting, co-registration, mining, fusion, and statistical analysis. All operators take in sets of geo-arrays and generate more arrays. Large, multi-year satellite and model datasets are automatically "sharded" by time and space across a cluster of nodes so that years of data (millions of granules) can be compared or fused in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP or webification URLs, thereby minimizing the size of the stored input and intermediate datasets. A typical map function might assemble and quality control AIRS Level-2 water vapor profiles for a year of data in parallel, then a reduce function would average the profiles in bins (again, in parallel), and a final reduce would aggregate the climatology and write it to output files. We are using SciReduce to automate the production of multiple versions of a multi-year water vapor climatology (AIRS & MODIS), stratified by Cloudsat cloud classification, and compare it to models (ECMWF & MERRA reanalysis). We will present the architecture of SciReduce, describe the achieved "clock time" speedups in fusing huge datasets on our own nodes and in the Amazon Cloud, and discuss the Cloud cost tradeoffs for storage, compute, and data transfer.
Substructured multibody molecular dynamics.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grest, Gary Stephen; Stevens, Mark Jackson; Plimpton, Steven James
2006-11-01
We have enhanced our parallel molecular dynamics (MD) simulation software LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator, lammps.sandia.gov) to include many new features for accelerated simulation including articulated rigid body dynamics via coupling to the Rensselaer Polytechnic Institute code POEMS (Parallelizable Open-source Efficient Multibody Software). We use new features of the LAMMPS software package to investigate rhodopsin photoisomerization, and water model surface tension and capillary waves at the vapor-liquid interface. Finally, we motivate the recipes of MD for practitioners and researchers in numerical analysis and computational mechanics.
1987-12-01
8217ftp.. *,*IS ~. ~bw ~ ft.. p ’ft ’ft ft.. ’ft *I~ P* ’ft ’p 0n-I ci via 1 ca j I .11’ ft~ ’ fttH vialca *- ’ft ft..I ft. ’ft ft.. --ft ..ft ’ft ftp
Block-Parallel Data Analysis with DIY2
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morozov, Dmitriy; Peterka, Tom
DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial,more » parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.« less
DIALIGN P: fast pair-wise and multiple sequence alignment using parallel processors.
Schmollinger, Martin; Nieselt, Kay; Kaufmann, Michael; Morgenstern, Burkhard
2004-09-09
Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristics by splitting up sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.
Big data driven cycle time parallel prediction for production planning in wafer manufacturing
NASA Astrophysics Data System (ADS)
Wang, Junliang; Yang, Jungang; Zhang, Jie; Wang, Xiaoxi; Zhang, Wenjun Chris
2018-07-01
Cycle time forecasting (CTF) is one of the most crucial issues for production planning to keep high delivery reliability in semiconductor wafer fabrication systems (SWFS). This paper proposes a novel data-intensive cycle time (CT) prediction system with parallel computing to rapidly forecast the CT of wafer lots with large datasets. First, a density peak based radial basis function network (DP-RBFN) is designed to forecast the CT with the diverse and agglomerative CT data. Second, the network learning method based on a clustering technique is proposed to determine the density peak. Third, a parallel computing approach for network training is proposed in order to speed up the training process with large scaled CT data. Finally, an experiment with respect to SWFS is presented, which demonstrates that the proposed CTF system can not only speed up the training process of the model but also outperform the radial basis function network, the back-propagation-network and multivariate regression methodology based CTF methods in terms of the mean absolute deviation and standard deviation.
NASA Technical Reports Server (NTRS)
Krasteva, Denitza T.
1998-01-01
Multidisciplinary design optimization (MDO) for large-scale engineering problems poses many challenges (e.g., the design of an efficient concurrent paradigm for global optimization based on disciplinary analyses, expensive computations over vast data sets, etc.) This work focuses on the application of distributed schemes for massively parallel architectures to MDO problems, as a tool for reducing computation time and solving larger problems. The specific problem considered here is configuration optimization of a high speed civil transport (HSCT), and the efficient parallelization of the embedded paradigm for reasonable design space identification. Two distributed dynamic load balancing techniques (random polling and global round robin with message combining) and two necessary termination detection schemes (global task count and token passing) were implemented and evaluated in terms of effectiveness and scalability to large problem sizes and a thousand processors. The effect of certain parameters on execution time was also inspected. Empirical results demonstrated stable performance and effectiveness for all schemes, and the parametric study showed that the selected algorithmic parameters have a negligible effect on performance.
A parallel computer implementation of fast low-rank QR approximation of the Biot-Savart law
DOE Office of Scientific and Technical Information (OSTI.GOV)
White, D A; Fasenfest, B J; Stowell, M L
2005-11-07
In this paper we present a low-rank QR method for evaluating the discrete Biot-Savart law on parallel computers. It is assumed that the known current density and the unknown magnetic field are both expressed in a finite element expansion, and we wish to compute the degrees-of-freedom (DOF) in the basis function expansion of the magnetic field. The matrix that maps the current DOF to the field DOF is full, but if the spatial domain is properly partitioned the matrix can be written as a block matrix, with blocks representing distant interactions being low rank and having a compressed QR representation.more » The matrix partitioning is determined by the number of processors, the rank of each block (i.e. the compression) is determined by the specific geometry and is computed dynamically. In this paper we provide the algorithmic details and present computational results for large-scale computations.« less
Drawert, Brian; Trogdon, Michael; Toor, Salman; Petzold, Linda; Hellander, Andreas
2017-01-01
Computational experiments using spatial stochastic simulations have led to important new biological insights, but they require specialized tools and a complex software stack, as well as large and scalable compute and data analysis resources due to the large computational cost associated with Monte Carlo computational workflows. The complexity of setting up and managing a large-scale distributed computation environment to support productive and reproducible modeling can be prohibitive for practitioners in systems biology. This results in a barrier to the adoption of spatial stochastic simulation tools, effectively limiting the type of biological questions addressed by quantitative modeling. In this paper, we present PyURDME, a new, user-friendly spatial modeling and simulation package, and MOLNs, a cloud computing appliance for distributed simulation of stochastic reaction-diffusion models. MOLNs is based on IPython and provides an interactive programming platform for development of sharable and reproducible distributed parallel computational experiments. PMID:28190948
Graph-based linear scaling electronic structure theory.
Niklasson, Anders M N; Mniszewski, Susan M; Negre, Christian F A; Cawkwell, Marc J; Swart, Pieter J; Mohd-Yusof, Jamal; Germann, Timothy C; Wall, Michael E; Bock, Nicolas; Rubensson, Emanuel H; Djidjev, Hristo
2016-06-21
We show how graph theory can be combined with quantum theory to calculate the electronic structure of large complex systems. The graph formalism is general and applicable to a broad range of electronic structure methods and materials, including challenging systems such as biomolecules. The methodology combines well-controlled accuracy, low computational cost, and natural low-communication parallelism. This combination addresses substantial shortcomings of linear scaling electronic structure theory, in particular with respect to quantum-based molecular dynamics simulations.
Graph-based linear scaling electronic structure theory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Niklasson, Anders M. N., E-mail: amn@lanl.gov; Negre, Christian F. A.; Cawkwell, Marc J.
2016-06-21
We show how graph theory can be combined with quantum theory to calculate the electronic structure of large complex systems. The graph formalism is general and applicable to a broad range of electronic structure methods and materials, including challenging systems such as biomolecules. The methodology combines well-controlled accuracy, low computational cost, and natural low-communication parallelism. This combination addresses substantial shortcomings of linear scaling electronic structure theory, in particular with respect to quantum-based molecular dynamics simulations.
Harvey, Benjamin Simeon; Ji, Soo-Yeon
2017-01-01
As microarray data available to scientists continues to increase in size and complexity, it has become overwhelmingly important to find multiple ways to bring forth oncological inference to the bioinformatics community through the analysis of large-scale cancer genomic (LSCG) DNA and mRNA microarray data that is useful to scientists. Though there have been many attempts to elucidate the issue of bringing forth biological interpretation by means of wavelet preprocessing and classification, there has not been a research effort that focuses on a cloud-scale distributed parallel (CSDP) separable 1-D wavelet decomposition technique for denoising through differential expression thresholding and classification of LSCG microarray data. This research presents a novel methodology that utilizes a CSDP separable 1-D method for wavelet-based transformation in order to initialize a threshold which will retain significantly expressed genes through the denoising process for robust classification of cancer patients. Additionally, the overall study was implemented and encompassed within CSDP environment. The utilization of cloud computing and wavelet-based thresholding for denoising was used for the classification of samples within the Global Cancer Map, Cancer Cell Line Encyclopedia, and The Cancer Genome Atlas. The results proved that separable 1-D parallel distributed wavelet denoising in the cloud and differential expression thresholding increased the computational performance and enabled the generation of higher quality LSCG microarray datasets, which led to more accurate classification results.
Scientific Services on the Cloud
NASA Astrophysics Data System (ADS)
Chapman, David; Joshi, Karuna P.; Yesha, Yelena; Halem, Milt; Yesha, Yaacov; Nguyen, Phuong
Scientific Computing was one of the first every applications for parallel and distributed computation. To this date, scientific applications remain some of the most compute intensive, and have inspired creation of petaflop compute infrastructure such as the Oak Ridge Jaguar and Los Alamos RoadRunner. Large dedicated hardware infrastructure has become both a blessing and a curse to the scientific community. Scientists are interested in cloud computing for much the same reason as businesses and other professionals. The hardware is provided, maintained, and administrated by a third party. Software abstraction and virtualization provide reliability, and fault tolerance. Graduated fees allow for multi-scale prototyping and execution. Cloud computing resources are only a few clicks away, and by far the easiest high performance distributed platform to gain access to. There may still be dedicated infrastructure for ultra-scale science, but the cloud can easily play a major part of the scientific computing initiative.
PARALLEL HOP: A SCALABLE HALO FINDER FOR MASSIVE COSMOLOGICAL DATA SETS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Skory, Stephen; Turk, Matthew J.; Norman, Michael L.
2010-11-15
Modern N-body cosmological simulations contain billions (10{sup 9}) of dark matter particles. These simulations require hundreds to thousands of gigabytes of memory and employ hundreds to tens of thousands of processing cores on many compute nodes. In order to study the distribution of dark matter in a cosmological simulation, the dark matter halos must be identified using a halo finder, which establishes the halo membership of every particle in the simulation. The resources required for halo finding are similar to the requirements for the simulation itself. In particular, simulations have become too extensive to use commonly employed halo finders, suchmore » that the computational requirements to identify halos must now be spread across multiple nodes and cores. Here, we present a scalable-parallel halo finding method called Parallel HOP for large-scale cosmological simulation data. Based on the halo finder HOP, it utilizes message passing interface and domain decomposition to distribute the halo finding workload across multiple compute nodes, enabling analysis of much larger data sets than is possible with the strictly serial or previous parallel implementations of HOP. We provide a reference implementation of this method as a part of the toolkit {sup yt}, an analysis toolkit for adaptive mesh refinement data that include complementary analysis modules. Additionally, we discuss a suite of benchmarks that demonstrate that this method scales well up to several hundred tasks and data sets in excess of 2000{sup 3} particles. The Parallel HOP method and our implementation can be readily applied to any kind of N-body simulation data and is therefore widely applicable.« less
Parallel Adjective High-Order CFD Simulations Characterizing SOFIA Cavity Acoustics
NASA Technical Reports Server (NTRS)
Barad, Michael F.; Brehm, Christoph; Kiris, Cetin C.; Biswas, Rupak
2016-01-01
This paper presents large-scale MPI-parallel computational uid dynamics simulations for the Stratospheric Observatory for Infrared Astronomy (SOFIA). SOFIA is an airborne, 2.5-meter infrared telescope mounted in an open cavity in the aft fuselage of a Boeing 747SP. These simulations focus on how the unsteady ow eld inside and over the cavity interferes with the optical path and mounting structure of the telescope. A temporally fourth-order accurate Runge-Kutta, and spatially fth-order accurate WENO- 5Z scheme was used to perform implicit large eddy simulations. An immersed boundary method provides automated gridding for complex geometries and natural coupling to a block-structured Cartesian adaptive mesh re nement framework. Strong scaling studies using NASA's Pleiades supercomputer with up to 32k CPU cores and 4 billion compu- tational cells shows excellent scaling. Dynamic load balancing based on execution time on individual AMR blocks addresses irregular numerical cost associated with blocks con- taining boundaries. Limits to scaling beyond 32k cores are identi ed, and targeted code optimizations are discussed.
Parallel Adaptive High-Order CFD Simulations Characterizing SOFIA Cavitiy Acoustics
NASA Technical Reports Server (NTRS)
Barad, Michael F.; Brehm, Christoph; Kiris, Cetin C.; Biswas, Rupak
2015-01-01
This paper presents large-scale MPI-parallel computational uid dynamics simulations for the Stratospheric Observatory for Infrared Astronomy (SOFIA). SOFIA is an airborne, 2.5-meter infrared telescope mounted in an open cavity in the aft fuselage of a Boeing 747SP. These simulations focus on how the unsteady ow eld inside and over the cavity interferes with the optical path and mounting structure of the telescope. A tempo- rally fourth-order accurate Runge-Kutta, and a spatially fth-order accurate WENO-5Z scheme were used to perform implicit large eddy simulations. An immersed boundary method provides automated gridding for complex geometries and natural coupling to a block-structured Cartesian adaptive mesh re nement framework. Strong scaling studies using NASA's Pleiades supercomputer with up to 32k CPU cores and 4 billion compu- tational cells shows excellent scaling. Dynamic load balancing based on execution time on individual AMR blocks addresses irregular numerical cost associated with blocks con- taining boundaries. Limits to scaling beyond 32k cores are identi ed, and targeted code optimizations are discussed.
Eddy diffusivity of quasi-neutrally-buoyant inertial particles
NASA Astrophysics Data System (ADS)
Martins Afonso, Marco; Muratore-Ginanneschi, Paolo; Gama, Sílvio M. A.; Mazzino, Andrea
2018-04-01
We investigate the large-scale transport properties of quasi-neutrally-buoyant inertial particles carried by incompressible zero-mean periodic or steady ergodic flows. We show how to compute large-scale indicators such as the inertial-particle terminal velocity and eddy diffusivity from first principles in a perturbative expansion around the limit of added-mass factor close to unity. Physically, this limit corresponds to the case where the mass density of the particles is constant and close in value to the mass density of the fluid, which is also constant. Our approach differs from the usual over-damped expansion inasmuch as we do not assume a separation of time scales between thermalization and small-scale convection effects. For a general flow in the class of incompressible zero-mean periodic velocity fields, we derive closed-form cell equations for the auxiliary quantities determining the terminal velocity and effective diffusivity. In the special case of parallel flows these equations admit explicit analytic solution. We use parallel flows to show that our approach sheds light onto the behavior of terminal velocity and effective diffusivity for Stokes numbers of the order of unity.
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-01-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310–323. doi: 10.1002/wcms.1220 PMID:26753008
A General-purpose Framework for Parallel Processing of Large-scale LiDAR Data
NASA Astrophysics Data System (ADS)
Li, Z.; Hodgson, M.; Li, W.
2016-12-01
Light detection and ranging (LiDAR) technologies have proven efficiency to quickly obtain very detailed Earth surface data for a large spatial extent. Such data is important for scientific discoveries such as Earth and ecological sciences and natural disasters and environmental applications. However, handling LiDAR data poses grand geoprocessing challenges due to data intensity and computational intensity. Previous studies received notable success on parallel processing of LiDAR data to these challenges. However, these studies either relied on high performance computers and specialized hardware (GPUs) or focused mostly on finding customized solutions for some specific algorithms. We developed a general-purpose scalable framework coupled with sophisticated data decomposition and parallelization strategy to efficiently handle big LiDAR data. Specifically, 1) a tile-based spatial index is proposed to manage big LiDAR data in the scalable and fault-tolerable Hadoop distributed file system, 2) two spatial decomposition techniques are developed to enable efficient parallelization of different types of LiDAR processing tasks, and 3) by coupling existing LiDAR processing tools with Hadoop, this framework is able to conduct a variety of LiDAR data processing tasks in parallel in a highly scalable distributed computing environment. The performance and scalability of the framework is evaluated with a series of experiments conducted on a real LiDAR dataset using a proof-of-concept prototype system. The results show that the proposed framework 1) is able to handle massive LiDAR data more efficiently than standalone tools; and 2) provides almost linear scalability in terms of either increased workload (data volume) or increased computing nodes with both spatial decomposition strategies. We believe that the proposed framework provides valuable references on developing a collaborative cyberinfrastructure for processing big earth science data in a highly scalable environment.
Petascale turbulence simulation using a highly parallel fast multipole method on GPUs
NASA Astrophysics Data System (ADS)
Yokota, Rio; Barba, L. A.; Narumi, Tetsu; Yasuoka, Kenji
2013-03-01
This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on GPU hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 40963 computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as the numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the FMM-based vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per MPI process, 3 GPUs per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of MPI processes (using only CPU cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 s for the vortex method and 154 s for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex-method calculations to date.
Paradigms and strategies for scientific computing on distributed memory concurrent computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Foster, I.T.; Walker, D.W.
1994-06-01
In this work we examine recent advances in parallel languages and abstractions that have the potential for improving the programmability and maintainability of large-scale, parallel, scientific applications running on high performance architectures and networks. This paper focuses on Fortran M, a set of extensions to Fortran 77 that supports the modular design of message-passing programs. We describe the Fortran M implementation of a particle-in-cell (PIC) plasma simulation application, and discuss issues in the optimization of the code. The use of two other methodologies for parallelizing the PIC application are considered. The first is based on the shared object abstraction asmore » embodied in the Orca language. The second approach is the Split-C language. In Fortran M, Orca, and Split-C the ability of the programmer to control the granularity of communication is important is designing an efficient implementation.« less
Lagardère, Louis; Lipparini, Filippo; Polack, Étienne; Stamm, Benjamin; Cancès, Éric; Schnieders, Michael; Ren, Pengyu; Maday, Yvon; Piquemal, Jean-Philip
2014-02-28
In this paper, we present a scalable and efficient implementation of point dipole-based polarizable force fields for molecular dynamics (MD) simulations with periodic boundary conditions (PBC). The Smooth Particle-Mesh Ewald technique is combined with two optimal iterative strategies, namely, a preconditioned conjugate gradient solver and a Jacobi solver in conjunction with the Direct Inversion in the Iterative Subspace for convergence acceleration, to solve the polarization equations. We show that both solvers exhibit very good parallel performances and overall very competitive timings in an energy-force computation needed to perform a MD step. Various tests on large systems are provided in the context of the polarizable AMOEBA force field as implemented in the newly developed Tinker-HP package which is the first implementation for a polarizable model making large scale experiments for massively parallel PBC point dipole models possible. We show that using a large number of cores offers a significant acceleration of the overall process involving the iterative methods within the context of spme and a noticeable improvement of the memory management giving access to very large systems (hundreds of thousands of atoms) as the algorithm naturally distributes the data on different cores. Coupled with advanced MD techniques, gains ranging from 2 to 3 orders of magnitude in time are now possible compared to non-optimized, sequential implementations giving new directions for polarizable molecular dynamics in periodic boundary conditions using massively parallel implementations.
Lagardère, Louis; Lipparini, Filippo; Polack, Étienne; Stamm, Benjamin; Cancès, Éric; Schnieders, Michael; Ren, Pengyu; Maday, Yvon; Piquemal, Jean-Philip
2015-01-01
In this paper, we present a scalable and efficient implementation of point dipole-based polarizable force fields for molecular dynamics (MD) simulations with periodic boundary conditions (PBC). The Smooth Particle-Mesh Ewald technique is combined with two optimal iterative strategies, namely, a preconditioned conjugate gradient solver and a Jacobi solver in conjunction with the Direct Inversion in the Iterative Subspace for convergence acceleration, to solve the polarization equations. We show that both solvers exhibit very good parallel performances and overall very competitive timings in an energy-force computation needed to perform a MD step. Various tests on large systems are provided in the context of the polarizable AMOEBA force field as implemented in the newly developed Tinker-HP package which is the first implementation for a polarizable model making large scale experiments for massively parallel PBC point dipole models possible. We show that using a large number of cores offers a significant acceleration of the overall process involving the iterative methods within the context of spme and a noticeable improvement of the memory management giving access to very large systems (hundreds of thousands of atoms) as the algorithm naturally distributes the data on different cores. Coupled with advanced MD techniques, gains ranging from 2 to 3 orders of magnitude in time are now possible compared to non-optimized, sequential implementations giving new directions for polarizable molecular dynamics in periodic boundary conditions using massively parallel implementations. PMID:26512230
MOOSE: A parallel computational framework for coupled systems of nonlinear equations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Derek Gaston; Chris Newman; Glen Hansen
Systems of coupled, nonlinear partial differential equations (PDEs) often arise in simulation of nuclear processes. MOOSE: Multiphysics Object Oriented Simulation Environment, a parallel computational framework targeted at the solution of such systems, is presented. As opposed to traditional data-flow oriented computational frameworks, MOOSE is instead founded on the mathematical principle of Jacobian-free Newton-Krylov (JFNK) solution methods. Utilizing the mathematical structure present in JFNK, physics expressions are modularized into `Kernels,'' allowing for rapid production of new simulation tools. In addition, systems are solved implicitly and fully coupled, employing physics based preconditioning, which provides great flexibility even with large variance in timemore » scales. A summary of the mathematics, an overview of the structure of MOOSE, and several representative solutions from applications built on the framework are presented.« less
Early experiences in developing and managing the neuroscience gateway.
Sivagnanam, Subhashini; Majumdar, Amit; Yoshimoto, Kenneth; Astakhov, Vadim; Bandrowski, Anita; Martone, MaryAnn; Carnevale, Nicholas T
2015-02-01
The last few decades have seen the emergence of computational neuroscience as a mature field where researchers are interested in modeling complex and large neuronal systems and require access to high performance computing machines and associated cyber infrastructure to manage computational workflow and data. The neuronal simulation tools, used in this research field, are also implemented for parallel computers and suitable for high performance computing machines. But using these tools on complex high performance computing machines remains a challenge because of issues with acquiring computer time on these machines located at national supercomputer centers, dealing with complex user interface of these machines, dealing with data management and retrieval. The Neuroscience Gateway is being developed to alleviate and/or hide these barriers to entry for computational neuroscientists. It hides or eliminates, from the point of view of the users, all the administrative and technical barriers and makes parallel neuronal simulation tools easily available and accessible on complex high performance computing machines. It handles the running of jobs and data management and retrieval. This paper shares the early experiences in bringing up this gateway and describes the software architecture it is based on, how it is implemented, and how users can use this for computational neuroscience research using high performance computing at the back end. We also look at parallel scaling of some publicly available neuronal models and analyze the recent usage data of the neuroscience gateway.
Early experiences in developing and managing the neuroscience gateway
Sivagnanam, Subhashini; Majumdar, Amit; Yoshimoto, Kenneth; Astakhov, Vadim; Bandrowski, Anita; Martone, MaryAnn; Carnevale, Nicholas. T.
2015-01-01
SUMMARY The last few decades have seen the emergence of computational neuroscience as a mature field where researchers are interested in modeling complex and large neuronal systems and require access to high performance computing machines and associated cyber infrastructure to manage computational workflow and data. The neuronal simulation tools, used in this research field, are also implemented for parallel computers and suitable for high performance computing machines. But using these tools on complex high performance computing machines remains a challenge because of issues with acquiring computer time on these machines located at national supercomputer centers, dealing with complex user interface of these machines, dealing with data management and retrieval. The Neuroscience Gateway is being developed to alleviate and/or hide these barriers to entry for computational neuroscientists. It hides or eliminates, from the point of view of the users, all the administrative and technical barriers and makes parallel neuronal simulation tools easily available and accessible on complex high performance computing machines. It handles the running of jobs and data management and retrieval. This paper shares the early experiences in bringing up this gateway and describes the software architecture it is based on, how it is implemented, and how users can use this for computational neuroscience research using high performance computing at the back end. We also look at parallel scaling of some publicly available neuronal models and analyze the recent usage data of the neuroscience gateway. PMID:26523124
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shimojo, Fuyuki; Hattori, Shinnosuke; Department of Physics, Kumamoto University, Kumamoto 860-8555
We introduce an extension of the divide-and-conquer (DC) algorithmic paradigm called divide-conquer-recombine (DCR) to perform large quantum molecular dynamics (QMD) simulations on massively parallel supercomputers, in which interatomic forces are computed quantum mechanically in the framework of density functional theory (DFT). In DCR, the DC phase constructs globally informed, overlapping local-domain solutions, which in the recombine phase are synthesized into a global solution encompassing large spatiotemporal scales. For the DC phase, we design a lean divide-and-conquer (LDC) DFT algorithm, which significantly reduces the prefactor of the O(N) computational cost for N electrons by applying a density-adaptive boundary condition at themore » peripheries of the DC domains. Our globally scalable and locally efficient solver is based on a hybrid real-reciprocal space approach that combines: (1) a highly scalable real-space multigrid to represent the global charge density; and (2) a numerically efficient plane-wave basis for local electronic wave functions and charge density within each domain. Hybrid space-band decomposition is used to implement the LDC-DFT algorithm on parallel computers. A benchmark test on an IBM Blue Gene/Q computer exhibits an isogranular parallel efficiency of 0.984 on 786 432 cores for a 50.3 × 10{sup 6}-atom SiC system. As a test of production runs, LDC-DFT-based QMD simulation involving 16 661 atoms is performed on the Blue Gene/Q to study on-demand production of hydrogen gas from water using LiAl alloy particles. As an example of the recombine phase, LDC-DFT electronic structures are used as a basis set to describe global photoexcitation dynamics with nonadiabatic QMD (NAQMD) and kinetic Monte Carlo (KMC) methods. The NAQMD simulations are based on the linear response time-dependent density functional theory to describe electronic excited states and a surface-hopping approach to describe transitions between the excited states. A series of techniques are employed for efficiently calculating the long-range exact exchange correction and excited-state forces. The NAQMD trajectories are analyzed to extract the rates of various excitonic processes, which are then used in KMC simulation to study the dynamics of the global exciton flow network. This has allowed the study of large-scale photoexcitation dynamics in 6400-atom amorphous molecular solid, reaching the experimental time scales.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rajbhandari, Samyam; NIkam, Akshay; Lai, Pai-Wei
Tensor contractions represent the most compute-intensive core kernels in ab initio computational quantum chemistry and nuclear physics. Symmetries in these tensor contractions makes them difficult to load balance and scale to large distributed systems. In this paper, we develop an efficient and scalable algorithm to contract symmetric tensors. We introduce a novel approach that avoids data redistribution in contracting symmetric tensors while also avoiding redundant storage and maintaining load balance. We present experimental results on two parallel supercomputers for several symmetric contractions that appear in the CCSD quantum chemistry method. We also present a novel approach to tensor redistribution thatmore » can take advantage of parallel hyperplanes when the initial distribution has replicated dimensions, and use collective broadcast when the final distribution has replicated dimensions, making the algorithm very efficient.« less
Hypercube matrix computation task
NASA Technical Reports Server (NTRS)
Calalo, R.; Imbriale, W.; Liewer, P.; Lyons, J.; Manshadi, F.; Patterson, J.
1987-01-01
The Hypercube Matrix Computation (Year 1986-1987) task investigated the applicability of a parallel computing architecture to the solution of large scale electromagnetic scattering problems. Two existing electromagnetic scattering codes were selected for conversion to the Mark III Hypercube concurrent computing environment. They were selected so that the underlying numerical algorithms utilized would be different thereby providing a more thorough evaluation of the appropriateness of the parallel environment for these types of problems. The first code was a frequency domain method of moments solution, NEC-2, developed at Lawrence Livermore National Laboratory. The second code was a time domain finite difference solution of Maxwell's equations to solve for the scattered fields. Once the codes were implemented on the hypercube and verified to obtain correct solutions by comparing the results with those from sequential runs, several measures were used to evaluate the performance of the two codes. First, a comparison was provided of the problem size possible on the hypercube with 128 megabytes of memory for a 32-node configuration with that available in a typical sequential user environment of 4 to 8 megabytes. Then, the performance of the codes was anlyzed for the computational speedup attained by the parallel architecture.
NASA Technical Reports Server (NTRS)
Sobieszczanski-Sobieski, Jaroslaw
1998-01-01
The paper identifies speed, agility, human interface, generation of sensitivity information, task decomposition, and data transmission (including storage) as important attributes for a computer environment to have in order to support engineering design effectively. It is argued that when examined in terms of these attributes the presently available environment can be shown to be inadequate a radical improvement is needed, and it may be achieved by combining new methods that have recently emerged from multidisciplinary design optimization (MDO) with massively parallel processing computer technology. The caveat is that, for successful use of that technology in engineering computing, new paradigms for computing will have to be developed - specifically, innovative algorithms that are intrinsically parallel so that their performance scales up linearly with the number of processors. It may be speculated that the idea of simulating a complex behavior by interaction of a large number of very simple models may be an inspiration for the above algorithms, the cellular automata are an example. Because of the long lead time needed to develop and mature new paradigms, development should be now, even though the widespread availability of massively parallel processing is still a few years away.
Anandakrishnan, Ramu; Scogland, Tom R. W.; Fenley, Andrew T.; Gordon, John C.; Feng, Wu-chun; Onufriev, Alexey V.
2010-01-01
Tools that compute and visualize biomolecular electrostatic surface potential have been used extensively for studying biomolecular function. However, determining the surface potential for large biomolecules on a typical desktop computer can take days or longer using currently available tools and methods. Two commonly used techniques to speed up these types of electrostatic computations are approximations based on multi-scale coarse-graining and parallelization across multiple processors. This paper demonstrates that for the computation of electrostatic surface potential, these two techniques can be combined to deliver significantly greater speed-up than either one separately, something that is in general not always possible. Specifically, the electrostatic potential computation, using an analytical linearized Poisson Boltzmann (ALPB) method, is approximated using the hierarchical charge partitioning (HCP) multiscale method, and parallelized on an ATI Radeon 4870 graphical processing unit (GPU). The implementation delivers a combined 934-fold speed-up for a 476,040 atom viral capsid, compared to an equivalent non-parallel implementation on an Intel E6550 CPU without the approximation. This speed-up is significantly greater than the 42-fold speed-up for the HCP approximation alone or the 182-fold speed-up for the GPU alone. PMID:20452792
NASA Technical Reports Server (NTRS)
Sobieszczanski-Sobieski, Jaroslaw
1999-01-01
The paper identifies speed, agility, human interface, generation of sensitivity information, task decomposition, and data transmission (including storage) as important attributes for a computer environment to have in order to support engineering design effectively. It is argued that when examined in terms of these attributes the presently available environment can be shown to be inadequate. A radical improvement is needed, and it may be achieved by combining new methods that have recently emerged from multidisciplinary design optimisation (MDO) with massively parallel processing computer technology. The caveat is that, for successful use of that technology in engineering computing, new paradigms for computing will have to be developed - specifically, innovative algorithms that are intrinsically parallel so that their performance scales up linearly with the number of processors. It may be speculated that the idea of simulating a complex behaviour by interaction of a large number of very simple models may be an inspiration for the above algorithms; the cellular automata are an example. Because of the long lead time needed to develop and mature new paradigms, development should begin now, even though the widespread availability of massively parallel processing is still a few years away.
Craciun, Stefan; Brockmeier, Austin J; George, Alan D; Lam, Herman; Príncipe, José C
2011-01-01
Methods for decoding movements from neural spike counts using adaptive filters often rely on minimizing the mean-squared error. However, for non-Gaussian distribution of errors, this approach is not optimal for performance. Therefore, rather than using probabilistic modeling, we propose an alternate non-parametric approach. In order to extract more structure from the input signal (neuronal spike counts) we propose using minimum error entropy (MEE), an information-theoretic approach that minimizes the error entropy as part of an iterative cost function. However, the disadvantage of using MEE as the cost function for adaptive filters is the increase in computational complexity. In this paper we present a comparison between the decoding performance of the analytic Wiener filter and a linear filter trained with MEE, which is then mapped to a parallel architecture in reconfigurable hardware tailored to the computational needs of the MEE filter. We observe considerable speedup from the hardware design. The adaptation of filter weights for the multiple-input, multiple-output linear filters, necessary in motor decoding, is a highly parallelizable algorithm. It can be decomposed into many independent computational blocks with a parallel architecture readily mapped to a field-programmable gate array (FPGA) and scales to large numbers of neurons. By pipelining and parallelizing independent computations in the algorithm, the proposed parallel architecture has sublinear increases in execution time with respect to both window size and filter order.
Performance of the Heavy Flavor Tracker (HFT) detector in star experiment at RHIC
NASA Astrophysics Data System (ADS)
Alruwaili, Manal
With the growing technology, the number of the processors is becoming massive. Current supercomputer processing will be available on desktops in the next decade. For mass scale application software development on massive parallel computing available on desktops, existing popular languages with large libraries have to be augmented with new constructs and paradigms that exploit massive parallel computing and distributed memory models while retaining the user-friendliness. Currently, available object oriented languages for massive parallel computing such as Chapel, X10 and UPC++ exploit distributed computing, data parallel computing and thread-parallelism at the process level in the PGAS (Partitioned Global Address Space) memory model. However, they do not incorporate: 1) any extension at for object distribution to exploit PGAS model; 2) the programs lack the flexibility of migrating or cloning an object between places to exploit load balancing; and 3) lack the programming paradigms that will result from the integration of data and thread-level parallelism and object distribution. In the proposed thesis, I compare different languages in PGAS model; propose new constructs that extend C++ with object distribution and object migration; and integrate PGAS based process constructs with these extensions on distributed objects. Object cloning and object migration. Also a new paradigm MIDD (Multiple Invocation Distributed Data) is presented when different copies of the same class can be invoked, and work on different elements of a distributed data concurrently using remote method invocations. I present new constructs, their grammar and their behavior. The new constructs have been explained using simple programs utilizing these constructs.
NASA Technical Reports Server (NTRS)
Kramer, Williams T. C.; Simon, Horst D.
1994-01-01
This tutorial proposes to be a practical guide for the uninitiated to the main topics and themes of high-performance computing (HPC), with particular emphasis to distributed computing. The intent is first to provide some guidance and directions in the rapidly increasing field of scientific computing using both massively parallel and traditional supercomputers. Because of their considerable potential computational power, loosely or tightly coupled clusters of workstations are increasingly considered as a third alternative to both the more conventional supercomputers based on a small number of powerful vector processors, as well as high massively parallel processors. Even though many research issues concerning the effective use of workstation clusters and their integration into a large scale production facility are still unresolved, such clusters are already used for production computing. In this tutorial we will utilize the unique experience made at the NAS facility at NASA Ames Research Center. Over the last five years at NAS massively parallel supercomputers such as the Connection Machines CM-2 and CM-5 from Thinking Machines Corporation and the iPSC/860 (Touchstone Gamma Machine) and Paragon Machines from Intel were used in a production supercomputer center alongside with traditional vector supercomputers such as the Cray Y-MP and C90.
Oak Ridge Leadership Computing Facility Position Paper
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oral, H Sarp; Hill, Jason J; Thach, Kevin G
This paper discusses the business, administration, reliability, and usability aspects of storage systems at the Oak Ridge Leadership Computing Facility (OLCF). The OLCF has developed key competencies in architecting and administration of large-scale Lustre deployments as well as HPSS archival systems. Additionally as these systems are architected, deployed, and expanded over time reliability and availability factors are a primary driver. This paper focuses on the implementation of the Spider parallel Lustre file system as well as the implementation of the HPSS archive at the OLCF.
NASA Astrophysics Data System (ADS)
Ji, X.; Shen, C.
2017-12-01
Flood inundation presents substantial societal hazards and also changes biogeochemistry for systems like the Amazon. It is often expensive to simulate high-resolution flood inundation and propagation in a long-term watershed-scale model. Due to the Courant-Friedrichs-Lewy (CFL) restriction, high resolution and large local flow velocity both demand prohibitively small time steps even for parallel codes. Here we develop a parallel surface-subsurface process-based model enhanced by multi-resolution meshes that are adaptively switched on or off. The high-resolution overland flow meshes are enabled only when the flood wave invades to floodplains. This model applies semi-implicit, semi-Lagrangian (SISL) scheme in solving dynamic wave equations, and with the assistant of the multi-mesh method, it also adaptively chooses the dynamic wave equation only in the area of deep inundation. Therefore, the model achieves a balance between accuracy and computational cost.
Large-Scale Computation of Nuclear Magnetic Resonance Shifts for Paramagnetic Solids Using CP2K.
Mondal, Arobendo; Gaultois, Michael W; Pell, Andrew J; Iannuzzi, Marcella; Grey, Clare P; Hutter, Jürg; Kaupp, Martin
2018-01-09
Large-scale computations of nuclear magnetic resonance (NMR) shifts for extended paramagnetic solids (pNMR) are reported using the highly efficient Gaussian-augmented plane-wave implementation of the CP2K code. Combining hyperfine couplings obtained with hybrid functionals with g-tensors and orbital shieldings computed using gradient-corrected functionals, contact, pseudocontact, and orbital-shift contributions to pNMR shifts are accessible. Due to the efficient and highly parallel performance of CP2K, a wide variety of materials with large unit cells can be studied with extended Gaussian basis sets. Validation of various approaches for the different contributions to pNMR shifts is done first for molecules in a large supercell in comparison with typical quantum-chemical codes. This is then extended to a detailed study of g-tensors for extended solid transition-metal fluorides and for a series of complex lithium vanadium phosphates. Finally, lithium pNMR shifts are computed for Li 3 V 2 (PO 4 ) 3 , for which detailed experimental data are available. This has allowed an in-depth study of different approaches (e.g., full periodic versus incremental cluster computations of g-tensors and different functionals and basis sets for hyperfine computations) as well as a thorough analysis of the different contributions to the pNMR shifts. This study paves the way for a more-widespread computational treatment of NMR shifts for paramagnetic materials.
Parallel/distributed direct method for solving linear systems
NASA Technical Reports Server (NTRS)
Lin, Avi
1990-01-01
A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit a near optimal performance and enjoy several important features: (1) For large enough linear systems, the design of the appropriate paralleled algorithm is insensitive to the number of processors as its performance grows monotonically with them; (2) It is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) It can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) This set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmical changes.
Graf, Daniel; Beuerle, Matthias; Schurkus, Henry F; Luenser, Arne; Savasci, Gökcen; Ochsenfeld, Christian
2018-05-08
An efficient algorithm for calculating the random phase approximation (RPA) correlation energy is presented that is as accurate as the canonical molecular orbital resolution-of-the-identity RPA (RI-RPA) with the important advantage of an effective linear-scaling behavior (instead of quartic) for large systems due to a formulation in the local atomic orbital space. The high accuracy is achieved by utilizing optimized minimax integration schemes and the local Coulomb metric attenuated by the complementary error function for the RI approximation. The memory bottleneck of former atomic orbital (AO)-RI-RPA implementations ( Schurkus, H. F.; Ochsenfeld, C. J. Chem. Phys. 2016 , 144 , 031101 and Luenser, A.; Schurkus, H. F.; Ochsenfeld, C. J. Chem. Theory Comput. 2017 , 13 , 1647 - 1655 ) is addressed by precontraction of the large 3-center integral matrix with the Cholesky factors of the ground state density reducing the memory requirements of that matrix by a factor of [Formula: see text]. Furthermore, we present a parallel implementation of our method, which not only leads to faster RPA correlation energy calculations but also to a scalable decrease in memory requirements, opening the door for investigations of large molecules even on small- to medium-sized computing clusters. Although it is known that AO methods are highly efficient for extended systems, where sparsity allows for reaching the linear-scaling regime, we show that our work also extends the applicability when considering highly delocalized systems for which no linear scaling can be achieved. As an example, the interlayer distance of two covalent organic framework pore fragments (comprising 384 atoms in total) is analyzed.
Hypercube matrix computation task
NASA Technical Reports Server (NTRS)
Calalo, Ruel H.; Imbriale, William A.; Jacobi, Nathan; Liewer, Paulett C.; Lockhart, Thomas G.; Lyzenga, Gregory A.; Lyons, James R.; Manshadi, Farzin; Patterson, Jean E.
1988-01-01
A major objective of the Hypercube Matrix Computation effort at the Jet Propulsion Laboratory (JPL) is to investigate the applicability of a parallel computing architecture to the solution of large-scale electromagnetic scattering problems. Three scattering analysis codes are being implemented and assessed on a JPL/California Institute of Technology (Caltech) Mark 3 Hypercube. The codes, which utilize different underlying algorithms, give a means of evaluating the general applicability of this parallel architecture. The three analysis codes being implemented are a frequency domain method of moments code, a time domain finite difference code, and a frequency domain finite elements code. These analysis capabilities are being integrated into an electromagnetics interactive analysis workstation which can serve as a design tool for the construction of antennas and other radiating or scattering structures. The first two years of work on the Hypercube Matrix Computation effort is summarized. It includes both new developments and results as well as work previously reported in the Hypercube Matrix Computation Task: Final Report for 1986 to 1987 (JPL Publication 87-18).
Henriques, David; González, Patricia; Doallo, Ramón; Saez-Rodriguez, Julio; Banga, Julio R.
2017-01-01
Background We consider a general class of global optimization problems dealing with nonlinear dynamic models. Although this class is relevant to many areas of science and engineering, here we are interested in applying this framework to the reverse engineering problem in computational systems biology, which yields very large mixed-integer dynamic optimization (MIDO) problems. In particular, we consider the framework of logic-based ordinary differential equations (ODEs). Methods We present saCeSS2, a parallel method for the solution of this class of problems. This method is based on an parallel cooperative scatter search metaheuristic, with new mechanisms of self-adaptation and specific extensions to handle large mixed-integer problems. We have paid special attention to the avoidance of convergence stagnation using adaptive cooperation strategies tailored to this class of problems. Results We illustrate its performance with a set of three very challenging case studies from the domain of dynamic modelling of cell signaling. The simpler case study considers a synthetic signaling pathway and has 84 continuous and 34 binary decision variables. A second case study considers the dynamic modeling of signaling in liver cancer using high-throughput data, and has 135 continuous and 109 binaries decision variables. The third case study is an extremely difficult problem related with breast cancer, involving 690 continuous and 138 binary decision variables. We report computational results obtained in different infrastructures, including a local cluster, a large supercomputer and a public cloud platform. Interestingly, the results show how the cooperation of individual parallel searches modifies the systemic properties of the sequential algorithm, achieving superlinear speedups compared to an individual search (e.g. speedups of 15 with 10 cores), and significantly improving (above a 60%) the performance with respect to a non-cooperative parallel scheme. The scalability of the method is also good (tests were performed using up to 300 cores). Conclusions These results demonstrate that saCeSS2 can be used to successfully reverse engineer large dynamic models of complex biological pathways. Further, these results open up new possibilities for other MIDO-based large-scale applications in the life sciences such as metabolic engineering, synthetic biology, drug scheduling. PMID:28813442
Penas, David R; Henriques, David; González, Patricia; Doallo, Ramón; Saez-Rodriguez, Julio; Banga, Julio R
2017-01-01
We consider a general class of global optimization problems dealing with nonlinear dynamic models. Although this class is relevant to many areas of science and engineering, here we are interested in applying this framework to the reverse engineering problem in computational systems biology, which yields very large mixed-integer dynamic optimization (MIDO) problems. In particular, we consider the framework of logic-based ordinary differential equations (ODEs). We present saCeSS2, a parallel method for the solution of this class of problems. This method is based on an parallel cooperative scatter search metaheuristic, with new mechanisms of self-adaptation and specific extensions to handle large mixed-integer problems. We have paid special attention to the avoidance of convergence stagnation using adaptive cooperation strategies tailored to this class of problems. We illustrate its performance with a set of three very challenging case studies from the domain of dynamic modelling of cell signaling. The simpler case study considers a synthetic signaling pathway and has 84 continuous and 34 binary decision variables. A second case study considers the dynamic modeling of signaling in liver cancer using high-throughput data, and has 135 continuous and 109 binaries decision variables. The third case study is an extremely difficult problem related with breast cancer, involving 690 continuous and 138 binary decision variables. We report computational results obtained in different infrastructures, including a local cluster, a large supercomputer and a public cloud platform. Interestingly, the results show how the cooperation of individual parallel searches modifies the systemic properties of the sequential algorithm, achieving superlinear speedups compared to an individual search (e.g. speedups of 15 with 10 cores), and significantly improving (above a 60%) the performance with respect to a non-cooperative parallel scheme. The scalability of the method is also good (tests were performed using up to 300 cores). These results demonstrate that saCeSS2 can be used to successfully reverse engineer large dynamic models of complex biological pathways. Further, these results open up new possibilities for other MIDO-based large-scale applications in the life sciences such as metabolic engineering, synthetic biology, drug scheduling.
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.
1990-01-01
Practical engineering application can often be formulated in the form of a constrained optimization problem. There are several solution algorithms for solving a constrained optimization problem. One approach is to convert a constrained problem into a series of unconstrained problems. Furthermore, unconstrained solution algorithms can be used as part of the constrained solution algorithms. Structural optimization is an iterative process where one starts with an initial design, a finite element structure analysis is then performed to calculate the response of the system (such as displacements, stresses, eigenvalues, etc.). Based upon the sensitivity information on the objective and constraint functions, an optimizer such as ADS or IDESIGN, can be used to find the new, improved design. For the structural analysis phase, the equation solver for the system of simultaneous, linear equations plays a key role since it is needed for either static, or eigenvalue, or dynamic analysis. For practical, large-scale structural analysis-synthesis applications, computational time can be excessively large. Thus, it is necessary to have a new structural analysis-synthesis code which employs new solution algorithms to exploit both parallel and vector capabilities offered by modern, high performance computers such as the Convex, Cray-2 and Cray-YMP computers. The objective of this research project is, therefore, to incorporate the latest development in the parallel-vector equation solver, PVSOLVE into the widely popular finite-element production code, such as the SAP-4. Furthermore, several nonlinear unconstrained optimization subroutines have also been developed and tested under a parallel computer environment. The unconstrained optimization subroutines are not only useful in their own right, but they can also be incorporated into a more popular constrained optimization code, such as ADS.
Large Scale GW Calculations on the Cori System
NASA Astrophysics Data System (ADS)
Deslippe, Jack; Del Ben, Mauro; da Jornada, Felipe; Canning, Andrew; Louie, Steven
The NERSC Cori system, powered by 9000+ Intel Xeon-Phi processors, represents one of the largest HPC systems for open-science in the United States and the world. We discuss the optimization of the GW methodology for this system, including both node level and system-scale optimizations. We highlight multiple large scale (thousands of atoms) case studies and discuss both absolute application performance and comparison to calculations on more traditional HPC architectures. We find that the GW method is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism across many layers of the system. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, as part of the Computational Materials Sciences Program.
GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit
Pronk, Sander; Páll, Szilárd; Schulz, Roland; Larsson, Per; Bjelkmar, Pär; Apostolov, Rossen; Shirts, Michael R.; Smith, Jeremy C.; Kasson, Peter M.; van der Spoel, David; Hess, Berk; Lindahl, Erik
2013-01-01
Motivation: Molecular simulation has historically been a low-throughput technique, but faster computers and increasing amounts of genomic and structural data are changing this by enabling large-scale automated simulation of, for instance, many conformers or mutants of biomolecules with or without a range of ligands. At the same time, advances in performance and scaling now make it possible to model complex biomolecular interaction and function in a manner directly testable by experiment. These applications share a need for fast and efficient software that can be deployed on massive scale in clusters, web servers, distributed computing or cloud resources. Results: Here, we present a range of new simulation algorithms and features developed during the past 4 years, leading up to the GROMACS 4.5 software package. The software now automatically handles wide classes of biomolecules, such as proteins, nucleic acids and lipids, and comes with all commonly used force fields for these molecules built-in. GROMACS supports several implicit solvent models, as well as new free-energy algorithms, and the software now uses multithreading for efficient parallelization even on low-end systems, including windows-based workstations. Together with hand-tuned assembly kernels and state-of-the-art parallelization, this provides extremely high performance and cost efficiency for high-throughput as well as massively parallel simulations. Availability: GROMACS is an open source and free software available from http://www.gromacs.org. Contact: erik.lindahl@scilifelab.se Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23407358
Cloud-based large-scale air traffic flow optimization
NASA Astrophysics Data System (ADS)
Cao, Yi
The ever-increasing traffic demand makes the efficient use of airspace an imperative mission, and this paper presents an effort in response to this call. Firstly, a new aggregate model, called Link Transmission Model (LTM), is proposed, which models the nationwide traffic as a network of flight routes identified by origin-destination pairs. The traversal time of a flight route is assumed to be the mode of distribution of historical flight records, and the mode is estimated by using Kernel Density Estimation. As this simplification abstracts away physical trajectory details, the complexity of modeling is drastically decreased, resulting in efficient traffic forecasting. The predicative capability of LTM is validated against recorded traffic data. Secondly, a nationwide traffic flow optimization problem with airport and en route capacity constraints is formulated based on LTM. The optimization problem aims at alleviating traffic congestions with minimal global delays. This problem is intractable due to millions of variables. A dual decomposition method is applied to decompose the large-scale problem such that the subproblems are solvable. However, the whole problem is still computational expensive to solve since each subproblem is an smaller integer programming problem that pursues integer solutions. Solving an integer programing problem is known to be far more time-consuming than solving its linear relaxation. In addition, sequential execution on a standalone computer leads to linear runtime increase when the problem size increases. To address the computational efficiency problem, a parallel computing framework is designed which accommodates concurrent executions via multithreading programming. The multithreaded version is compared with its monolithic version to show decreased runtime. Finally, an open-source cloud computing framework, Hadoop MapReduce, is employed for better scalability and reliability. This framework is an "off-the-shelf" parallel computing model that can be used for both offline historical traffic data analysis and online traffic flow optimization. It provides an efficient and robust platform for easy deployment and implementation. A small cloud consisting of five workstations was configured and used to demonstrate the advantages of cloud computing in dealing with large-scale parallelizable traffic problems.
Java Performance for Scientific Applications on LLNL Computer Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kapfer, C; Wissink, A
2002-05-10
Languages in use for high performance computing at the laboratory--Fortran (f77 and f90), C, and C++--have many years of development behind them and are generally considered the fastest available. However, Fortran and C do not readily extend to object-oriented programming models, limiting their capability for very complex simulation software. C++ facilitates object-oriented programming but is a very complex and error-prone language. Java offers a number of capabilities that these other languages do not. For instance it implements cleaner (i.e., easier to use and less prone to errors) object-oriented models than C++. It also offers networking and security as part ofmore » the language standard, and cross-platform executables that make it architecture neutral, to name a few. These features have made Java very popular for industrial computing applications. The aim of this paper is to explain the trade-offs in using Java for large-scale scientific applications at LLNL. Despite its advantages, the computational science community has been reluctant to write large-scale computationally intensive applications in Java due to concerns over its poor performance. However, considerable progress has been made over the last several years. The Java Grande Forum [1] has been promoting the use of Java for large-scale computing. Members have introduced efficient array libraries, developed fast just-in-time (JIT) compilers, and built links to existing packages used in high performance parallel computing.« less
Software environment for implementing engineering applications on MIMD computers
NASA Technical Reports Server (NTRS)
Lopez, L. A.; Valimohamed, K. A.; Schiff, S.
1990-01-01
In this paper the concept for a software environment for developing engineering application systems for multiprocessor hardware (MIMD) is presented. The philosophy employed is to solve the largest problems possible in a reasonable amount of time, rather than solve existing problems faster. In the proposed environment most of the problems concerning parallel computation and handling of large distributed data spaces are hidden from the application program developer, thereby facilitating the development of large-scale software applications. Applications developed under the environment can be executed on a variety of MIMD hardware; it protects the application software from the effects of a rapidly changing MIMD hardware technology.
Parallel Scaling Characteristics of Selected NERSC User ProjectCodes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Skinner, David; Verdier, Francesca; Anand, Harsh
This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems.more » An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.« less
TomoMiner and TomoMinerCloud: A software platform for large-scale subtomogram structural analysis
Frazier, Zachary; Xu, Min; Alber, Frank
2017-01-01
SUMMARY Cryo-electron tomography (cryoET) captures the 3D electron density distribution of macromolecular complexes in close to native state. With the rapid advance of cryoET acquisition technologies, it is possible to generate large numbers (>100,000) of subtomograms, each containing a macromolecular complex. Often, these subtomograms represent a heterogeneous sample due to variations in structure and composition of a complex in situ form or because particles are a mixture of different complexes. In this case subtomograms must be classified. However, classification of large numbers of subtomograms is a time-intensive task and often a limiting bottleneck. This paper introduces an open source software platform, TomoMiner, for large-scale subtomogram classification, template matching, subtomogram averaging, and alignment. Its scalable and robust parallel processing allows efficient classification of tens to hundreds of thousands of subtomograms. Additionally, TomoMiner provides a pre-configured TomoMinerCloud computing service permitting users without sufficient computing resources instant access to TomoMiners high-performance features. PMID:28552576
GPU accelerated FDTD solver and its application in MRI.
Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S
2010-01-01
The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B(1) profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of electromagnetic field compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.
Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM
NASA Astrophysics Data System (ADS)
de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl
2002-03-01
We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 105-106) in reasonable CPU times. The performances of our implementation of the pa! rallel code on an Origin 2000 supercomputer are presented and discussed. They exhibit very good speedup behavior and low load unbalancing. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.
Implementing Parquet equations using HPX
NASA Astrophysics Data System (ADS)
Kellar, Samuel; Wagle, Bibek; Yang, Shuxiang; Tam, Ka-Ming; Kaiser, Hartmut; Moreno, Juana; Jarrell, Mark
A new C++ runtime system (HPX) enables simulations of complex systems to run more efficiently on parallel and heterogeneous systems. This increased efficiency allows for solutions to larger simulations of the parquet approximation for a system with impurities. The relevancy of the parquet equations depends upon the ability to solve systems which require long runs and large amounts of memory. These limitations, in addition to numerical complications arising from stability of the solutions, necessitate running on large distributed systems. As the computational resources trend towards the exascale and the limitations arising from computational resources vanish efficiency of large scale simulations becomes a focus. HPX facilitates efficient simulations through intelligent overlapping of computation and communication. Simulations such as the parquet equations which require the transfer of large amounts of data should benefit from HPX implementations. Supported by the the NSF EPSCoR Cooperative Agreement No. EPS-1003897 with additional support from the Louisiana Board of Regents.
Challenges in scaling NLO generators to leadership computers
NASA Astrophysics Data System (ADS)
Benjamin, D.; Childers, JT; Hoeche, S.; LeCompte, T.; Uram, T.
2017-10-01
Exascale computing resources are roughly a decade away and will be capable of 100 times more computing than current supercomputers. In the last year, Energy Frontier experiments crossed a milestone of 100 million core-hours used at the Argonne Leadership Computing Facility, Oak Ridge Leadership Computing Facility, and NERSC. The Fortran-based leading-order parton generator called Alpgen was successfully scaled to millions of threads to achieve this level of usage on Mira. Sherpa and MadGraph are next-to-leading order generators used heavily by LHC experiments for simulation. Integration times for high-multiplicity or rare processes can take a week or more on standard Grid machines, even using all 16-cores. We will describe our ongoing work to scale the Sherpa generator to thousands of threads on leadership-class machines and reduce run-times to less than a day. This work allows the experiments to leverage large-scale parallel supercomputers for event generation today, freeing tens of millions of grid hours for other work, and paving the way for future applications (simulation, reconstruction) on these and future supercomputers.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Newman, G.A.; Commer, M.
Three-dimensional (3D) geophysical imaging is now receiving considerable attention for electrical conductivity mapping of potential offshore oil and gas reservoirs. The imaging technology employs controlled source electromagnetic (CSEM) and magnetotelluric (MT) fields and treats geological media exhibiting transverse anisotropy. Moreover when combined with established seismic methods, direct imaging of reservoir fluids is possible. Because of the size of the 3D conductivity imaging problem, strategies are required exploiting computational parallelism and optimal meshing. The algorithm thus developed has been shown to scale to tens of thousands of processors. In one imaging experiment, 32,768 tasks/processors on the IBM Watson Research Blue Gene/Lmore » supercomputer were successfully utilized. Over a 24 hour period we were able to image a large scale field data set that previously required over four months of processing time on distributed clusters based on Intel or AMD processors utilizing 1024 tasks on an InfiniBand fabric. Electrical conductivity imaging using massively parallel computational resources produces results that cannot be obtained otherwise and are consistent with timeframes required for practical exploration problems.« less
Extraction of drainage networks from large terrain datasets using high throughput computing
NASA Astrophysics Data System (ADS)
Gong, Jianya; Xie, Jibo
2009-02-01
Advanced digital photogrammetry and remote sensing technology produces large terrain datasets (LTD). How to process and use these LTD has become a big challenge for GIS users. Extracting drainage networks, which are basic for hydrological applications, from LTD is one of the typical applications of digital terrain analysis (DTA) in geographical information applications. Existing serial drainage algorithms cannot deal with large data volumes in a timely fashion, and few GIS platforms can process LTD beyond the GB size. High throughput computing (HTC), a distributed parallel computing mode, is proposed to improve the efficiency of drainage networks extraction from LTD. Drainage network extraction using HTC involves two key issues: (1) how to decompose the large DEM datasets into independent computing units and (2) how to merge the separate outputs into a final result. A new decomposition method is presented in which the large datasets are partitioned into independent computing units using natural watershed boundaries instead of using regular 1-dimensional (strip-wise) and 2-dimensional (block-wise) decomposition. Because the distribution of drainage networks is strongly related to watershed boundaries, the new decomposition method is more effective and natural. The method to extract natural watershed boundaries was improved by using multi-scale DEMs instead of single-scale DEMs. A HTC environment is employed to test the proposed methods with real datasets.
Progress in fast, accurate multi-scale climate simulations
Collins, W. D.; Johansen, H.; Evans, K. J.; ...
2015-06-01
We present a survey of physical and computational techniques that have the potential to contribute to the next generation of high-fidelity, multi-scale climate simulations. Examples of the climate science problems that can be investigated with more depth with these computational improvements include the capture of remote forcings of localized hydrological extreme events, an accurate representation of cloud features over a range of spatial and temporal scales, and parallel, large ensembles of simulations to more effectively explore model sensitivities and uncertainties. Numerical techniques, such as adaptive mesh refinement, implicit time integration, and separate treatment of fast physical time scales are enablingmore » improved accuracy and fidelity in simulation of dynamics and allowing more complete representations of climate features at the global scale. At the same time, partnerships with computer science teams have focused on taking advantage of evolving computer architectures such as many-core processors and GPUs. As a result, approaches which were previously considered prohibitively costly have become both more efficient and scalable. In combination, progress in these three critical areas is poised to transform climate modeling in the coming decades.« less
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Shuangshuang; Chen, Yousu; Wu, Di
2015-12-09
Power system dynamic simulation computes the system response to a sequence of large disturbance, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computational intensive and challenging to solve using single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on shared-memory platform, and Messagemore » Passing Interface (MPI) on distributed-memory clusters, respectively. The difference of the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.« less
Visual Analysis of Cloud Computing Performance Using Behavioral Lines.
Muelder, Chris; Zhu, Biao; Chen, Wei; Zhang, Hongxin; Ma, Kwan-Liu
2016-02-29
Cloud computing is an essential technology to Big Data analytics and services. A cloud computing system is often comprised of a large number of parallel computing and storage devices. Monitoring the usage and performance of such a system is important for efficient operations, maintenance, and security. Tracing every application on a large cloud system is untenable due to scale and privacy issues. But profile data can be collected relatively efficiently by regularly sampling the state of the system, including properties such as CPU load, memory usage, network usage, and others, creating a set of multivariate time series for each system. Adequate tools for studying such large-scale, multidimensional data are lacking. In this paper, we present a visual based analysis approach to understanding and analyzing the performance and behavior of cloud computing systems. Our design is based on similarity measures and a layout method to portray the behavior of each compute node over time. When visualizing a large number of behavioral lines together, distinct patterns often appear suggesting particular types of performance bottleneck. The resulting system provides multiple linked views, which allow the user to interactively explore the data by examining the data or a selected subset at different levels of detail. Our case studies, which use datasets collected from two different cloud systems, show that this visual based approach is effective in identifying trends and anomalies of the systems.
Adaptive implicit-explicit and parallel element-by-element iteration schemes
NASA Technical Reports Server (NTRS)
Tezduyar, T. E.; Liou, J.; Nguyen, T.; Poole, S.
1989-01-01
Adaptive implicit-explicit (AIE) and grouped element-by-element (GEBE) iteration schemes are presented for the finite element solution of large-scale problems in computational mechanics and physics. The AIE approach is based on the dynamic arrangement of the elements into differently treated groups. The GEBE procedure, which is a way of rewriting the EBE formulation to make its parallel processing potential and implementation more clear, is based on the static arrangement of the elements into groups with no inter-element coupling within each group. Various numerical tests performed demonstrate the savings in the CPU time and memory.
10-Qubit Entanglement and Parallel Logic Operations with a Superconducting Circuit
NASA Astrophysics Data System (ADS)
Song, Chao; Xu, Kai; Liu, Wuxin; Yang, Chui-ping; Zheng, Shi-Biao; Deng, Hui; Xie, Qiwei; Huang, Keqiang; Guo, Qiujiang; Zhang, Libo; Zhang, Pengfei; Xu, Da; Zheng, Dongning; Zhu, Xiaobo; Wang, H.; Chen, Y.-A.; Lu, C.-Y.; Han, Siyuan; Pan, Jian-Wei
2017-11-01
Here we report on the production and tomography of genuinely entangled Greenberger-Horne-Zeilinger states with up to ten qubits connecting to a bus resonator in a superconducting circuit, where the resonator-mediated qubit-qubit interactions are used to controllably entangle multiple qubits and to operate on different pairs of qubits in parallel. The resulting 10-qubit density matrix is probed by quantum state tomography, with a fidelity of 0.668 ±0.025 . Our results demonstrate the largest entanglement created so far in solid-state architectures and pave the way to large-scale quantum computation.
10-Qubit Entanglement and Parallel Logic Operations with a Superconducting Circuit.
Song, Chao; Xu, Kai; Liu, Wuxin; Yang, Chui-Ping; Zheng, Shi-Biao; Deng, Hui; Xie, Qiwei; Huang, Keqiang; Guo, Qiujiang; Zhang, Libo; Zhang, Pengfei; Xu, Da; Zheng, Dongning; Zhu, Xiaobo; Wang, H; Chen, Y-A; Lu, C-Y; Han, Siyuan; Pan, Jian-Wei
2017-11-03
Here we report on the production and tomography of genuinely entangled Greenberger-Horne-Zeilinger states with up to ten qubits connecting to a bus resonator in a superconducting circuit, where the resonator-mediated qubit-qubit interactions are used to controllably entangle multiple qubits and to operate on different pairs of qubits in parallel. The resulting 10-qubit density matrix is probed by quantum state tomography, with a fidelity of 0.668±0.025. Our results demonstrate the largest entanglement created so far in solid-state architectures and pave the way to large-scale quantum computation.
2015-08-01
Atomic/Molecular Massively Parallel Simulator ( LAMMPS ) Software by N Scott Weingarten and James P Larentzos Approved for...Massively Parallel Simulator ( LAMMPS ) Software by N Scott Weingarten Weapons and Materials Research Directorate, ARL James P Larentzos Engility...Shifted Periodic Boundary Conditions in the Large-Scale Atomic/Molecular Massively Parallel Simulator ( LAMMPS ) Software 5a. CONTRACT NUMBER 5b
Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube
NASA Technical Reports Server (NTRS)
Joslin, Ronald D.; Zubair, Mohammad
1993-01-01
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors nearly ideal linear speedups are achieved with nonoptimized routines; slower than linear speedups are achieved with optimized (machine dependent library) routines. This slower than linear speedup results because the Fast Fourier Transform (FFT) routine dominates the computational cost and because the routine indicates less than ideal speedups. However with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise wall-normal and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single processor time to complete a comparable simulation; however it is estimated that a subgrid-scale model which reduces the required number of grid points and becomes a large-eddy simulation (PSLES) would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.
A highly efficient multi-core algorithm for clustering extremely large datasets
2010-01-01
Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
Integrating the Apache Big Data Stack with HPC for Big Data
NASA Astrophysics Data System (ADS)
Fox, G. C.; Qiu, J.; Jha, S.
2014-12-01
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However, the same is not so true for data intensive computing, even though commercially clouds devote much more resources to data analytics than supercomputers devote to simulations. We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures. We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks and use these to identify a few key classes of hardware/software architectures. Our analysis builds on combining HPC and ABDS the Apache big data software stack that is well used in modern cloud computing. Initial results on clouds and HPC systems are encouraging. We propose the development of SPIDAL - Scalable Parallel Interoperable Data Analytics Library -- built on system aand data abstractions suggested by the HPC-ABDS architecture. We discuss how it can be used in several application areas including Polar Science.
NASA Technical Reports Server (NTRS)
Banks, Daniel W.; Laflin, Brenda E. Gile; Kemmerly, Guy T.; Campbell, Bryan A.
1999-01-01
The paper identifies speed, agility, human interface, generation of sensitivity information, task decomposition, and data transmission (including storage) as important attributes for a computer environment to have in order to support engineering design effectively. It is argued that when examined in terms of these attributes the presently available environment can be shown to be inadequate. A radical improvement is needed, and it may be achieved by combining new methods that have recently emerged from multidisciplinary design optimisation (MDO) with massively parallel processing computer technology. The caveat is that, for successful use of that technology in engineering computing, new paradigms for computing will have to be developed - specifically, innovative algorithms that are intrinsically parallel so that their performance scales up linearly with the number of processors. It may be speculated that the idea of simulating a complex behaviour by interaction of a large number of very simple models may be an inspiration for the above algorithms; the cellular automata are an example. Because of the long lead time needed to develop and mature new paradigms, development should begin now, even though the widespread availability of massively parallel processing is still a few years away.
Application of computational aero-acoustics to real world problems
NASA Technical Reports Server (NTRS)
Hardin, Jay C.
1996-01-01
The application of computational aeroacoustics (CAA) to real problems is discussed in relation to the analysis performed with the aim of assessing the application of the various techniques. It is considered that the applications are limited by the inability of the computational resources to resolve the large range of scales involved in high Reynolds number flows. Possible simplifications are discussed. It is considered that problems remain to be solved in relation to the efficient use of the power of parallel computers and in the development of turbulent modeling schemes. The goal of CAA is stated as being the implementation of acoustic design studies on a computer terminal with reasonable run times.
Scaling up digital circuit computation with DNA strand displacement cascades.
Qian, Lulu; Winfree, Erik
2011-06-03
To construct sophisticated biochemical circuits from scratch, one needs to understand how simple the building blocks can be and how robustly such circuits can scale up. Using a simple DNA reaction mechanism based on a reversible strand displacement process, we experimentally demonstrated several digital logic circuits, culminating in a four-bit square-root circuit that comprises 130 DNA strands. These multilayer circuits include thresholding and catalysis within every logical operation to perform digital signal restoration, which enables fast and reliable function in large circuits with roughly constant switching time and linear signal propagation delays. The design naturally incorporates other crucial elements for large-scale circuitry, such as general debugging tools, parallel circuit preparation, and an abstraction hierarchy supported by an automated circuit compiler.
Lessons Learned in Deploying the World s Largest Scale Lustre File System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dillow, David A; Fuller, Douglas; Wang, Feiyi
2010-01-01
The Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) is the world's largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF's diverse computational environment, the project had a number of ambitious goals. To support the workloads of the OLCF's diverse computational platforms, the aggregate performance and storage capacity of Spider exceed that of our previously deployed systems by a factor of 6x - 240 GB/sec, and 17x - 10 Petabytes, respectively. Furthermore, Spider supports over 26,000 clients concurrently accessing themore » file system, which exceeds our previously deployed systems by nearly 4x. In addition to these scalability challenges, moving to a center-wide shared file system required dramatically improved resiliency and fault-tolerance mechanisms. This paper details our efforts in designing, deploying, and operating Spider. Through a phased approach of research and development, prototyping, deployment, and transition to operations, this work has resulted in a number of insights into large-scale parallel file system architectures, from both the design and the operational perspectives. We present in this paper our solutions to issues such as network congestion, performance baselining and evaluation, file system journaling overheads, and high availability in a system with tens of thousands of components. We also discuss areas of continued challenges, such as stressed metadata performance and the need for file system quality of service alongside with our efforts to address them. Finally, operational aspects of managing a system of this scale are discussed along with real-world data and observations.« less
Parallel solution of sparse one-dimensional dynamic programming problems
NASA Technical Reports Server (NTRS)
Nicol, David M.
1989-01-01
Parallel computation offers the potential for quickly solving large computational problems. However, it is often a non-trivial task to effectively use parallel computers. Solution methods must sometimes be reformulated to exploit parallelism; the reformulations are often more complex than their slower serial counterparts. We illustrate these points by studying the parallelization of sparse one-dimensional dynamic programming problems, those which do not obviously admit substantial parallelization. We propose a new method for parallelizing such problems, develop analytic models which help us to identify problems which parallelize well, and compare the performance of our algorithm with existing algorithms on a multiprocessor.
Development of mpi_EPIC model for global agroecosystem modeling
Kang, Shujiang; Wang, Dali; Jeff A. Nichols; ...
2014-12-31
Models that address policy-maker concerns about multi-scale effects of food and bioenergy production systems are computationally demanding. We integrated the message passing interface algorithm into the process-based EPIC model to accelerate computation of ecosystem effects. Simulation performance was further enhanced by applying the Vampir framework. When this enhanced mpi_EPIC model was tested, total execution time for a global 30-year simulation of a switchgrass cropping system was shortened to less than 0.5 hours on a supercomputer. The results illustrate that mpi_EPIC using parallel design can balance simulation workloads and facilitate large-scale, high-resolution analysis of agricultural production systems, management alternatives and environmentalmore » effects.« less
Neuromorphic Hardware Architecture Using the Neural Engineering Framework for Pattern Recognition.
Wang, Runchun; Thakur, Chetan Singh; Cohen, Gregory; Hamilton, Tara Julia; Tapson, Jonathan; van Schaik, Andre
2017-06-01
We present a hardware architecture that uses the neural engineering framework (NEF) to implement large-scale neural networks on field programmable gate arrays (FPGAs) for performing massively parallel real-time pattern recognition. NEF is a framework that is capable of synthesising large-scale cognitive systems from subnetworks and we have previously presented an FPGA implementation of the NEF that successfully performs nonlinear mathematical computations. That work was developed based on a compact digital neural core, which consists of 64 neurons that are instantiated by a single physical neuron using a time-multiplexing approach. We have now scaled this approach up to build a pattern recognition system by combining identical neural cores together. As a proof of concept, we have developed a handwritten digit recognition system using the MNIST database and achieved a recognition rate of 96.55%. The system is implemented on a state-of-the-art FPGA and can process 5.12 million digits per second. The architecture and hardware optimisations presented offer high-speed and resource-efficient means for performing high-speed, neuromorphic, and massively parallel pattern recognition and classification tasks.
A parallel implementation of an off-lattice individual-based model of multicellular populations
NASA Astrophysics Data System (ADS)
Harvey, Daniel G.; Fletcher, Alexander G.; Osborne, James M.; Pitt-Francis, Joe
2015-07-01
As computational models of multicellular populations include ever more detailed descriptions of biophysical and biochemical processes, the computational cost of simulating such models limits their ability to generate novel scientific hypotheses and testable predictions. While developments in microchip technology continue to increase the power of individual processors, parallel computing offers an immediate increase in available processing power. To make full use of parallel computing technology, it is necessary to develop specialised algorithms. To this end, we present a parallel algorithm for a class of off-lattice individual-based models of multicellular populations. The algorithm divides the spatial domain between computing processes and comprises communication routines that ensure the model is correctly simulated on multiple processors. The parallel algorithm is shown to accurately reproduce the results of a deterministic simulation performed using a pre-existing serial implementation. We test the scaling of computation time, memory use and load balancing as more processes are used to simulate a cell population of fixed size. We find approximate linear scaling of both speed-up and memory consumption on up to 32 processor cores. Dynamic load balancing is shown to provide speed-up for non-regular spatial distributions of cells in the case of a growing population.
Stamatakis, Alexandros
2006-11-01
RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Gamma yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets > or =4000 taxa it also runs 2-3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25,057 (1463 bp) and 2182 (51,089 bp) taxa, respectively. icwww.epfl.ch/~stamatak
BioFVM: an efficient, parallelized diffusive transport solver for 3-D biological simulations
Ghaffarizadeh, Ahmadreza; Friedman, Samuel H.; Macklin, Paul
2016-01-01
Motivation: Computational models of multicellular systems require solving systems of PDEs for release, uptake, decay and diffusion of multiple substrates in 3D, particularly when incorporating the impact of drugs, growth substrates and signaling factors on cell receptors and subcellular systems biology. Results: We introduce BioFVM, a diffusive transport solver tailored to biological problems. BioFVM can simulate release and uptake of many substrates by cell and bulk sources, diffusion and decay in large 3D domains. It has been parallelized with OpenMP, allowing efficient simulations on desktop workstations or single supercomputer nodes. The code is stable even for large time steps, with linear computational cost scalings. Solutions are first-order accurate in time and second-order accurate in space. The code can be run by itself or as part of a larger simulator. Availability and implementation: BioFVM is written in C ++ with parallelization in OpenMP. It is maintained and available for download at http://BioFVM.MathCancer.org and http://BioFVM.sf.net under the Apache License (v2.0). Contact: paul.macklin@usc.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26656933
Semantics-based distributed I/O with the ParaMEDIC framework.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balaji, P.; Feng, W.; Lin, H.
2008-01-01
Many large-scale applications simultaneously rely on multiple resources for efficient execution. For example, such applications may require both large compute and storage resources; however, very few supercomputing centers can provide large quantities of both. Thus, data generated at the compute site oftentimes has to be moved to a remote storage site for either storage or visualization and analysis. Clearly, this is not an efficient model, especially when the two sites are distributed over a wide-area network. Thus, we present a framework called 'ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing' which uses application-specific semantic information to convert the generatedmore » data to orders-of-magnitude smaller metadata at the compute site, transfer the metadata to the storage site, and re-process the metadata at the storage site to regenerate the output. Specifically, ParaMEDIC trades a small amount of additional computation (in the form of data post-processing) for a potentially significant reduction in data that needs to be transferred in distributed environments.« less
Optimal processor assignment for pipeline computations
NASA Technical Reports Server (NTRS)
Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath
1991-01-01
The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual responses times for different processor sizes, find an assignment of processor to tasks. Two objectives are of interest: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p processor system and a series parallel precedence graph with n constituent tasks, an O(np2) algorithm is provided that finds the optimal assignment for the response time optimization problem; it was found that the assignment optimizing the constrained throughput in O(np2log p) time. Special cases of linear, independent, and tree graphs are also considered.
Multidisciplinary Optimization Methods for Aircraft Preliminary Design
NASA Technical Reports Server (NTRS)
Kroo, Ilan; Altus, Steve; Braun, Robert; Gage, Peter; Sobieski, Ian
1994-01-01
This paper describes a research program aimed at improved methods for multidisciplinary design and optimization of large-scale aeronautical systems. The research involves new approaches to system decomposition, interdisciplinary communication, and methods of exploiting coarse-grained parallelism for analysis and optimization. A new architecture, that involves a tight coupling between optimization and analysis, is intended to improve efficiency while simplifying the structure of multidisciplinary, computation-intensive design problems involving many analysis disciplines and perhaps hundreds of design variables. Work in two areas is described here: system decomposition using compatibility constraints to simplify the analysis structure and take advantage of coarse-grained parallelism; and collaborative optimization, a decomposition of the optimization process to permit parallel design and to simplify interdisciplinary communication requirements.
A parallel data management system for large-scale NASA datasets
NASA Technical Reports Server (NTRS)
Srivastava, Jaideep
1993-01-01
The past decade has experienced a phenomenal growth in the amount of data and resultant information generated by NASA's operations and research projects. A key application is the reprocessing problem which has been identified to require data management capabilities beyond those available today (PRAT93). The Intelligent Information Fusion (IIF) system (ROEL91) is an ongoing NASA project which has similar requirements. Deriving our understanding of NASA's future data management needs based on the above, this paper describes an approach to using parallel computer systems (processor and I/O architectures) to develop an efficient parallel database management system to address the needs. Specifically, we propose to investigate issues in low-level record organizations and management, complex query processing, and query compilation and scheduling.
A Fine-Grained Pipelined Implementation for Large-Scale Matrix Inversion on FPGA
NASA Astrophysics Data System (ADS)
Zhou, Jie; Dou, Yong; Zhao, Jianxun; Xia, Fei; Lei, Yuanwu; Tang, Yuxing
Large-scale matrix inversion play an important role in many applications. However to the best of our knowledge, there is no FPGA-based implementation. In this paper, we explore the possibility of accelerating large-scale matrix inversion on FPGA. To exploit the computational potential of FPGA, we introduce a fine-grained parallel algorithm for matrix inversion. A scalable linear array processing elements (PEs), which is the core component of the FPGA accelerator, is proposed to implement this algorithm. A total of 12 PEs can be integrated into an Altera StratixII EP2S130F1020C5 FPGA on our self-designed board. Experimental results show that a factor of 2.6 speedup and the maximum power-performance of 41 can be achieved compare to Pentium Dual CPU with double SSE threads.
Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shi, Xuanhua; Luo, Xuan; Liang, Junling
GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, the consistent asynchronous computing requires locking or atomic operations, leading to significant penalties/overheads when implemented on GPUs. As such, coloring algorithm is adopted to separate the vertices with potential updating conflicts, guaranteeing the consistency/correctness of the parallel processing. Common coloring algorithms, however, may suffer from low parallelism because of a large number of colors generally required for processing a large-scale graph with billions of vertices. We propose a light-weightmore » asynchronous processing framework called Frog with a preprocessing/hybrid coloring model. The fundamental idea is based on Pareto principle (or 80-20 rule) about coloring algorithms as we observed through masses of realworld graph coloring cases. We find that a majority of vertices (about 80%) are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency. Accordingly, our solution separates the processing of the vertices based on the distribution of colors. In this work, we mainly answer three questions: (1) how to partition the vertices in a sparse graph with maximized parallelism, (2) how to process large-scale graphs that cannot fit into GPU memory, and (3) how to reduce the overhead of data transfers on PCIe while processing each partition. We conduct experiments on real-world data (Amazon, DBLP, YouTube, RoadNet-CA, WikiTalk and Twitter) to evaluate our approach and make comparisons with well-known non-preprocessed (such as Totem, Medusa, MapGraph and Gunrock) and preprocessed (Cusha) approaches, by testing four classical algorithms (BFS, PageRank, SSSP and CC). On all the tested applications and datasets, Frog is able to significantly outperform existing GPU-based graph processing systems except Gunrock and MapGraph. MapGraph gets better performance than Frog when running BFS on RoadNet-CA. The comparison between Gunrock and Frog is inconclusive. Frog can outperform Gunrock more than 1.04X when running PageRank and SSSP, while the advantage of Frog is not obvious when running BFS and CC on some datasets especially for RoadNet-CA.« less
Improved treatment of exact exchange in Quantum ESPRESSO
Barnes, Taylor A.; Kurth, Thorsten; Carrier, Pierre; ...
2017-01-18
Here, we present an algorithm and implementation for the parallel computation of exact exchange in Quantum ESPRESSO (QE) that exhibits greatly improved strong scaling. QE is an open-source software package for electronic structure calculations using plane wave density functional theory, and supports the use of local, semi-local, and hybrid DFT functionals. Wider application of hybrid functionals is desirable for the improved simulation of electronic band energy alignments and thermodynamic properties, but the computational complexity of evaluating the exact exchange potential limits the practical application of hybrid functionals to large systems and requires efficient implementations. We demonstrate that existing implementations ofmore » hybrid DFT that utilize a single data structure for both the local and exact exchange regions of the code are significantly limited in the degree of parallelization achievable. We present a band-pair parallelization approach, in which the calculation of exact exchange is parallelized and evaluated independently from the parallelization of the remainder of the calculation, with the wavefunction data being efficiently transformed on-the-fly into a form that is optimal for each part of the calculation. For a 64 water molecule supercell, our new algorithm reduces the overall time to solution by nearly an order of magnitude.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barnes, Taylor A.; Kurth, Thorsten; Carrier, Pierre
Here, we present an algorithm and implementation for the parallel computation of exact exchange in Quantum ESPRESSO (QE) that exhibits greatly improved strong scaling. QE is an open-source software package for electronic structure calculations using plane wave density functional theory, and supports the use of local, semi-local, and hybrid DFT functionals. Wider application of hybrid functionals is desirable for the improved simulation of electronic band energy alignments and thermodynamic properties, but the computational complexity of evaluating the exact exchange potential limits the practical application of hybrid functionals to large systems and requires efficient implementations. We demonstrate that existing implementations ofmore » hybrid DFT that utilize a single data structure for both the local and exact exchange regions of the code are significantly limited in the degree of parallelization achievable. We present a band-pair parallelization approach, in which the calculation of exact exchange is parallelized and evaluated independently from the parallelization of the remainder of the calculation, with the wavefunction data being efficiently transformed on-the-fly into a form that is optimal for each part of the calculation. For a 64 water molecule supercell, our new algorithm reduces the overall time to solution by nearly an order of magnitude.« less
Equation solvers for distributed-memory computers
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.
1994-01-01
A large number of scientific and engineering problems require the rapid solution of large systems of simultaneous equations. The performance of parallel computers in this area now dwarfs traditional vector computers by nearly an order of magnitude. This talk describes the major issues involved in parallel equation solvers with particular emphasis on the Intel Paragon, IBM SP-1 and SP-2 processors.
Parallel multiscale simulations of a brain aneurysm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em, E-mail: george_karniadakis@brown.edu
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm.more » The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier–Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future work.« less
A parallel orbital-updating based plane-wave basis method for electronic structure calculations
NASA Astrophysics Data System (ADS)
Pan, Yan; Dai, Xiaoying; de Gironcoli, Stefano; Gong, Xin-Gao; Rignanese, Gian-Marco; Zhou, Aihui
2017-11-01
Motivated by the recently proposed parallel orbital-updating approach in real space method [1], we propose a parallel orbital-updating based plane-wave basis method for electronic structure calculations, for solving the corresponding eigenvalue problems. In addition, we propose two new modified parallel orbital-updating methods. Compared to the traditional plane-wave methods, our methods allow for two-level parallelization, which is particularly interesting for large scale parallelization. Numerical experiments show that these new methods are more reliable and efficient for large scale calculations on modern supercomputers.
Wall Modeled Large Eddy Simulation of Airfoil Trailing Edge Noise
NASA Astrophysics Data System (ADS)
Kocheemoolayil, Joseph; Lele, Sanjiva
2014-11-01
Large eddy simulation (LES) of airfoil trailing edge noise has largely been restricted to low Reynolds numbers due to prohibitive computational cost. Wall modeled LES (WMLES) is a computationally cheaper alternative that makes full-scale Reynolds numbers relevant to large wind turbines accessible. A systematic investigation of trailing edge noise prediction using WMLES is conducted. Detailed comparisons are made with experimental data. The stress boundary condition from a wall model does not constrain the fluctuating velocity to vanish at the wall. This limitation has profound implications for trailing edge noise prediction. The simulation over-predicts the intensity of fluctuating wall pressure and far-field noise. An improved wall model formulation that minimizes the over-prediction of fluctuating wall pressure is proposed and carefully validated. The flow configurations chosen for the study are from the workshop on benchmark problems for airframe noise computations. The large eddy simulation database is used to examine the adequacy of scaling laws that quantify the dependence of trailing edge noise on Mach number, Reynolds number and angle of attack. Simplifying assumptions invoked in engineering approaches towards predicting trailing edge noise are critically evaluated. We gratefully acknowledge financial support from GE Global Research and thank Cascade Technologies Inc. for providing access to their massively-parallel large eddy simulation framework.
Parallelized modelling and solution scheme for hierarchically scaled simulations
NASA Technical Reports Server (NTRS)
Padovan, Joe
1995-01-01
This two-part paper presents the results of a benchmarked analytical-numerical investigation into the operational characteristics of a unified parallel processing strategy for implicit fluid mechanics formulations. This hierarchical poly tree (HPT) strategy is based on multilevel substructural decomposition. The Tree morphology is chosen to minimize memory, communications and computational effort. The methodology is general enough to apply to existing finite difference (FD), finite element (FEM), finite volume (FV) or spectral element (SE) based computer programs without an extensive rewrite of code. In addition to finding large reductions in memory, communications, and computational effort associated with a parallel computing environment, substantial reductions are generated in the sequential mode of application. Such improvements grow with increasing problem size. Along with a theoretical development of general 2-D and 3-D HPT, several techniques for expanding the problem size that the current generation of computers are capable of solving, are presented and discussed. Among these techniques are several interpolative reduction methods. It was found that by combining several of these techniques that a relatively small interpolative reduction resulted in substantial performance gains. Several other unique features/benefits are discussed in this paper. Along with Part 1's theoretical development, Part 2 presents a numerical approach to the HPT along with four prototype CFD applications. These demonstrate the potential of the HPT strategy.
McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
Islam, Tanzima Zerin; Mohror, Kathryn; Bagchi, Saurabh; ...
2013-01-01
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility ofmore » checkpoints up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints show that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.« less
NASA Technical Reports Server (NTRS)
Sohn, Andrew; Biswas, Rupak
1996-01-01
Solving the hard Satisfiability Problem is time consuming even for modest-sized problem instances. Solving the Random L-SAT Problem is especially difficult due to the ratio of clauses to variables. This report presents a parallel synchronous simulated annealing method for solving the Random L-SAT Problem on a large-scale distributed-memory multiprocessor. In particular, we use a parallel synchronous simulated annealing procedure, called Generalized Speculative Computation, which guarantees the same decision sequence as sequential simulated annealing. To demonstrate the performance of the parallel method, we have selected problem instances varying in size from 100-variables/425-clauses to 5000-variables/21,250-clauses. Experimental results on the AP1000 multiprocessor indicate that our approach can satisfy 99.9 percent of the clauses while giving almost a 70-fold speedup on 500 processors.
Design and analysis of a global sub-mesoscale and tidal dynamics admitting virtual ocean.
NASA Astrophysics Data System (ADS)
Menemenlis, D.; Hill, C. N.
2016-02-01
We will describe the techniques used to realize a global kilometerscale ocean model configuration that includes representation of sea-ice and tidal excitation, and spans scales from planetary gyres to internal tides. A simulation using this model configuration provides a virtual ocean that admits some sub-mesoscale dynamics and tidal energetics not normally represented in global calculations. This extends simulated ocean behavior beyond broadly quasi-geostrophic flows and provides a preliminary example of a next generation computational approach to explicitly probing the interactions between instabilities that are usually parameterized and dominant energetic scales in the ocean. From previous process studies we have ascertained that this can lead to a qualitative improvement in the realism of many significant processes including geostrophic eddy dynamics, shelf-break exchange and topographic mixing. Computationally we exploit high-degrees of parallelism in both numerical evaluation and in recording model state to persistent disk storage. Together this allows us to compute and record a full three-dimensional model trajectory at hourly frequency for a timeperiod of 5 months with less than 9 million core hours of parallel computer time, using the present generation NASA Ames Research Center facilities. We have used this capability to create a 5 month trajectory archive, sampled at high spatial and temporal frequency for an ocean configuration that is initialized from a realistic data-assimilated state and driven with reanalysis surface forcing from ECMWF. The resulting database of model state provides a novel virtual laboratory for exploring coupling across scales in the ocean, and for testing ideas on the relationship between small scale fluxes and large scale state. The computation is complemented by counterpart computations that are coarsened two and four times respectively. In this presentation we will review the computational and numerical technologies employed and show how the high spatio-temporal frequency archive of model state can provide a new and promising tool for researching richer ocean dynamics at scale. We will also outline how computations of this nature could be combined with next generation computer hardware plans to help inform important climate process questions.
BioPig: a Hadoop-based analytic toolkit for large-scale sequence data.
Nordberg, Henrik; Bhatia, Karan; Wang, Kai; Wang, Zhong
2013-12-01
The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most of the current bioinformatics tools become obsolete as they fail to scale with data. To tackle this 'data deluge', here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation. We built BioPig on the Apache's Hadoop MapReduce system and the Pig data flow language. Compared with traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig's programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb sequences demonstrates that it scales automatically with size of data; and finally, BioPig can be ported without modification on many Hadoop infrastructures, as tested with Magellan system at National Energy Research Scientific Computing Center and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel program framework with the potential to greatly accelerate data-intensive bioinformatics analysis.
Design for Run-Time Monitor on Cloud Computing
NASA Astrophysics Data System (ADS)
Kang, Mikyung; Kang, Dong-In; Yun, Mira; Park, Gyung-Leen; Lee, Junghoon
Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is the type of a parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring the system status change, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design Run-Time Monitor (RTM) which is a system software to monitor the application behavior at run-time, analyze the collected information, and optimize resources on cloud computing. RTM monitors application software through library instrumentation as well as underlying hardware through performance counter optimizing its computing configuration based on the analyzed data.
Static analysis techniques for semiautomatic synthesis of message passing software skeletons
Sottile, Matthew; Dagit, Jason; Zhang, Deli; ...
2015-06-29
The design of high-performance computing architectures demands performance analysis of large-scale parallel applications to derive various parameters concerning hardware design and software development. The process of performance analysis and benchmarking an application can be done in several ways with varying degrees of fidelity. One of the most cost-effective ways is to do a coarse-grained study of large-scale parallel applications through the use of program skeletons. The concept of a “program skeleton” that we discuss in this article is an abstracted program that is derived from a larger program where source code that is determined to be irrelevant is removed formore » the purposes of the skeleton. In this work, we develop a semiautomatic approach for extracting program skeletons based on compiler program analysis. Finally, we demonstrate correctness of our skeleton extraction process by comparing details from communication traces, as well as show the performance speedup of using skeletons by running simulations in the SST/macro simulator.« less
Shi, Yulin; Veidenbaum, Alexander V; Nicolau, Alex; Xu, Xiangmin
2015-01-15
Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post hoc processing and analysis. Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22× speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. Copyright © 2014 Elsevier B.V. All rights reserved.
Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
2014-01-01
Background Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s) To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
Wilkinson, Karl A; Hine, Nicholas D M; Skylaris, Chris-Kriton
2014-11-11
We present a hybrid MPI-OpenMP implementation of Linear-Scaling Density Functional Theory within the ONETEP code. We illustrate its performance on a range of high performance computing (HPC) platforms comprising shared-memory nodes with fast interconnect. Our work has focused on applying OpenMP parallelism to the routines which dominate the computational load, attempting where possible to parallelize different loops from those already parallelized within MPI. This includes 3D FFT box operations, sparse matrix algebra operations, calculation of integrals, and Ewald summation. While the underlying numerical methods are unchanged, these developments represent significant changes to the algorithms used within ONETEP to distribute the workload across CPU cores. The new hybrid code exhibits much-improved strong scaling relative to the MPI-only code and permits calculations with a much higher ratio of cores to atoms. These developments result in a significantly shorter time to solution than was possible using MPI alone and facilitate the application of the ONETEP code to systems larger than previously feasible. We illustrate this with benchmark calculations from an amyloid fibril trimer containing 41,907 atoms. We use the code to study the mechanism of delamination of cellulose nanofibrils when undergoing sonification, a process which is controlled by a large number of interactions that collectively determine the structural properties of the fibrils. Many energy evaluations were needed for these simulations, and as these systems comprise up to 21,276 atoms this would not have been feasible without the developments described here.
Automated Performance Prediction of Message-Passing Parallel Programs
NASA Technical Reports Server (NTRS)
Block, Robert J.; Sarukkai, Sekhar; Mehra, Pankaj; Woodrow, Thomas S. (Technical Monitor)
1995-01-01
The increasing use of massively parallel supercomputers to solve large-scale scientific problems has generated a need for tools that can predict scalability trends of applications written for these machines. Much work has been done to create simple models that represent important characteristics of parallel programs, such as latency, network contention, and communication volume. But many of these methods still require substantial manual effort to represent an application in the model's format. The NIK toolkit described in this paper is the result of an on-going effort to automate the formation of analytic expressions of program execution time, with a minimum of programmer assistance. In this paper we demonstrate the feasibility of our approach, by extending previous work to detect and model communication patterns automatically, with and without overlapped computations. The predictions derived from these models agree, within reasonable limits, with execution times of programs measured on the Intel iPSC/860 and Paragon. Further, we demonstrate the use of MK in selecting optimal computational grain size and studying various scalability metrics.
Developing science gateways for drug discovery in a grid environment.
Pérez-Sánchez, Horacio; Rezaei, Vahid; Mezhuyev, Vitaliy; Man, Duhu; Peña-García, Jorge; den-Haan, Helena; Gesing, Sandra
2016-01-01
Methods for in silico screening of large databases of molecules increasingly complement and replace experimental techniques to discover novel compounds to combat diseases. As these techniques become more complex and computationally costly we are faced with an increasing problem to provide the research community of life sciences with a convenient tool for high-throughput virtual screening on distributed computing resources. To this end, we recently integrated the biophysics-based drug-screening program FlexScreen into a service, applicable for large-scale parallel screening and reusable in the context of scientific workflows. Our implementation is based on Pipeline Pilot and Simple Object Access Protocol and provides an easy-to-use graphical user interface to construct complex workflows, which can be executed on distributed computing resources, thus accelerating the throughput by several orders of magnitude.
Multi-source Geospatial Data Analysis with Google Earth Engine
NASA Astrophysics Data System (ADS)
Erickson, T.
2014-12-01
The Google Earth Engine platform is a cloud computing environment for data analysis that combines a public data catalog with a large-scale computational facility optimized for parallel processing of geospatial data. The data catalog is a multi-petabyte archive of georeferenced datasets that include images from Earth observing satellite and airborne sensors (examples: USGS Landsat, NASA MODIS, USDA NAIP), weather and climate datasets, and digital elevation models. Earth Engine supports both a just-in-time computation model that enables real-time preview and debugging during algorithm development for open-ended data exploration, and a batch computation mode for applying algorithms over large spatial and temporal extents. The platform automatically handles many traditionally-onerous data management tasks, such as data format conversion, reprojection, and resampling, which facilitates writing algorithms that combine data from multiple sensors and/or models. Although the primary use of Earth Engine, to date, has been the analysis of large Earth observing satellite datasets, the computational platform is generally applicable to a wide variety of use cases that require large-scale geospatial data analyses. This presentation will focus on how Earth Engine facilitates the analysis of geospatial data streams that originate from multiple separate sources (and often communities) and how it enables collaboration during algorithm development and data exploration. The talk will highlight current projects/analyses that are enabled by this functionality.https://earthengine.google.org
Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce.
Aji, Ablimit; Wang, Fusheng; Vo, Hoang; Lee, Rubao; Liu, Qiaoling; Zhang, Xiaodong; Saltz, Joel
2013-08-01
Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS - a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have showed that performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of library for processing spatial queries, and as an integrated software package in Hive.
Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce
Aji, Ablimit; Wang, Fusheng; Vo, Hoang; Lee, Rubao; Liu, Qiaoling; Zhang, Xiaodong; Saltz, Joel
2013-01-01
Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS – a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have showed that performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of library for processing spatial queries, and as an integrated software package in Hive. PMID:24187650
Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing
NASA Technical Reports Server (NTRS)
Ozguner, Fusun
1996-01-01
Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time T(sub par), of the application is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.
Computational methods and software systems for dynamics and control of large space structures
NASA Technical Reports Server (NTRS)
Park, K. C.; Felippa, C. A.; Farhat, C.; Pramono, E.
1990-01-01
Two key areas of crucial importance to the computer-based simulation of large space structures are discussed. The first area involves multibody dynamics (MBD) of flexible space structures, with applications directed to deployment, construction, and maneuvering. The second area deals with advanced software systems, with emphasis on parallel processing. The latest research thrust in the second area involves massively parallel computers.
Collen, M F
1994-01-01
This article summarizes the origins of informatics, which is based on the science, engineering, and technology of computer hardware, software, and communications. In just four decades, from the 1950s to the 1990s, computer technology has progressed from slow, first-generation vacuum tubes, through the invention of the transistor and its incorporation into microprocessor chips, and ultimately, to fast, fourth-generation very-large-scale-integrated silicon chips. Programming has undergone a parallel transformation, from cumbersome, first-generation, machine languages to efficient, fourth-generation application-oriented languages. Communication has evolved from simple copper wires to complex fiberoptic cables in computer-linked networks. The digital computer has profound implications for the development and practice of clinical medicine. PMID:7719803
NASA Astrophysics Data System (ADS)
Imamura, Seigo; Ono, Kenji; Yokokawa, Mitsuo
2016-07-01
Ensemble computing, which is an instance of capacity computing, is an effective computing scenario for exascale parallel supercomputers. In ensemble computing, there are multiple linear systems associated with a common coefficient matrix. We improve the performance of iterative solvers for multiple vectors by solving them at the same time, that is, by solving for the product of the matrices. We implemented several iterative methods and compared their performance. The maximum performance on Sparc VIIIfx was 7.6 times higher than that of a naïve implementation. Finally, to deal with the different convergence processes of linear systems, we introduced a control method to eliminate the calculation of already converged vectors.
Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Li, Xiaoye; Heber, Gerd; Biswas, Rupak
2000-01-01
The ability of computers to solve hitherto intractable problems and simulate complex processes using mathematical models makes them an indispensable part of modern science and engineering. Computer simulations of large-scale realistic applications usually require solving a set of non-linear partial differential equations (PDES) over a finite region. For example, one thrust area in the DOE Grand Challenge projects is to design future accelerators such as the SpaHation Neutron Source (SNS). Our colleagues at SLAC need to model complex RFQ cavities with large aspect ratios. Unstructured grids are currently used to resolve the small features in a large computational domain; dynamic mesh adaptation will be added in the future for additional efficiency. The PDEs for electromagnetics are discretized by the FEM method, which leads to a generalized eigenvalue problem Kx = AMx, where K and M are the stiffness and mass matrices, and are very sparse. In a typical cavity model, the number of degrees of freedom is about one million. For such large eigenproblems, direct solution techniques quickly reach the memory limits. Instead, the most widely-used methods are Krylov subspace methods, such as Lanczos or Jacobi-Davidson. In all the Krylov-based algorithms, sparse matrix-vector multiplication (SPMV) must be performed repeatedly. Therefore, the efficiency of SPMV usually determines the eigensolver speed. SPMV is also one of the most heavily used kernels in large-scale numerical simulations.
O'Donnell, Michael
2015-01-01
State-and-transition simulation modeling relies on knowledge of vegetation composition and structure (states) that describe community conditions, mechanistic feedbacks such as fire that can affect vegetation establishment, and ecological processes that drive community conditions as well as the transitions between these states. However, as the need for modeling larger and more complex landscapes increase, a more advanced awareness of computing resources becomes essential. The objectives of this study include identifying challenges of executing state-and-transition simulation models, identifying common bottlenecks of computing resources, developing a workflow and software that enable parallel processing of Monte Carlo simulations, and identifying the advantages and disadvantages of different computing resources. To address these objectives, this study used the ApexRMS® SyncroSim software and embarrassingly parallel tasks of Monte Carlo simulations on a single multicore computer and on distributed computing systems. The results demonstrated that state-and-transition simulation models scale best in distributed computing environments, such as high-throughput and high-performance computing, because these environments disseminate the workloads across many compute nodes, thereby supporting analysis of larger landscapes, higher spatial resolution vegetation products, and more complex models. Using a case study and five different computing environments, the top result (high-throughput computing versus serial computations) indicated an approximate 96.6% decrease of computing time. With a single, multicore compute node (bottom result), the computing time indicated an 81.8% decrease relative to using serial computations. These results provide insight into the tradeoffs of using different computing resources when research necessitates advanced integration of ecoinformatics incorporating large and complicated data inputs and models. - See more at: http://aimspress.com/aimses/ch/reader/view_abstract.aspx?file_no=Environ2015030&flag=1#sthash.p1XKDtF8.dpuf
A k-space method for acoustic propagation using coupled first-order equations in three dimensions.
Tillett, Jason C; Daoud, Mohammad I; Lacefield, James C; Waag, Robert C
2009-09-01
A previously described two-dimensional k-space method for large-scale calculation of acoustic wave propagation in tissues is extended to three dimensions. The three-dimensional method contains all of the two-dimensional method features that allow accurate and stable calculation of propagation. These features are spectral calculation of spatial derivatives, temporal correction that produces exact propagation in a homogeneous medium, staggered spatial and temporal grids, and a perfectly matched boundary layer. Spectral evaluation of spatial derivatives is accomplished using a fast Fourier transform in three dimensions. This computational bottleneck requires all-to-all communication; execution time in a parallel implementation is therefore sensitive to node interconnect latency and bandwidth. Accuracy of the three-dimensional method is evaluated through comparisons with exact solutions for media having spherical inhomogeneities. Large-scale calculations in three dimensions were performed by distributing the nearly 50 variables per voxel that are used to implement the method over a cluster of computers. Two computer clusters used to evaluate method accuracy are compared. Comparisons of k-space calculations with exact methods including absorption highlight the need to model accurately the medium dispersion relationships, especially in large-scale media. Accurately modeled media allow the k-space method to calculate acoustic propagation in tissues over hundreds of wavelengths.
Parallel Computational Protein Design.
Zhou, Yichao; Donald, Bruce R; Zeng, Jianyang
2017-01-01
Computational structure-based protein design (CSPD) is an important problem in computational biology, which aims to design or improve a prescribed protein function based on a protein structure template. It provides a practical tool for real-world protein engineering applications. A popular CSPD method that guarantees to find the global minimum energy solution (GMEC) is to combine both dead-end elimination (DEE) and A* tree search algorithms. However, in this framework, the A* search algorithm can run in exponential time in the worst case, which may become the computation bottleneck of large-scale computational protein design process. To address this issue, we extend and add a new module to the OSPREY program that was previously developed in the Donald lab (Gainza et al., Methods Enzymol 523:87, 2013) to implement a GPU-based massively parallel A* algorithm for improving protein design pipeline. By exploiting the modern GPU computational framework and optimizing the computation of the heuristic function for A* search, our new program, called gOSPREY, can provide up to four orders of magnitude speedups in large protein design cases with a small memory overhead comparing to the traditional A* search algorithm implementation, while still guaranteeing the optimality. In addition, gOSPREY can be configured to run in a bounded-memory mode to tackle the problems in which the conformation space is too large and the global optimal solution cannot be computed previously. Furthermore, the GPU-based A* algorithm implemented in the gOSPREY program can be combined with the state-of-the-art rotamer pruning algorithms such as iMinDEE (Gainza et al., PLoS Comput Biol 8:e1002335, 2012) and DEEPer (Hallen et al., Proteins 81:18-39, 2013) to also consider continuous backbone and side-chain flexibility.
Scalable Molecular Dynamics with NAMD
Phillips, James C.; Braun, Rosemary; Wang, Wei; Gumbart, James; Tajkhorshid, Emad; Villa, Elizabeth; Chipot, Christophe; Skeel, Robert D.; Kalé, Laxmikant; Schulten, Klaus
2008-01-01
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD scales to hundreds of processors on high-end parallel platforms, as well as tens of processors on low-cost commodity clusters, and also runs on individual desktop and laptop computers. NAMD works with AMBER and CHARMM potential functions, parameters, and file formats. This paper, directed to novices as well as experts, first introduces concepts and methods used in the NAMD program, describing the classical molecular dynamics force field, equations of motion, and integration methods along with the efficient electrostatics evaluation algorithms employed and temperature and pressure controls used. Features for steering the simulation across barriers and for calculating both alchemical and conformational free energy differences are presented. The motivations for and a roadmap to the internal design of NAMD, implemented in C++ and based on Charm++ parallel objects, are outlined. The factors affecting the serial and parallel performance of a simulation are discussed. Next, typical NAMD use is illustrated with representative applications to a small, a medium, and a large biomolecular system, highlighting particular features of NAMD, e.g., the Tcl scripting language. Finally, the paper provides a list of the key features of NAMD and discusses the benefits of combining NAMD with the molecular graphics/sequence analysis software VMD and the grid computing/collaboratory software BioCoRE. NAMD is distributed free of charge with source code at www.ks.uiuc.edu. PMID:16222654
GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation.
Hess, Berk; Kutzner, Carsten; van der Spoel, David; Lindahl, Erik
2008-03-01
Molecular simulation is an extremely useful, but computationally very expensive tool for studies of chemical and biomolecular systems. Here, we present a new implementation of our molecular simulation toolkit GROMACS which now both achieves extremely high performance on single processors from algorithmic optimizations and hand-coded routines and simultaneously scales very well on parallel machines. The code encompasses a minimal-communication domain decomposition algorithm, full dynamic load balancing, a state-of-the-art parallel constraint solver, and efficient virtual site algorithms that allow removal of hydrogen atom degrees of freedom to enable integration time steps up to 5 fs for atomistic simulations also in parallel. To improve the scaling properties of the common particle mesh Ewald electrostatics algorithms, we have in addition used a Multiple-Program, Multiple-Data approach, with separate node domains responsible for direct and reciprocal space interactions. Not only does this combination of algorithms enable extremely long simulations of large systems but also it provides that simulation performance on quite modest numbers of standard cluster nodes.
Performance of low-rank QR approximation of the finite element Biot-Savart law
DOE Office of Scientific and Technical Information (OSTI.GOV)
White, D; Fasenfest, B
2006-10-16
In this paper we present a low-rank QR method for evaluating the discrete Biot-Savart law. Our goal is to develop an algorithm that is easily implemented on parallel computers. It is assumed that the known current density and the unknown magnetic field are both expressed in a finite element expansion, and we wish to compute the degrees-of-freedom (DOF) in the basis function expansion of the magnetic field. The matrix that maps the current DOF to the field DOF is full, but if the spatial domain is properly partitioned the matrix can be written as a block matrix, with blocks representingmore » distant interactions being low rank and having a compressed QR representation. While an octree partitioning of the matrix may be ideal, for ease of parallel implementation we employ a partitioning based on number of processors. The rank of each block (i.e. the compression) is determined by the specific geometry and is computed dynamically. In this paper we provide the algorithmic details and present computational results for large-scale computations.« less
Efficient partitioning and assignment on programs for multiprocessor execution
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1993-01-01
The general problem studied is that of segmenting or partitioning programs for distribution across a multiprocessor system. Efficient partitioning and the assignment of program elements are of great importance since the time consumed in this overhead activity may easily dominate the computation, effectively eliminating any gains made by the use of the parallelism. In this study, the partitioning of sequentially structured programs (written in FORTRAN) is evaluated. Heuristics, developed for similar applications are examined. Finally, a model for queueing networks with finite queues is developed which may be used to analyze multiprocessor system architectures with a shared memory approach to the problem of partitioning. The properties of sequentially written programs form obstacles to large scale (at the procedure or subroutine level) parallelization. Data dependencies of even the minutest nature, reflecting the sequential development of the program, severely limit parallelism. The design of heuristic algorithms is tied to the experience gained in the parallel splitting. Parallelism obtained through the physical separation of data has seen some success, especially at the data element level. Data parallelism on a grander scale requires models that accurately reflect the effects of blocking caused by finite queues. A model for the approximation of the performance of finite queueing networks is developed. This model makes use of the decomposition approach combined with the efficiency of product form solutions.
Fast parallel molecular algorithms for DNA-based computation: factoring integers.
Chang, Weng-Long; Guo, Minyi; Ho, Michael Shan-Hui
2005-06-01
The RSA public-key cryptosystem is an algorithm that converts input data to an unrecognizable encryption and converts the unrecognizable data back into its original decryption form. The security of the RSA public-key cryptosystem is based on the difficulty of factoring the product of two large prime numbers. This paper demonstrates to factor the product of two large prime numbers, and is a breakthrough in basic biological operations using a molecular computer. In order to achieve this, we propose three DNA-based algorithms for parallel subtractor, parallel comparator, and parallel modular arithmetic that formally verify our designed molecular solutions for factoring the product of two large prime numbers. Furthermore, this work indicates that the cryptosystems using public-key are perhaps insecure and also presents clear evidence of the ability of molecular computing to perform complicated mathematical operations.
The Center for Multiscale Plasma Dynamics, Final Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gombosi, Tamas I.
The University of Michigan participated in the joint UCLA/Maryland fusion science center focused on plasma physics problems for which the traditional separation of the dynamics into microscale and macroscale processes breaks down. These processes involve large scale flows and magnetic fields tightly coupled to the small scale, kinetic dynamics of turbulence, particle acceleration and energy cascade. The interaction between these vastly disparate scales controls the evolution of the system. The enormous range of temporal and spatial scales associated with these problems renders direct simulation intractable even in computations that use the largest existing parallel computers. Our efforts focused on twomore » main problems: the development of Hall MHD solvers on solution adaptive grids and the development of solution adaptive grids using generalized coordinates so that the proper geometry of inertial confinement can be taken into account and efficient refinement strategies can be obtained.« less
A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer.
Peng, Shaoliang; Yang, Shunyun; Su, Wenhe; Zhang, Xiaoyu; Zhang, Tenglilang; Liu, Weiguo; Zhao, Xingming
2017-06-16
Molecular Dynamics (MD) is the simulation of the dynamic behavior of atoms and molecules. As the most popular software for molecular dynamics, GROMACS cannot work on large-scale data because of limit computing resources. In this paper, we propose a CPU and Intel® Xeon Phi Many Integrated Core (MIC) collaborated parallel framework to accelerate GROMACS using the offload mode on a MIC coprocessor, with which the performance of GROMACS is improved significantly, especially with the utility of Tianhe-2 supercomputer. Furthermore, we optimize GROMACS so that it can run on both the CPU and MIC at the same time. In addition, we accelerate multi-node GROMACS so that it can be used in practice. Benchmarking on real data, our accelerated GROMACS performs very well and reduces computation time significantly. Source code: https://github.com/tianhe2/gromacs-mic.
An Analysis of Performance Enhancement Techniques for Overset Grid Applications
NASA Technical Reports Server (NTRS)
Djomehri, J. J.; Biswas, R.; Potsdam, M.; Strawn, R. C.; Biegel, Bryan (Technical Monitor)
2002-01-01
The overset grid methodology has significantly reduced time-to-solution of high-fidelity computational fluid dynamics (CFD) simulations about complex aerospace configurations. The solution process resolves the geometrical complexity of the problem domain by using separately generated but overlapping structured discretization grids that periodically exchange information through interpolation. However, high performance computations of such large-scale realistic applications must be handled efficiently on state-of-the-art parallel supercomputers. This paper analyzes the effects of various performance enhancement techniques on the parallel efficiency of an overset grid Navier-Stokes CFD application running on an SGI Origin2000 machine. Specifically, the role of asynchronous communication, grid splitting, and grid grouping strategies are presented and discussed. Results indicate that performance depends critically on the level of latency hiding and the quality of load balancing across the processors.
Structure preserving parallel algorithms for solving the Bethe–Salpeter eigenvalue problem
Shao, Meiyue; da Jornada, Felipe H.; Yang, Chao; ...
2015-10-02
The Bethe–Salpeter eigenvalue problem is a dense structured eigenvalue problem arising from discretized Bethe–Salpeter equation in the context of computing exciton energies and states. A computational challenge is that at least half of the eigenvalues and the associated eigenvectors are desired in practice. In this paper, we establish the equivalence between Bethe–Salpeter eigenvalue problems and real Hamiltonian eigenvalue problems. Based on theoretical analysis, structure preserving algorithms for a class of Bethe–Salpeter eigenvalue problems are proposed. We also show that for this class of problems all eigenvalues obtained from the Tamm–Dancoff approximation are overestimated. In order to solve large scale problemsmore » of practical interest, we discuss parallel implementations of our algorithms targeting distributed memory systems. Finally, several numerical examples are presented to demonstrate the efficiency and accuracy of our algorithms.« less
Associative Pattern Recognition In Analog VLSI Circuits
NASA Technical Reports Server (NTRS)
Tawel, Raoul
1995-01-01
Winner-take-all circuit selects best-match stored pattern. Prototype cascadable very-large-scale integrated (VLSI) circuit chips built and tested to demonstrate concept of electronic associative pattern recognition. Based on low-power, sub-threshold analog complementary oxide/semiconductor (CMOS) VLSI circuitry, each chip can store 128 sets (vectors) of 16 analog values (vector components), vectors representing known patterns as diverse as spectra, histograms, graphs, or brightnesses of pixels in images. Chips exploit parallel nature of vector quantization architecture to implement highly parallel processing in relatively simple computational cells. Through collective action, cells classify input pattern in fraction of microsecond while consuming power of few microwatts.
A Scalable O(N) Algorithm for Large-Scale Parallel First-Principles Molecular Dynamics Simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Osei-Kuffuor, Daniel; Fattebert, Jean-Luc
2014-01-01
Traditional algorithms for first-principles molecular dynamics (FPMD) simulations only gain a modest capability increase from current petascale computers, due to their O(N 3) complexity and their heavy use of global communications. To address this issue, we are developing a truly scalable O(N) complexity FPMD algorithm, based on density functional theory (DFT), which avoids global communications. The computational model uses a general nonorthogonal orbital formulation for the DFT energy functional, which requires knowledge of selected elements of the inverse of the associated overlap matrix. We present a scalable algorithm for approximately computing selected entries of the inverse of the overlap matrix,more » based on an approximate inverse technique, by inverting local blocks corresponding to principal submatrices of the global overlap matrix. The new FPMD algorithm exploits sparsity and uses nearest neighbor communication to provide a computational scheme capable of extreme scalability. Accuracy is controlled by the mesh spacing of the finite difference discretization, the size of the localization regions in which the electronic orbitals are confined, and a cutoff beyond which the entries of the overlap matrix can be omitted when computing selected entries of its inverse. We demonstrate the algorithm's excellent parallel scaling for up to O(100K) atoms on O(100K) processors, with a wall-clock time of O(1) minute per molecular dynamics time step.« less
Real science at the petascale.
Saksena, Radhika S; Boghosian, Bruce; Fazendeiro, Luis; Kenway, Owain A; Manos, Steven; Mazzeo, Marco D; Sadiq, S Kashif; Suter, James L; Wright, David; Coveney, Peter V
2009-06-28
We describe computational science research that uses petascale resources to achieve scientific results at unprecedented scales and resolution. The applications span a wide range of domains, from investigation of fundamental problems in turbulence through computational materials science research to biomedical applications at the forefront of HIV/AIDS research and cerebrovascular haemodynamics. This work was mainly performed on the US TeraGrid 'petascale' resource, Ranger, at Texas Advanced Computing Center, in the first half of 2008 when it was the largest computing system in the world available for open scientific research. We have sought to use this petascale supercomputer optimally across application domains and scales, exploiting the excellent parallel scaling performance found on up to at least 32 768 cores for certain of our codes in the so-called 'capability computing' category as well as high-throughput intermediate-scale jobs for ensemble simulations in the 32-512 core range. Furthermore, this activity provides evidence that conventional parallel programming with MPI should be successful at the petascale in the short to medium term. We also report on the parallel performance of some of our codes on up to 65 636 cores on the IBM Blue Gene/P system at the Argonne Leadership Computing Facility, which has recently been named the fastest supercomputer in the world for open science.
RT DDA: A hybrid method for predicting the scattering properties by densely packed media
NASA Astrophysics Data System (ADS)
Ramezan Pour, B.; Mackowski, D.
2017-12-01
The most accurate approaches to predicting the scattering properties of particulate media are based on exact solutions of the Maxwell's equations (MEs), such as the T-matrix and discrete dipole methods. Applying these techniques for optically thick targets is challenging problem due to the large-scale computations and are usually substituted by phenomenological radiative transfer (RT) methods. On the other hand, the RT technique is of questionable validity in media with large particle packing densities. In recent works, we used numerically exact ME solvers to examine the effects of particle concentration on the polarized reflection properties of plane parallel random media. The simulations were performed for plane parallel layers of wavelength-sized spherical particles, and results were compared with RT predictions. We have shown that RTE results monotonically converge to the exact solution as the particle volume fraction becomes smaller and one can observe a nearly perfect fit for packing densities of 2%-5%. This study describes the hybrid technique composed of exact and numerical scalar RT methods. The exact methodology in this work is the plane parallel discrete dipole approximation whereas the numerical method is based on the adding and doubling method. This approach not only decreases the computational time owing to the RT method but also includes the interference and multiple scattering effects, so it may be applicable to large particle density conditions.
Solving very large, sparse linear systems on mesh-connected parallel computers
NASA Technical Reports Server (NTRS)
Opsahl, Torstein; Reif, John
1987-01-01
The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard is given. Also, a detailed discussion of data mappings and performance issues is given.
Parallel-Processing Test Bed For Simulation Software
NASA Technical Reports Server (NTRS)
Blech, Richard; Cole, Gary; Townsend, Scott
1996-01-01
Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chin, George; Marquez, Andres; Choudhury, Sutanay
2012-09-01
Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis ofmore » large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.« less
Parallel heuristics for scalable community detection
Lu, Hao; Halappanavar, Mahantesh; Kalyanaraman, Ananth
2015-08-14
Community detection has become a fundamental operation in numerous graph-theoretic applications. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed in 2008, the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method ismore » also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains. Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number or fewer iterations, while providing real speedups of up to 16x using 32 threads.« less
GPU accelerated fuzzy connected image segmentation by using CUDA.
Zhuge, Ying; Cao, Yong; Miller, Robert W
2009-01-01
Image segmentation techniques using fuzzy connectedness principles have shown their effectiveness in segmenting a variety of objects in several large applications in recent years. However, one problem of these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays commodity graphics hardware provides high parallel computing power. In this paper, we present a parallel fuzzy connected image segmentation algorithm on Nvidia's Compute Unified Device Architecture (CUDA) platform for segmenting large medical image data sets. Our experiments based on three data sets with small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 7.2x, 7.3x, and 14.4x, correspondingly, for the three data sets over the sequential implementation of fuzzy connected image segmentation algorithm on CPU.
A new tool called DISSECT for analysing large genomic data sets using a Big Data approach
Canela-Xandri, Oriol; Law, Andy; Gray, Alan; Woolliams, John A.; Tenesa, Albert
2015-01-01
Large-scale genetic and genomic data are increasingly available and the major bottleneck in their analysis is a lack of sufficiently scalable computational tools. To address this problem in the context of complex traits analysis, we present DISSECT. DISSECT is a new and freely available software that is able to exploit the distributed-memory parallel computational architectures of compute clusters, to perform a wide range of genomic and epidemiologic analyses, which currently can only be carried out on reduced sample sizes or under restricted conditions. We demonstrate the usefulness of our new tool by addressing the challenge of predicting phenotypes from genotype data in human populations using mixed-linear model analysis. We analyse simulated traits from 470,000 individuals genotyped for 590,004 SNPs in ∼4 h using the combined computational power of 8,400 processor cores. We find that prediction accuracies in excess of 80% of the theoretical maximum could be achieved with large sample sizes. PMID:26657010
A Parallel Finite Set Statistical Simulator for Multi-Target Detection and Tracking
NASA Astrophysics Data System (ADS)
Hussein, I.; MacMillan, R.
2014-09-01
Finite Set Statistics (FISST) is a powerful Bayesian inference tool for the joint detection, classification and tracking of multi-target environments. FISST is capable of handling phenomena such as clutter, misdetections, and target birth and decay. Implicit within the approach are solutions to the data association and target label-tracking problems. Finally, FISST provides generalized information measures that can be used for sensor allocation across different types of tasks such as: searching for new targets, and classification and tracking of known targets. These FISST capabilities have been demonstrated on several small-scale illustrative examples. However, for implementation in a large-scale system as in the Space Situational Awareness problem, these capabilities require a lot of computational power. In this paper, we implement FISST in a parallel environment for the joint detection and tracking of multi-target systems. In this implementation, false alarms and misdetections will be modeled. Target birth and decay will not be modeled in the present paper. We will demonstrate the success of the method for as many targets as we possibly can in a desktop parallel environment. Performance measures will include: number of targets in the simulation, certainty of detected target tracks, computational time as a function of clutter returns and number of targets, among other factors.
NASA Astrophysics Data System (ADS)
Casu, F.; de Luca, C.; Lanari, R.; Manunta, M.; Zinno, I.
2016-12-01
A methodology for computing surface deformation time series and mean velocity maps of large areas is presented. Our approach relies on the availability of a multi-temporal set of Synthetic Aperture Radar (SAR) data collected from ascending and descending orbits over an area of interest, and also permits to estimate the vertical and horizontal (East-West) displacement components of the Earth's surface. The adopted methodology is based on an advanced Cloud Computing implementation of the Differential SAR Interferometry (DInSAR) Parallel Small Baseline Subset (P-SBAS) processing chain which allows the unsupervised processing of large SAR data volumes, from the raw data (level-0) imagery up to the generation of DInSAR time series and maps. The presented solution, which is highly scalable, has been tested on the ascending and descending ENVISAT SAR archives, which have been acquired over a large area of Southern California (US) that extends for about 90.000 km2. Such an input dataset has been processed in parallel by exploiting 280 computing nodes of the Amazon Web Services Cloud environment. Moreover, to produce the final mean deformation velocity maps of the vertical and East-West displacement components of the whole investigated area, we took also advantage of the information available from external GPS measurements that permit to account for possible regional trends not easily detectable by DInSAR and to refer the P-SBAS measurements to an external geodetic datum. The presented results clearly demonstrate the effectiveness of the proposed approach that paves the way to the extensive use of the available ERS and ENVISAT SAR data archives. Furthermore, the proposed methodology can be particularly suitable to deal with the very huge data flow provided by the Sentinel-1 constellation, thus permitting to extend the DInSAR analyses at a nearly global scale. This work is partially supported by: the DPC-CNR agreement, the EPOS-IP project and the ESA GEP project.
2017-10-01
Facility is a large-scale cascade that allows detailed flow field surveys and blade surface measurements.10–12 The facility has a continuous run ...structured grids at 2 flow conditions, cruise and takeoff, of the VSPT blade . Computations were run in parallel on a Department of Defense...RANS/LES) and Unsteady RANS Predictions of Separated Flow for a Variable-Speed Power- Turbine Blade Operating with Low Inlet Turbulence Levels
Shared versus distributed memory multiprocessors
NASA Technical Reports Server (NTRS)
Jordan, Harry F.
1991-01-01
The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors.
A Large number of fast cosmological simulations
NASA Astrophysics Data System (ADS)
Koda, Jun; Kazin, E.; Blake, C.
2014-01-01
Mock galaxy catalogs are essential tools to analyze large-scale structure data. Many independent realizations of mock catalogs are necessary to evaluate the uncertainties in the measurements. We perform 3600 cosmological simulations for the WiggleZ Dark Energy Survey to obtain the new improved Baron Acoustic Oscillation (BAO) cosmic distance measurements using the density field "reconstruction" technique. We use 1296^3 particles in a periodic box of 600/h Mpc on a side, which is the minimum requirement from the survey volume and observed galaxies. In order to perform such large number of simulations, we developed a parallel code using the COmoving Lagrangian Acceleration (COLA) method, which can simulate cosmological large-scale structure reasonably well with only 10 time steps. Our simulation is more than 100 times faster than conventional N-body simulations; one COLA simulation takes only 15 minutes with 216 computing cores. We have completed the 3600 simulations with a reasonable computation time of 200k core hours. We also present the results of the revised WiggleZ BAO distance measurement, which are significantly improved by the reconstruction technique.
CERN data services for LHC computing
NASA Astrophysics Data System (ADS)
Espinal, X.; Bocchi, E.; Chan, B.; Fiorot, A.; Iven, J.; Lo Presti, G.; Lopez, J.; Gonzalez, H.; Lamanna, M.; Mascetti, L.; Moscicki, J.; Pace, A.; Peters, A.; Ponce, S.; Rousseau, H.; van der Ster, D.
2017-10-01
Dependability, resilience, adaptability and efficiency. Growing requirements require tailoring storage services and novel solutions. Unprecedented volumes of data coming from the broad number of experiments at CERN need to be quickly available in a highly scalable way for large-scale processing and data distribution while in parallel they are routed to tape for long-term archival. These activities are critical for the success of HEP experiments. Nowadays we operate at high incoming throughput (14GB/s during 2015 LHC Pb-Pb run and 11PB in July 2016) and with concurrent complex production work-loads. In parallel our systems provide the platform for the continuous user and experiment driven work-loads for large-scale data analysis, including end-user access and sharing. The storage services at CERN cover the needs of our community: EOS and CASTOR as a large-scale storage; CERNBox for end-user access and sharing; Ceph as data back-end for the CERN OpenStack infrastructure, NFS services and S3 functionality; AFS for legacy distributed-file-system services. In this paper we will summarise the experience in supporting LHC experiments and the transition of our infrastructure from static monolithic systems to flexible components providing a more coherent environment with pluggable protocols, tuneable QoS, sharing capabilities and fine grained ACLs management while continuing to guarantee dependable and robust services.
Parallel Computing:. Some Activities in High Energy Physics
NASA Astrophysics Data System (ADS)
Willers, Ian
This paper examines some activities in High Energy Physics that utilise parallel computing. The topic includes all computing from the proposed SIMD front end detectors, the farming applications, high-powered RISC processors and the large machines in the computer centers. We start by looking at the motivation behind using parallelism for general purpose computing. The developments around farming are then described from its simplest form to the more complex system in Fermilab. Finally, there is a list of some developments that are happening close to the experiments.
Parallel fuzzy connected image segmentation on GPU
Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.
2011-01-01
Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA’s compute unified device Architecture (cuda) platform for segmenting medical image data sets. Methods: In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as cuda kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Results: Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. Conclusions: The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set. PMID:21859037
Parallel fuzzy connected image segmentation on GPU.
Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W
2011-07-01
Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's compute unified device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.
NASA Astrophysics Data System (ADS)
Huhn, William Paul; Lange, Björn; Yu, Victor; Blum, Volker; Lee, Seyong; Yoon, Mina
Density-functional theory has been well established as the dominant quantum-mechanical computational method in the materials community. Large accurate simulations become very challenging on small to mid-scale computers and require high-performance compute platforms to succeed. GPU acceleration is one promising approach. In this talk, we present a first implementation of all-electron density-functional theory in the FHI-aims code for massively parallel GPU-based platforms. Special attention is paid to the update of the density and to the integration of the Hamiltonian and overlap matrices, realized in a domain decomposition scheme on non-uniform grids. The initial implementation scales well across nodes on ORNL's Titan Cray XK7 supercomputer (8 to 64 nodes, 16 MPI ranks/node) and shows an overall speed up in runtime due to utilization of the K20X Tesla GPUs on each Titan node of 1.4x, with the charge density update showing a speed up of 2x. Further acceleration opportunities will be discussed. Work supported by the LDRD Program of ORNL managed by UT-Battle, LLC, for the U.S. DOE and by the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
DOE Office of Scientific and Technical Information (OSTI.GOV)
G.A. Pope; K. Sephernoori; D.C. McKinney
1996-03-15
This report describes the application of distributed-memory parallel programming techniques to a compositional simulator called UTCHEM. The University of Texas Chemical Flooding reservoir simulator (UTCHEM) is a general-purpose vectorized chemical flooding simulator that models the transport of chemical species in three-dimensional, multiphase flow through permeable media. The parallel version of UTCHEM addresses solving large-scale problems by reducing the amount of time that is required to obtain the solution as well as providing a flexible and portable programming environment. In this work, the original parallel version of UTCHEM was modified and ported to CRAY T3D and CRAY T3E, distributed-memory, multiprocessor computersmore » using CRAY-PVM as the interprocessor communication library. Also, the data communication routines were modified such that the portability of the original code across different computer architectures was mad possible.« less
Challenges and Opportunities in Modeling of the Global Atmosphere
NASA Astrophysics Data System (ADS)
Janjic, Zavisa; Djurdjevic, Vladimir; Vasic, Ratko
2016-04-01
Modeling paradigms on global scales may need to be reconsidered in order to better utilize the power of massively parallel processing. For high computational efficiency with distributed memory, each core should work on a small subdomain of the full integration domain, and exchange only few rows of halo data with the neighbouring cores. Note that the described scenario strongly favors horizontally local discretizations. This is relatively easy to achieve in regional models. However, the spherical geometry complicates the problem. The latitude-longitude grid with local in space and explicit in time differencing has been an early choice and remained in use ever since. The problem with this method is that the grid size in the longitudinal direction tends to zero as the poles are approached. So, in addition to having unnecessarily high resolution near the poles, polar filtering has to be applied in order to use a time step of a reasonable size. However, the polar filtering requires transpositions involving extra communications as well as more computations. The spectral transform method and the semi-implicit semi-Lagrangian schemes opened the way for application of spectral representation. With some variations, such techniques are currently dominating in global models. Unfortunately, the horizontal non-locality is inherent to the spectral representation and implicit time differencing, which inhibits scaling on a large number of cores. In this respect the lat-lon grid with polar filtering is a step in the right direction, particularly at high resolutions where the Legendre transforms become increasingly expensive. Other grids with reduced variability of grid distances, such as various versions of the cubed sphere and the hexagonal/pentagonal ("soccer ball") grids, were proposed almost fifty years ago. However, on these grids, large-scale (wavenumber 4 and 5) fictitious solutions ("grid imprinting") with significant amplitudes can develop. Due to their large scales, that are comparable to the scales of the dominant Rossby waves, such fictitious solutions are hard to identify and remove. Another new challenge on the global scale is that the limit of validity of the hydrostatic approximation is rapidly being approached. Relaxing the hydrostatic approximation requieres careful reformulation of the model dynamics and more computations and communications. The unified Non-hydrostatic Multi-scale Model (NMMB) will be briefly discussed as an example. The non-hydrostatic dynamics were designed in such a way as to avoid over-specification. The global version is run on the latitude-longitude grid, and the polar filter selectively slows down the waves that would otherwise be unstable without modifying their amplitudes. The model has been successfully tested on various scales. The skill of the medium range forecasts produced by the NMMB is comparable to that of other major medium range models, and its computational efficiency on parallel computers is good.
Testing New Programming Paradigms with NAS Parallel Benchmarks
NASA Technical Reports Server (NTRS)
Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.
2000-01-01
Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with increasing complexity of real applications. Technologies have been developing to aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. Recent years new effort has been made in defining new parallel programming paradigms. The best examples are: HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplify the tedious tasks encountered in writing message passing programs. HPF is independent of memory hierarchy, however, due to the immaturity of compiler technology its performance is still questionable. Although use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promisses portability in a heterogeneous environment and offers possibility to "compile once and run anywhere." In light of testing these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java-threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage was applied to several benchmarks, noticeably BT and SP, resulting in better sequential performance. In order to overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as lack of supporting the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outer-most loop level to achieve the largest granularity. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size in the figure in next page. These results were obtained on an SGI Origin2000 (195MHz) with MIPSpro-f77 compiler 7.2.1 for OpenMP and MPI codes and PGI pghpf-2.4.3 compiler with MPI interface for HPF programs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chiang, Patrick
2014-01-31
The research goal of this CAREER proposal is to develop energy-efficient, VLSI interconnect circuits and systems that will facilitate future massively-parallel, high-performance computing. Extreme-scale computing will exhibit massive parallelism on multiple vertical levels, from thou sands of computational units on a single processor to thousands of processors in a single data center. Unfortunately, the energy required to communicate between these units at every level (on chip, off-chip, off-rack) will be the critical limitation to energy efficiency. Therefore, the PI's career goal is to become a leading researcher in the design of energy-efficient VLSI interconnect for future computing systems.
Exact parallel algorithms for some members of the traveling salesman problem family
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pekny, J.F.
1989-01-01
The traveling salesman problem and its many generalizations comprise one of the best known combinatorial optimization problem families. Most members of the family are NP-complete problems so that exact algorithms require an unpredictable and sometimes large computational effort. Parallel computers offer hope for providing the power required to meet these demands. A major barrier to applying parallel computers is the lack of parallel algorithms. The contributions presented in this thesis center around new exact parallel algorithms for the asymmetric traveling salesman problem (ATSP), prize collecting traveling salesman problem (PCTSP), and resource constrained traveling salesman problem (RCTSP). The RCTSP is amore » particularly difficult member of the family since finding a feasible solution is an NP-complete problem. An exact sequential algorithm is also presented for the directed hamiltonian cycle problem (DHCP). The DHCP algorithm is superior to current heuristic approaches and represents the first exact method applicable to large graphs. Computational results presented for each of the algorithms demonstrates the effectiveness of combining efficient algorithms with parallel computing methods. Performance statistics are reported for randomly generated ATSPs with 7,500 cities, PCTSPs with 200 cities, RCTSPs with 200 cities, DHCPs with 3,500 vertices, and assignment problems of size 10,000. Sequential results were collected on a Sun 4/260 engineering workstation, while parallel results were collected using a 14 and 100 processor BBN Butterfly Plus computer. The computational results represent the largest instances ever solved to optimality on any type of computer.« less
PREMER: a Tool to Infer Biological Networks.
Villaverde, Alejandro F; Becker, Kolja; Banga, Julio R
2017-10-04
Inferring the structure of unknown cellular networks is a main challenge in computational biology. Data-driven approaches based on information theory can determine the existence of interactions among network nodes automatically. However, the elucidation of certain features - such as distinguishing between direct and indirect interactions or determining the direction of a causal link - requires estimating information-theoretic quantities in a multidimensional space. This can be a computationally demanding task, which acts as a bottleneck for the application of elaborate algorithms to large-scale network inference problems. The computational cost of such calculations can be alleviated by the use of compiled programs and parallelization. To this end we have developed PREMER (Parallel Reverse Engineering with Mutual information & Entropy Reduction), a software toolbox that can run in parallel and sequential environments. It uses information theoretic criteria to recover network topology and determine the strength and causality of interactions, and allows incorporating prior knowledge, imputing missing data, and correcting outliers. PREMER is a free, open source software tool that does not require any commercial software. Its core algorithms are programmed in FORTRAN 90 and implement OpenMP directives. It has user interfaces in Python and MATLAB/Octave, and runs on Windows, Linux and OSX (https://sites.google.com/site/premertoolbox/).
al3c: high-performance software for parameter inference using Approximate Bayesian Computation.
Stram, Alexander H; Marjoram, Paul; Chen, Gary K
2015-11-01
The development of Approximate Bayesian Computation (ABC) algorithms for parameter inference which are both computationally efficient and scalable in parallel computing environments is an important area of research. Monte Carlo rejection sampling, a fundamental component of ABC algorithms, is trivial to distribute over multiple processors but is inherently inefficient. While development of algorithms such as ABC Sequential Monte Carlo (ABC-SMC) help address the inherent inefficiencies of rejection sampling, such approaches are not as easily scaled on multiple processors. As a result, current Bayesian inference software offerings that use ABC-SMC lack the ability to scale in parallel computing environments. We present al3c, a C++ framework for implementing ABC-SMC in parallel. By requiring only that users define essential functions such as the simulation model and prior distribution function, al3c abstracts the user from both the complexities of parallel programming and the details of the ABC-SMC algorithm. By using the al3c framework, the user is able to scale the ABC-SMC algorithm in parallel computing environments for his or her specific application, with minimal programming overhead. al3c is offered as a static binary for Linux and OS-X computing environments. The user completes an XML configuration file and C++ plug-in template for the specific application, which are used by al3c to obtain the desired results. Users can download the static binaries, source code, reference documentation and examples (including those in this article) by visiting https://github.com/ahstram/al3c. astram@usc.edu Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Biocellion: accelerating computer simulation of multicellular biological system models
Kang, Seunghwa; Kahan, Simon; McDermott, Jason; Flann, Nicholas; Shmulevich, Ilya
2014-01-01
Motivation: Biological system behaviors are often the outcome of complex interactions among a large number of cells and their biotic and abiotic environment. Computational biologists attempt to understand, predict and manipulate biological system behavior through mathematical modeling and computer simulation. Discrete agent-based modeling (in combination with high-resolution grids to model the extracellular environment) is a popular approach for building biological system models. However, the computational complexity of this approach forces computational biologists to resort to coarser resolution approaches to simulate large biological systems. High-performance parallel computers have the potential to address the computing challenge, but writing efficient software for parallel computers is difficult and time-consuming. Results: We have developed Biocellion, a high-performance software framework, to solve this computing challenge using parallel computers. To support a wide range of multicellular biological system models, Biocellion asks users to provide their model specifics by filling the function body of pre-defined model routines. Using Biocellion, modelers without parallel computing expertise can efficiently exploit parallel computers with less effort than writing sequential programs from scratch. We simulate cell sorting, microbial patterning and a bacterial system in soil aggregate as case studies. Availability and implementation: Biocellion runs on x86 compatible systems with the 64 bit Linux operating system and is freely available for academic use. Visit http://biocellion.com for additional information. Contact: seunghwa.kang@pnnl.gov PMID:25064572
Hybrid massively parallel fast sweeping method for static Hamilton-Jacobi equations
NASA Astrophysics Data System (ADS)
Detrixhe, Miles; Gibou, Frédéric
2016-10-01
The fast sweeping method is a popular algorithm for solving a variety of static Hamilton-Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
A depth-first search algorithm to compute elementary flux modes by linear programming.
Quek, Lake-Ee; Nielsen, Lars K
2014-07-30
The decomposition of complex metabolic networks into elementary flux modes (EFMs) provides a useful framework for exploring reaction interactions systematically. Generating a complete set of EFMs for large-scale models, however, is near impossible. Even for moderately-sized models (<400 reactions), existing approaches based on the Double Description method must iterate through a large number of combinatorial candidates, thus imposing an immense processor and memory demand. Based on an alternative elementarity test, we developed a depth-first search algorithm using linear programming (LP) to enumerate EFMs in an exhaustive fashion. Constraints can be introduced to directly generate a subset of EFMs satisfying the set of constraints. The depth-first search algorithm has a constant memory overhead. Using flux constraints, a large LP problem can be massively divided and parallelized into independent sub-jobs for deployment into computing clusters. Since the sub-jobs do not overlap, the approach scales to utilize all available computing nodes with minimal coordination overhead or memory limitations. The speed of the algorithm was comparable to efmtool, a mainstream Double Description method, when enumerating all EFMs; the attrition power gained from performing flux feasibility tests offsets the increased computational demand of running an LP solver. Unlike the Double Description method, the algorithm enables accelerated enumeration of all EFMs satisfying a set of constraints.
Chessa, Manuela; Bianchi, Valentina; Zampetti, Massimo; Sabatini, Silvio P; Solari, Fabio
2012-01-01
The intrinsic parallelism of visual neural architectures based on distributed hierarchical layers is well suited to be implemented on the multi-core architectures of modern graphics cards. The design strategies that allow us to optimally take advantage of such parallelism, in order to efficiently map on GPU the hierarchy of layers and the canonical neural computations, are proposed. Specifically, the advantages of a cortical map-like representation of the data are exploited. Moreover, a GPU implementation of a novel neural architecture for the computation of binocular disparity from stereo image pairs, based on populations of binocular energy neurons, is presented. The implemented neural model achieves good performances in terms of reliability of the disparity estimates and a near real-time execution speed, thus demonstrating the effectiveness of the devised design strategies. The proposed approach is valid in general, since the neural building blocks we implemented are a common basis for the modeling of visual neural functionalities.
NASA Astrophysics Data System (ADS)
Konduri, Aditya
Many natural and engineering systems are governed by nonlinear partial differential equations (PDEs) which result in a multiscale phenomena, e.g. turbulent flows. Numerical simulations of these problems are computationally very expensive and demand for extreme levels of parallelism. At realistic conditions, simulations are being carried out on massively parallel computers with hundreds of thousands of processing elements (PEs). It has been observed that communication between PEs as well as their synchronization at these extreme scales take up a significant portion of the total simulation time and result in poor scalability of codes. This issue is likely to pose a bottleneck in scalability of codes on future Exascale systems. In this work, we propose an asynchronous computing algorithm based on widely used finite difference methods to solve PDEs in which synchronization between PEs due to communication is relaxed at a mathematical level. We show that while stability is conserved when schemes are used asynchronously, accuracy is greatly degraded. Since message arrivals at PEs are random processes, so is the behavior of the error. We propose a new statistical framework in which we show that average errors drop always to first-order regardless of the original scheme. We propose new asynchrony-tolerant schemes that maintain accuracy when synchronization is relaxed. The quality of the solution is shown to depend, not only on the physical phenomena and numerical schemes, but also on the characteristics of the computing machine. A novel algorithm using remote memory access communications has been developed to demonstrate excellent scalability of the method for large-scale computing. Finally, we present a path to extend this method in solving complex multi-scale problems on Exascale machines.
Quaglio, Pietro; Yegenoglu, Alper; Torre, Emiliano; Endres, Dominik M; Grün, Sonja
2017-01-01
Repeated, precise sequences of spikes are largely considered a signature of activation of cell assemblies. These repeated sequences are commonly known under the name of spatio-temporal patterns (STPs). STPs are hypothesized to play a role in the communication of information in the computational process operated by the cerebral cortex. A variety of statistical methods for the detection of STPs have been developed and applied to electrophysiological recordings, but such methods scale poorly with the current size of available parallel spike train recordings (more than 100 neurons). In this work, we introduce a novel method capable of overcoming the computational and statistical limits of existing analysis techniques in detecting repeating STPs within massively parallel spike trains (MPST). We employ advanced data mining techniques to efficiently extract repeating sequences of spikes from the data. Then, we introduce and compare two alternative approaches to distinguish statistically significant patterns from chance sequences. The first approach uses a measure known as conceptual stability, of which we investigate a computationally cheap approximation for applications to such large data sets. The second approach is based on the evaluation of pattern statistical significance. In particular, we provide an extension to STPs of a method we recently introduced for the evaluation of statistical significance of synchronous spike patterns. The performance of the two approaches is evaluated in terms of computational load and statistical power on a variety of artificial data sets that replicate specific features of experimental data. Both methods provide an effective and robust procedure for detection of STPs in MPST data. The method based on significance evaluation shows the best overall performance, although at a higher computational cost. We name the novel procedure the spatio-temporal Spike PAttern Detection and Evaluation (SPADE) analysis.
Quaglio, Pietro; Yegenoglu, Alper; Torre, Emiliano; Endres, Dominik M.; Grün, Sonja
2017-01-01
Repeated, precise sequences of spikes are largely considered a signature of activation of cell assemblies. These repeated sequences are commonly known under the name of spatio-temporal patterns (STPs). STPs are hypothesized to play a role in the communication of information in the computational process operated by the cerebral cortex. A variety of statistical methods for the detection of STPs have been developed and applied to electrophysiological recordings, but such methods scale poorly with the current size of available parallel spike train recordings (more than 100 neurons). In this work, we introduce a novel method capable of overcoming the computational and statistical limits of existing analysis techniques in detecting repeating STPs within massively parallel spike trains (MPST). We employ advanced data mining techniques to efficiently extract repeating sequences of spikes from the data. Then, we introduce and compare two alternative approaches to distinguish statistically significant patterns from chance sequences. The first approach uses a measure known as conceptual stability, of which we investigate a computationally cheap approximation for applications to such large data sets. The second approach is based on the evaluation of pattern statistical significance. In particular, we provide an extension to STPs of a method we recently introduced for the evaluation of statistical significance of synchronous spike patterns. The performance of the two approaches is evaluated in terms of computational load and statistical power on a variety of artificial data sets that replicate specific features of experimental data. Both methods provide an effective and robust procedure for detection of STPs in MPST data. The method based on significance evaluation shows the best overall performance, although at a higher computational cost. We name the novel procedure the spatio-temporal Spike PAttern Detection and Evaluation (SPADE) analysis. PMID:28596729
Xyce Parallel Electronic Simulator : users' guide, version 2.0.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoekstra, Robert John; Waters, Lon J.; Rankin, Eric Lamont
2004-06-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator capable of simulating electrical circuits at a variety of abstraction levels. Primarily, Xyce has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability the current state-of-the-art in the following areas: {sm_bullet} Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers. {sm_bullet} Improved performance for allmore » numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. {sm_bullet} Device models which are specifically tailored to meet Sandia's needs, including many radiation-aware devices. {sm_bullet} A client-server or multi-tiered operating model wherein the numerical kernel can operate independently of the graphical user interface (GUI). {sm_bullet} Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing of computing platforms. These include serial, shared-memory and distributed-memory parallel implementation - which allows it to run efficiently on the widest possible number parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. One feature required by designers is the ability to add device models, many specific to the needs of Sandia, to the code. To this end, the device package in the Xyce These input formats include standard analytical models, behavioral models look-up Parallel Electronic Simulator is designed to support a variety of device model inputs. tables, and mesh-level PDE device models. Combined with this flexible interface is an architectural design that greatly simplifies the addition of circuit models. One of the most important feature of Xyce is in providing a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia now has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods) research and development can be performed. Ultimately, these capabilities are migrated to end users.« less
Machine learning for Big Data analytics in plants.
Ma, Chuang; Zhang, Hao Helen; Wang, Xiangfeng
2014-12-01
Rapid advances in high-throughput genomic technology have enabled biology to enter the era of 'Big Data' (large datasets). The plant science community not only needs to build its own Big-Data-compatible parallel computing and data management infrastructures, but also to seek novel analytical paradigms to extract information from the overwhelming amounts of data. Machine learning offers promising computational and analytical solutions for the integrative analysis of large, heterogeneous and unstructured datasets on the Big-Data scale, and is gradually gaining popularity in biology. This review introduces the basic concepts and procedures of machine-learning applications and envisages how machine learning could interface with Big Data technology to facilitate basic research and biotechnology in the plant sciences. Copyright © 2014 Elsevier Ltd. All rights reserved.
Hybrid MPI+OpenMP Programming of an Overset CFD Solver and Performance Investigations
NASA Technical Reports Server (NTRS)
Djomehri, M. Jahed; Jin, Haoqiang H.; Biegel, Bryan (Technical Monitor)
2002-01-01
This report describes a two level parallelization of a Computational Fluid Dynamic (CFD) solver with multi-zone overset structured grids. The approach is based on a hybrid MPI+OpenMP programming model suitable for shared memory and clusters of shared memory machines. The performance investigations of the hybrid application on an SGI Origin2000 (O2K) machine is reported using medium and large scale test problems.
Parallel VLSI architecture emulation and the organization of APSA/MPP
NASA Technical Reports Server (NTRS)
Odonnell, John T.
1987-01-01
The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.
NASA Technical Reports Server (NTRS)
Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)
1990-01-01
Attention is given to such topics as an evaluation of block algorithm variants in LAPACK and presents a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized Eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
Multilevel summation method for electrostatic force evaluation.
Hardy, David J; Wu, Zhe; Phillips, James C; Stone, John E; Skeel, Robert D; Schulten, Klaus
2015-02-10
The multilevel summation method (MSM) offers an efficient algorithm utilizing convolution for evaluating long-range forces arising in molecular dynamics simulations. Shifting the balance of computation and communication, MSM provides key advantages over the ubiquitous particle–mesh Ewald (PME) method, offering better scaling on parallel computers and permitting more modeling flexibility, with support for periodic systems as does PME but also for semiperiodic and nonperiodic systems. The version of MSM available in the simulation program NAMD is described, and its performance and accuracy are compared with the PME method. The accuracy feasible for MSM in practical applications reproduces PME results for water property calculations of density, diffusion constant, dielectric constant, surface tension, radial distribution function, and distance-dependent Kirkwood factor, even though the numerical accuracy of PME is higher than that of MSM. Excellent agreement between MSM and PME is found also for interface potentials of air–water and membrane–water interfaces, where long-range Coulombic interactions are crucial. Applications demonstrate also the suitability of MSM for systems with semiperiodic and nonperiodic boundaries. For this purpose, simulations have been performed with periodic boundaries along directions parallel to a membrane surface but not along the surface normal, yielding membrane pore formation induced by an imbalance of charge across the membrane. Using a similar semiperiodic boundary condition, ion conduction through a graphene nanopore driven by an ion gradient has been simulated. Furthermore, proteins have been simulated inside a single spherical water droplet. Finally, parallel scalability results show the ability of MSM to outperform PME when scaling a system of modest size (less than 100 K atoms) to over a thousand processors, demonstrating the suitability of MSM for large-scale parallel simulation.
Future in biomolecular computation
NASA Astrophysics Data System (ADS)
Wimmer, E.
1988-01-01
Large-scale computations for biomolecules are dominated by three levels of theory: rigorous quantum mechanical calculations for molecules with up to about 30 atoms, semi-empirical quantum mechanical calculations for systems with up to several hundred atoms, and force-field molecular dynamics studies of biomacromolecules with 10,000 atoms and more including surrounding solvent molecules. It can be anticipated that increased computational power will allow the treatment of larger systems of ever growing complexity. Due to the scaling of the computational requirements with increasing number of atoms, the force-field approaches will benefit the most from increased computational power. On the other hand, progress in methodologies such as density functional theory will enable us to treat larger systems on a fully quantum mechanical level and a combination of molecular dynamics and quantum mechanics can be envisioned. One of the greatest challenges in biomolecular computation is the protein folding problem. It is unclear at this point, if an approach with current methodologies will lead to a satisfactory answer or if unconventional, new approaches will be necessary. In any event, due to the complexity of biomolecular systems, a hierarchy of approaches will have to be established and used in order to capture the wide ranges of length-scales and time-scales involved in biological processes. In terms of hardware development, speed and power of computers will increase while the price/performance ratio will become more and more favorable. Parallelism can be anticipated to become an integral architectural feature in a range of computers. It is unclear at this point, how fast massively parallel systems will become easy enough to use so that new methodological developments can be pursued on such computers. Current trends show that distributed processing such as the combination of convenient graphics workstations and powerful general-purpose supercomputers will lead to a new style of computing in which the calculations are monitored and manipulated as they proceed. The combination of a numeric approach with artificial-intelligence approaches can be expected to open up entirely new possibilities. Ultimately, the most exciding aspect of the future in biomolecular computing will be the unexpected discoveries.
Parallelization Issues and Particle-In Codes.
NASA Astrophysics Data System (ADS)
Elster, Anne Cathrine
1994-01-01
"Everything should be made as simple as possible, but not simpler." Albert Einstein. The field of parallel scientific computing has concentrated on parallelization of individual modules such as matrix solvers and factorizers. However, many applications involve several interacting modules. Our analyses of a particle-in-cell code modeling charged particles in an electric field, show that these accompanying dependencies affect data partitioning and lead to new parallelization strategies concerning processor, memory and cache utilization. Our test-bed, a KSR1, is a distributed memory machine with a globally shared addressing space. However, most of the new methods presented hold generally for hierarchical and/or distributed memory systems. We introduce a novel approach that uses dual pointers on the local particle arrays to keep the particle locations automatically partially sorted. Complexity and performance analyses with accompanying KSR benchmarks, have been included for both this scheme and for the traditional replicated grids approach. The latter approach maintains load-balance with respect to particles. However, our results demonstrate it fails to scale properly for problems with large grids (say, greater than 128-by-128) running on as few as 15 KSR nodes, since the extra storage and computation time associated with adding the grid copies, becomes significant. Our grid partitioning scheme, although harder to implement, does not need to replicate the whole grid. Consequently, it scales well for large problems on highly parallel systems. It may, however, require load balancing schemes for non-uniform particle distributions. Our dual pointer approach may facilitate this through dynamically partitioned grids. We also introduce hierarchical data structures that store neighboring grid-points within the same cache -line by reordering the grid indexing. This alignment produces a 25% savings in cache-hits for a 4-by-4 cache. A consideration of the input data's effect on the simulation may lead to further improvements. For example, in the case of mean particle drift, it is often advantageous to partition the grid primarily along the direction of the drift. The particle-in-cell codes for this study were tested using physical parameters, which lead to predictable phenomena including plasma oscillations and two-stream instabilities. An overview of the most central references related to parallel particle codes is also given.
NASA Astrophysics Data System (ADS)
Sourbier, Florent; Operto, Stéphane; Virieux, Jean; Amestoy, Patrick; L'Excellent, Jean-Yves
2009-03-01
This is the first paper in a two-part series that describes a massively parallel code that performs 2D frequency-domain full-waveform inversion of wide-aperture seismic data for imaging complex structures. Full-waveform inversion methods, namely quantitative seismic imaging methods based on the resolution of the full wave equation, are computationally expensive. Therefore, designing efficient algorithms which take advantage of parallel computing facilities is critical for the appraisal of these approaches when applied to representative case studies and for further improvements. Full-waveform modelling requires the resolution of a large sparse system of linear equations which is performed with the massively parallel direct solver MUMPS for efficient multiple-shot simulations. Efficiency of the multiple-shot solution phase (forward/backward substitutions) is improved by using the BLAS3 library. The inverse problem relies on a classic local optimization approach implemented with a gradient method. The direct solver returns the multiple-shot wavefield solutions distributed over the processors according to a domain decomposition driven by the distribution of the LU factors. The domain decomposition of the wavefield solutions is used to compute in parallel the gradient of the objective function and the diagonal Hessian, this latter providing a suitable scaling of the gradient. The algorithm allows one to test different strategies for multiscale frequency inversion ranging from successive mono-frequency inversion to simultaneous multifrequency inversion. These different inversion strategies will be illustrated in the following companion paper. The parallel efficiency and the scalability of the code will also be quantified.
Implementation of DFT application on ternary optical computer
NASA Astrophysics Data System (ADS)
Junjie, Peng; Youyi, Fu; Xiaofeng, Zhang; Shuai, Kong; Xinyu, Wei
2018-03-01
As its characteristics of huge number of data bits and low energy consumption, optical computing may be used in the applications such as DFT etc. which needs a lot of computation and can be implemented in parallel. According to this, DFT implementation methods in full parallel as well as in partial parallel are presented. Based on resources ternary optical computer (TOC), extensive experiments were carried out. Experimental results show that the proposed schemes are correct and feasible. They provide a foundation for further exploration of the applications on TOC that needs a large amount calculation and can be processed in parallel.
NASA Astrophysics Data System (ADS)
Li, Gen; Tang, Chun-An; Liang, Zheng-Zhao
2017-01-01
Multi-scale high-resolution modeling of rock failure process is a powerful means in modern rock mechanics studies to reveal the complex failure mechanism and to evaluate engineering risks. However, multi-scale continuous modeling of rock, from deformation, damage to failure, has raised high requirements on the design, implementation scheme and computation capacity of the numerical software system. This study is aimed at developing the parallel finite element procedure, a parallel rock failure process analysis (RFPA) simulator that is capable of modeling the whole trans-scale failure process of rock. Based on the statistical meso-damage mechanical method, the RFPA simulator is able to construct heterogeneous rock models with multiple mechanical properties, deal with and represent the trans-scale propagation of cracks, in which the stress and strain fields are solved for the damage evolution analysis of representative volume element by the parallel finite element method (FEM) solver. This paper describes the theoretical basis of the approach and provides the details of the parallel implementation on a Windows - Linux interactive platform. A numerical model is built to test the parallel performance of FEM solver. Numerical simulations are then carried out on a laboratory-scale uniaxial compression test, and field-scale net fracture spacing and engineering-scale rock slope examples, respectively. The simulation results indicate that relatively high speedup and computation efficiency can be achieved by the parallel FEM solver with a reasonable boot process. In laboratory-scale simulation, the well-known physical phenomena, such as the macroscopic fracture pattern and stress-strain responses, can be reproduced. In field-scale simulation, the formation process of net fracture spacing from initiation, propagation to saturation can be revealed completely. In engineering-scale simulation, the whole progressive failure process of the rock slope can be well modeled. It is shown that the parallel FE simulator developed in this study is an efficient tool for modeling the whole trans-scale failure process of rock from meso- to engineering-scale.
NASA Astrophysics Data System (ADS)
Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; Ng, Esmond G.; Maris, Pieter; Vary, James P.
2018-01-01
We describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. We also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
Spatial adaptive sampling in multiscale simulation
NASA Astrophysics Data System (ADS)
Rouet-Leduc, Bertrand; Barros, Kipton; Cieren, Emmanuel; Elango, Venmugil; Junghans, Christoph; Lookman, Turab; Mohd-Yusof, Jamaludin; Pavel, Robert S.; Rivera, Axel Y.; Roehm, Dominic; McPherson, Allen L.; Germann, Timothy C.
2014-07-01
In a common approach to multiscale simulation, an incomplete set of macroscale equations must be supplemented with constitutive data provided by fine-scale simulation. Collecting statistics from these fine-scale simulations is typically the overwhelming computational cost. We reduce this cost by interpolating the results of fine-scale simulation over the spatial domain of the macro-solver. Unlike previous adaptive sampling strategies, we do not interpolate on the potentially very high dimensional space of inputs to the fine-scale simulation. Our approach is local in space and time, avoids the need for a central database, and is designed to parallelize well on large computer clusters. To demonstrate our method, we simulate one-dimensional elastodynamic shock propagation using the Heterogeneous Multiscale Method (HMM); we find that spatial adaptive sampling requires only ≈ 50 ×N0.14 fine-scale simulations to reconstruct the stress field at all N grid points. Related multiscale approaches, such as Equation Free methods, may also benefit from spatial adaptive sampling.
Menzies, Kevin
2014-08-13
The growth in simulation capability over the past 20 years has led to remarkable changes in the design process for gas turbines. The availability of relatively cheap computational power coupled to improvements in numerical methods and physical modelling in simulation codes have enabled the development of aircraft propulsion systems that are more powerful and yet more efficient than ever before. However, the design challenges are correspondingly greater, especially to reduce environmental impact. The simulation requirements to achieve a reduced environmental impact are described along with the implications of continued growth in available computational power. It is concluded that achieving the environmental goals will demand large-scale multi-disciplinary simulations requiring significantly increased computational power, to enable optimization of the airframe and propulsion system over the entire operational envelope. However even with massive parallelization, the limits imposed by communications latency will constrain the time required to achieve a solution, and therefore the position of such large-scale calculations in the industrial design process. © 2014 The Author(s) Published by the Royal Society. All rights reserved.
Large-Scale NASA Science Applications on the Columbia Supercluster
NASA Technical Reports Server (NTRS)
Brooks, Walter
2005-01-01
Columbia, NASA's newest 61 teraflops supercomputer that became operational late last year, is a highly integrated Altix cluster of 10,240 processors, and was named to honor the crew of the Space Shuttle lost in early 2003. Constructed in just four months, Columbia increased NASA's computing capability ten-fold, and revitalized the Agency's high-end computing efforts. Significant cutting-edge science and engineering simulations in the areas of space and Earth sciences, as well as aeronautics and space operations, are already occurring on this largest operational Linux supercomputer, demonstrating its capacity and capability to accelerate NASA's space exploration vision. The presentation will describe how an integrated environment consisting not only of next-generation systems, but also modeling and simulation, high-speed networking, parallel performance optimization, and advanced data analysis and visualization, is being used to reduce design cycle time, accelerate scientific discovery, conduct parametric analysis of multiple scenarios, and enhance safety during the life cycle of NASA missions. The talk will conclude by discussing how NAS partnered with various NASA centers, other government agencies, computer industry, and academia, to create a national resource in large-scale modeling and simulation.
Gallicchio, Emilio; Deng, Nanjie; He, Peng; Wickstrom, Lauren; Perryman, Alexander L.; Santiago, Daniel N.; Forli, Stefano; Olson, Arthur J.; Levy, Ronald M.
2014-01-01
As part of the SAMPL4 blind challenge, filtered AutoDock Vina ligand docking predictions and large scale binding energy distribution analysis method binding free energy calculations have been applied to the virtual screening of a focused library of candidate binders to the LEDGF site of the HIV integrase protein. The computational protocol leveraged docking and high level atomistic models to improve enrichment. The enrichment factor of our blind predictions ranked best among all of the computational submissions, and second best overall. This work represents to our knowledge the first example of the application of an all-atom physics-based binding free energy model to large scale virtual screening. A total of 285 parallel Hamiltonian replica exchange molecular dynamics absolute protein-ligand binding free energy simulations were conducted starting from docked poses. The setup of the simulations was fully automated, calculations were distributed on multiple computing resources and were completed in a 6-weeks period. The accuracy of the docked poses and the inclusion of intramolecular strain and entropic losses in the binding free energy estimates were the major factors behind the success of the method. Lack of sufficient time and computing resources to investigate additional protonation states of the ligands was a major cause of mispredictions. The experiment demonstrated the applicability of binding free energy modeling to improve hit rates in challenging virtual screening of focused ligand libraries during lead optimization. PMID:24504704
Scaling Semantic Graph Databases in Size and Performance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morari, Alessandro; Castellana, Vito G.; Villa, Oreste
In this paper we present SGEM, a full software system for accelerating large-scale semantic graph databases on commodity clusters. Unlike current approaches, SGEM addresses semantic graph databases by only employing graph methods at all the levels of the stack. On one hand, this allows exploiting the space efficiency of graph data structures and the inherent parallelism of graph algorithms. These features adapt well to the increasing system memory and core counts of modern commodity clusters. On the other hand, however, these systems are optimized for regular computation and batched data transfers, while graph methods usually are irregular and generate fine-grainedmore » data accesses with poor spatial and temporal locality. Our framework comprises a SPARQL to data parallel C compiler, a library of parallel graph methods and a custom, multithreaded runtime system. We introduce our stack, motivate its advantages with respect to other solutions and show how we solved the challenges posed by irregular behaviors. We present the result of our software stack on the Berlin SPARQL benchmarks with datasets up to 10 billion triples (a triple corresponds to a graph edge), demonstrating scaling in dataset size and in performance as more nodes are added to the cluster.« less
paraGSEA: a scalable approach for large-scale gene expression profiling
Peng, Shaoliang; Yang, Shunyun
2017-01-01
Abstract More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the estimation of significance level step and multiple hypothesis testing step, the computation scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. By optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which contributes more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. By further parallelization, a near-linear speed-up is gained on both workstations and clusters in an efficient manner with high scalability and performance on large-scale datasets. The analysis time of whole LINCS phase I dataset (GSE92742) was reduced to nearly half hour on a 1000 node cluster on Tianhe-2, or within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA. PMID:28973463
GW/Bethe-Salpeter calculations for charged and model systems from real-space DFT
NASA Astrophysics Data System (ADS)
Strubbe, David A.
GW and Bethe-Salpeter (GW/BSE) calculations use mean-field input from density-functional theory (DFT) calculations to compute excited states of a condensed-matter system. Many parts of a GW/BSE calculation are efficiently performed in a plane-wave basis, and extensive effort has gone into optimizing and parallelizing plane-wave GW/BSE codes for large-scale computations. Most straightforwardly, plane-wave DFT can be used as a starting point, but real-space DFT is also an attractive starting point: it is systematically convergeable like plane waves, can take advantage of efficient domain parallelization for large systems, and is well suited physically for finite and especially charged systems. The flexibility of a real-space grid also allows convenient calculations on non-atomic model systems. I will discuss the interfacing of a real-space (TD)DFT code (Octopus, www.tddft.org/programs/octopus) with a plane-wave GW/BSE code (BerkeleyGW, www.berkeleygw.org), consider performance issues and accuracy, and present some applications to simple and paradigmatic systems that illuminate fundamental properties of these approximations in many-body perturbation theory.
Efficient Predictions of Excited State for Nanomaterials Using Aces 3 and 4
2017-12-20
by first-principle methods in the software package ACES by using large parallel computers, growing tothe exascale. 15. SUBJECT TERMS Computer...modeling, excited states, optical properties, structure, stability, activation barriers first principle methods , parallel computing 16. SECURITY...2 Progress with new density functional methods