parallel processing version: Topics by Science.gov

Sample records for parallel processing version

GWM-VI: groundwater management with parallel processing for multiple MODFLOW versions

USGS Publications Warehouse

Banta, Edward R.; Ahlfeld, David P.

2013-01-01

Groundwater Management–Version Independent (GWM–VI) is a new version of the Groundwater Management Process of MODFLOW. The Groundwater Management Process couples groundwater-flow simulation with a capability to optimize stresses on the simulated aquifer based on an objective function and constraints imposed on stresses and aquifer state. GWM–VI extends prior versions of Groundwater Management in two significant ways—(1) it can be used with any version of MODFLOW that meets certain requirements on input and output, and (2) it is structured to allow parallel processing of the repeated runs of the MODFLOW model that are required to solve the optimization problem. GWM–VI uses the same input structure for files that describe the management problem as that used by prior versions of Groundwater Management. GWM–VI requires only minor changes to the input files used by the MODFLOW model. GWM–VI uses the Joint Universal Parameter IdenTification and Evaluation of Reliability Application Programming Interface (JUPITER-API) to implement both version independence and parallel processing. GWM–VI communicates with the MODFLOW model by manipulating certain input files and interpreting results from the MODFLOW listing file and binary output files. Nearly all capabilities of prior versions of Groundwater Management are available in GWM–VI. GWM–VI has been tested with MODFLOW-2005, MODFLOW-NWT (a Newton formulation for MODFLOW-2005), MF2005-FMP2 (the Farm Process for MODFLOW-2005), SEAWAT, and CFP (Conduit Flow Process for MODFLOW-2005). This report provides sample problems that demonstrate a range of applications of GWM–VI and the directory structure and input information required to use the parallel-processing capability.
PyPele Rewritten To Use MPI

NASA Technical Reports Server (NTRS)

Hockney, George; Lee, Seungwon

2008-01-01

A computer program known as PyPele, originally written as a Python-language extension module of a C++ language program, has been rewritten in pure Python language. The original version of PyPele dispatches and coordinates parallel-processing tasks on cluster computers and provides a conceptual framework for spacecraft-mission-design and -analysis software tools to run in an embarrassingly parallel mode. The original version of PyPele uses SSH (Secure Shell, a set of standards and an associated network protocol for establishing a secure channel between a local and a remote computer) to coordinate parallel processing. Instead of SSH, the present Python version of PyPele uses Message Passing Interface (MPI) [an unofficial de-facto standard language-independent application programming interface for message-passing on a parallel computer] while keeping the same user interface. The use of MPI instead of SSH and the preservation of the original PyPele user interface make it possible for parallel application programs written previously for the original version of PyPele to run on MPI-based cluster computers. As a result, engineers using the previously written application programs can take advantage of embarrassing parallelism without need to rewrite those programs.
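The embarrassingly parallel mode described above can be illustrated with a minimal sketch. It is not PyPele's interface and does not use MPI or SSH: a standard-library thread pool stands in for the cluster dispatcher, and `evaluate_trajectory` is a hypothetical placeholder for one independent mission-analysis task.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_trajectory(launch_day):
    """Hypothetical stand-in for one independent analysis task."""
    # Toy cost function; no task reads or writes another task's data,
    # which is what makes the workload embarrassingly parallel.
    return launch_day, (launch_day - 40) ** 2 + 3


def run_embarrassingly_parallel(tasks, max_workers=4):
    """Dispatch independent tasks and gather results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_trajectory, tasks))


results = run_embarrassingly_parallel(range(0, 100, 10))
best_day, best_cost = min(results, key=lambda r: r[1])
```

Because the tasks never communicate, the dispatcher can be swapped (threads, processes, SSH, MPI) without touching the task function, which is the property the abstract relies on when it keeps the user interface unchanged.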
DOE Office of Scientific and Technical Information (OSTI.GOV)

Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
The Portals 4.0 network programming interface.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin

2012-11-01

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
The portals 4.0.1 network programming interface.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin

2013-04-01

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
Optics Program Modified for Multithreaded Parallel Computing

NASA Technical Reports Server (NTRS)

Lou, John; Bedding, Dave; Basinger, Scott

2006-01-01

A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application programming interface that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX threads, where "POSIX" stands for Portable Operating System Interface]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.
Accelerating adaptive inverse distance weighting interpolation algorithm on a graphics processing unit

PubMed Central

Xu, Liangliang; Xu, Nengxiong

2017-01-01

This paper focuses on designing and implementing parallel adaptive inverse distance weighting (AIDW) interpolation algorithms by using the graphics processing unit (GPU). The AIDW is an improved version of the standard IDW, which can adaptively determine the power parameter according to the data points’ spatial distribution pattern and achieve more accurate predictions than those predicted by IDW. In this paper, we first present two versions of the GPU-accelerated AIDW, i.e. the naive version without profiting from the shared memory and the tiled version taking advantage of the shared memory. We also implement the naive version and the tiled version using two data layouts, structure of arrays and array of aligned structures, on both single and double precision. We then evaluate the performance of parallel AIDW by comparing it with its corresponding serial algorithm on three different machines equipped with the GPUs GT730M, M5000 and K40c. The experimental results indicate that: (i) there is no significant difference in the computational efficiency when different data layouts are employed; (ii) the tiled version is always slightly faster than the naive version; and (iii) on single precision the achieved speed-up can be up to 763 (on the GPU M5000), while on double precision the obtained highest speed-up is 197 (on the GPU K40c). To benefit the community, all source code and testing data related to the presented parallel AIDW algorithm are publicly available. PMID:28989754
Accelerating adaptive inverse distance weighting interpolation algorithm on a graphics processing unit.

PubMed

Mei, Gang; Xu, Liangliang; Xu, Nengxiong

2017-09-01

This paper focuses on designing and implementing parallel adaptive inverse distance weighting (AIDW) interpolation algorithms by using the graphics processing unit (GPU). The AIDW is an improved version of the standard IDW, which can adaptively determine the power parameter according to the data points' spatial distribution pattern and achieve more accurate predictions than those predicted by IDW. In this paper, we first present two versions of the GPU-accelerated AIDW, i.e. the naive version without profiting from the shared memory and the tiled version taking advantage of the shared memory. We also implement the naive version and the tiled version using two data layouts, structure of arrays and array of aligned structures, on both single and double precision. We then evaluate the performance of parallel AIDW by comparing it with its corresponding serial algorithm on three different machines equipped with the GPUs GT730M, M5000 and K40c. The experimental results indicate that: (i) there is no significant difference in the computational efficiency when different data layouts are employed; (ii) the tiled version is always slightly faster than the naive version; and (iii) on single precision the achieved speed-up can be up to 763 (on the GPU M5000), while on double precision the obtained highest speed-up is 197 (on the GPU K40c). To benefit the community, all source code and testing data related to the presented parallel AIDW algorithm are publicly available.
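For orientation, the standard IDW estimate that AIDW builds on can be sketched in a few lines. This is a serial, fixed-power illustration only: the adaptive selection of the power parameter and the GPU kernels from the paper are not reproduced, and the function name and signature are assumptions made for this example.

```python
import math


def idw_interpolate(points, values, query, power=2.0):
    """Standard inverse distance weighting at one query location.

    Each known value is weighted by 1 / distance**power. AIDW would
    choose `power` per query point from the local point density; here
    it is a fixed argument.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (px, py), value in zip(points, values):
        dist = math.hypot(query[0] - px, query[1] - py)
        if dist == 0.0:
            # The query coincides with a data point: return it exactly.
            return value
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


midpoint = idw_interpolate([(0.0, 0.0), (2.0, 0.0)], [1.0, 3.0], (1.0, 0.0))
```

Every query point is computed independently of the others, which is why the method maps naturally onto one GPU thread per interpolated point.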
User's Guide for TOUGH2-MP - A Massively Parallel Version of the TOUGH2 Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Earth Sciences Division; Zhang, Keni; Zhang, Keni

TOUGH2-MP is a massively parallel (MP) version of the TOUGH2 code, designed for computationally efficient parallel simulation of isothermal and nonisothermal flows of multicomponent, multiphase fluids in one, two, and three-dimensional porous and fractured media. In recent years, computational requirements have become increasingly intensive in large or highly nonlinear problems for applications in areas such as radioactive waste disposal, CO2 geological sequestration, environmental assessment and remediation, reservoir engineering, and groundwater hydrology. The primary objective of developing the parallel-simulation capability is to significantly improve the computational performance of the TOUGH2 family of codes. The particular goal for the parallel simulator is to achieve orders-of-magnitude improvement in computational time for models with ever-increasing complexity. TOUGH2-MP is designed to perform parallel simulation on multi-CPU computational platforms. An earlier version of TOUGH2-MP (V1.0) was based on the TOUGH2 Version 1.4 with EOS3, EOS9, and T2R3D modules, a software previously qualified for applications in the Yucca Mountain project, and was designed for execution on CRAY T3E and IBM SP supercomputers. The current version of TOUGH2-MP (V2.0) includes all fluid property modules of the standard version TOUGH2 V2.0. It provides computationally efficient capabilities using supercomputers, Linux clusters, or multi-core PCs, and also offers many user-friendly features. The parallel simulator inherits all process capabilities from V2.0 together with additional capabilities for handling fractured media from V1.4.
This report provides a quick starting guide on how to set up and run the TOUGH2-MP program for users with a basic knowledge of running the (standard) version TOUGH2 code. The report also gives a brief technical description of the code, including a discussion of parallel methodology, code structure, as well as mathematical and numerical methods used. To familiarize users with the parallel code, illustrative sample problems are presented.
NDL-v2.0: A new version of the numerical differentiation library for parallel architectures

NASA Astrophysics Data System (ADS)

Hadjidoukas, P. E.; Angelikopoulos, P.; Voglis, C.; Papageorgiou, D. G.; Lagaris, I. E.

2014-07-01

We present a new version of the numerical differentiation library (NDL) used for the numerical estimation of first and second order partial derivatives of a function by finite differencing. In this version we have restructured the serial implementation of the code so as to achieve optimal task-based parallelization. The pure shared-memory parallelization of the library has been based on the lightweight OpenMP tasking model allowing for the full extraction of the available parallelism and efficient scheduling of multiple concurrent library calls. On multicore clusters, parallelism is exploited by means of TORC, an MPI-based multi-threaded tasking library. The new MPI implementation of NDL provides optimal performance in terms of function calls and, furthermore, supports asynchronous execution of multiple library calls within legacy MPI programs. In addition, a Python interface has been implemented for all cases, exporting the functionality of our library to sequential Python codes. Catalog identifier: AEDG_v2_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDG_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 63036 No. of bytes in distributed program, including test data, etc.: 801872 Distribution format: tar.gz Programming language: ANSI Fortran-77, ANSI C, Python. Computer: Distributed systems (clusters), shared memory systems. Operating system: Linux, Unix. Has the code been vectorized or parallelized?: Yes. RAM: The library uses O(N) internal storage, N being the dimension of the problem. It can use up to O(N^2) internal storage for Hessian calculations, if a task throttling factor has not been set by the user. Classification: 4.9, 4.14, 6.5. Catalog identifier of previous version: AEDG_v1_0 Journal reference of previous version: Comput. Phys. Comm. 180 (2009) 1404
Does the new version supersede the previous version?: Yes Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, and sensitivity analysis. For a large number of scientific and engineering applications, the underlying functions correspond to simulation codes for which analytical estimation of derivatives is difficult or almost impossible. A parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems. Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries. Reasons for new version: The updated version was motivated by our endeavors to extend a parallel Bayesian uncertainty quantification framework [1], by incorporating higher order derivative information as in most state-of-the-art stochastic simulation methods such as Stochastic Newton MCMC [2] and Riemannian Manifold Hamiltonian MC [3]. The function evaluations are simulations with significant time-to-solution, which also varies with the input parameters such as in [1, 4]. The runtime of the N-body-type of problem changes considerably with the introduction of a longer cut-off between the bodies. In the first version of the library, the OpenMP-parallel subroutines spawn a new team of threads and distribute the function evaluations with a PARALLEL DO directive. This limits the functionality of the library as multiple concurrent calls require nested parallelism support from the OpenMP environment. Therefore, either their function evaluations will be serialized or processor oversubscription is likely to occur due to the increased number of OpenMP threads.
In addition, the Hessian calculations include two explicit parallel regions that compute first the diagonal and then the off-diagonal elements of the array. Due to the barrier between the two regions, the parallelism of the calculations is not fully exploited. These issues have been addressed in the new version by first restructuring the serial code and then running the function evaluations in parallel using OpenMP tasks. Although the MPI-parallel implementation of the first version is capable of fully exploiting the task parallelism of the PNDL routines, it does not utilize the caching mechanism of the serial code and, therefore, performs some redundant function evaluations in the Hessian and Jacobian calculations. This can lead to: (a) higher execution times if the number of available processors is lower than the total number of tasks, and (b) significant energy consumption due to wasted processor cycles. Overcoming these drawbacks, which become critical as the time of a single function evaluation increases, was the primary goal of this new version. Due to the code restructure, the MPI-parallel implementation (and the OpenMP-parallel in accordance) avoids redundant calls, providing optimal performance in terms of the number of function evaluations. Another limitation of the library was that the library subroutines were collective and synchronous calls. In the new version, each MPI process can issue any number of subroutines for asynchronous execution. We introduce two library calls that provide global and local task synchronizations, similarly to the BARRIER and TASKWAIT directives of OpenMP. The new MPI-implementation is based on TORC, a new tasking library for multicore clusters [5-7]. TORC improves the portability of the software, as it relies exclusively on the POSIX-Threads and MPI programming interfaces. 
It allows MPI processes to utilize multiple worker threads, offering a hybrid programming and execution environment similar to MPI+OpenMP, in a completely transparent way. Finally, to further improve the usability of our software, a Python interface has been implemented on top of both the OpenMP and MPI versions of the library. This allows sequential Python codes to exploit shared and distributed memory systems. Summary of revisions: The revised code improves the performance of both parallel (OpenMP and MPI) implementations. The functionality and the user-interface of the MPI-parallel version have been extended to support the asynchronous execution of multiple PNDL calls, issued by one or multiple MPI processes. A new underlying tasking library increases portability and allows MPI processes to have multiple worker threads. For both implementations, an interface to the Python programming language has been added. Restrictions: The library uses only double precision arithmetic. The MPI implementation assumes the homogeneity of the execution environment provided by the operating system. Specifically, the processes of a single MPI application must have identical address space and a user function resides at the same virtual address. In addition, address space layout randomization should not be used for the application. Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed. Running time: Running time depends on the function's complexity. The test run took 23 ms for the serial distribution, 25 ms for the OpenMP with 2 threads, 53 ms and 1.01 s for the MPI parallel distribution using 2 threads and 2 processes respectively and yield-time for idle workers equal to 10 ms.
References: [1] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Bayesian uncertainty quantification and propagation in molecular dynamics simulations: a high performance computing framework, J. Chem. Phys 137 (14). [2] H.P. Flath, L.C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations, SIAM J. Sci. Comput. 33 (1) (2011) 407-432. [3] M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73 (2) (2011) 123-214. [4] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Data driven, predictive molecular dynamics for nanoscale flow simulations under uncertainty, J. Phys. Chem. B 117 (47) (2013) 14808-14816. [5] P.E. Hadjidoukas, E. Lappas, V.V. Dimakopoulos, A runtime library for platform-independent task parallelism, in: PDP, IEEE, 2012, pp. 229-236. [6] C. Voglis, P.E. Hadjidoukas, D.G. Papageorgiou, I. Lagaris, A parallel hybrid optimization algorithm for fitting interatomic potentials, Appl. Soft Comput. 13 (12) (2013) 4481-4492. [7] P.E. Hadjidoukas, C. Voglis, V.V. Dimakopoulos, I. Lagaris, D.G. Papageorgiou, Supporting adaptive and irregular parallelism for non-linear numerical optimization, Appl. Math. Comput. 231 (2014) 544-559.
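The solution method summarized above (finite differencing with a step that balances truncation and round-off error, with function evaluations run as parallel tasks) can be sketched as follows. This is not the NDL interface: the function name is hypothetical, the step uses the common textbook rule h = eps^(1/3) * max(|x_i|, 1) for central differences rather than the library's own formula, and a thread pool stands in for OpenMP tasks or TORC.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def central_gradient(f, x, max_workers=4):
    """Central-difference gradient with independent evaluations as tasks."""
    eps = sys.float_info.epsilon
    steps = []
    points = []
    for i, xi in enumerate(x):
        # Textbook step for central differences (an assumption here):
        # it roughly balances the O(h^2) truncation error against
        # the O(eps / h) round-off error.
        h = eps ** (1.0 / 3.0) * max(abs(xi), 1.0)
        steps.append(h)
        for sign in (1.0, -1.0):
            shifted = list(x)
            shifted[i] = xi + sign * h
            points.append(shifted)
    # Every shifted point is an independent function evaluation.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        fvals = list(pool.map(f, points))
    return [
        (fvals[2 * i] - fvals[2 * i + 1]) / (2.0 * steps[i])
        for i in range(len(x))
    ]


def sample_function(p):
    """Toy objective with known gradient (2*x + 3*y, 3*x)."""
    return p[0] ** 2 + 3.0 * p[0] * p[1]


grad = central_gradient(sample_function, [1.0, 2.0])
```

For a real simulation code each evaluation is expensive, so the 2N evaluations dominate the cost and dividing them among workers is where the parallel benefit comes from.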
Visualization Co-Processing of a CFD Simulation

NASA Technical Reports Server (NTRS)

Vaziri, Arsi

1999-01-01

OVERFLOW, a widely used CFD simulation code, is combined with a visualization system, pV3, to experiment with an environment for simulation/visualization co-processing on an SGI Origin 2000 (O2K) computer system. The shared memory version of the solver is used with the O2K 'pfa' preprocessor invoked to automatically discover parallelism in the source code. No other explicit parallelism is enabled. In order to study the scaling and performance of the visualization co-processing system, sample runs are made with different processor groups in the range of 1 to 254 processors. The data exchange between the visualization system and the simulation system is rapid enough for user interactivity when the problem size is small. This shared memory version of OVERFLOW, with minimal parallelization, does not scale well to an increasing number of available processors. The visualization task takes about 18 to 30% of the total processing time and does not appear to be a major contributor to the poor scaling. Improper load balancing and inter-processor communication overhead are contributors to this poor performance. Work is in progress which is aimed at obtaining improved parallel performance of the solver and removing the limitations of serial data transfer to pV3 by examining various parallelization/communication strategies, including the use of explicit message passing.
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code

NASA Astrophysics Data System (ADS)

Simunovic, S.; Zacharia, T.; Baltas, N.; Spalding, D. B.

1995-03-01

PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using Message Passing Interface (MPI) standard. Implementation of MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. The Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Simunovic, S.; Zacharia, T.; Baltas, N.

1995-04-01

PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using Message Passing Interface (MPI) standard. Implementation of MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. The Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.
Parallel approaches to composite production: interfaces that behave contrary to expectation.

PubMed

Frowd, Charlie D; Bruce, Vicki; Ness, Hayley; Bowie, Leslie; Paterson, Jenny; Thomson-Bogner, Claire; McIntyre, Alexander; Hancock, Peter J B

2007-04-01

This paper examines two facial composite systems that present multiple faces during construction to more closely resemble natural face processing. A 'parallel' version of PRO-fit was evaluated, which presents facial features in sets of six or twelve, and EvoFIT, a system in development, which contains a holistic face model and an evolutionary interface. The PRO-fit parallel interface turned out not to be quite as good as the 'serial' version as it appeared to interfere with holistic face processing. Composites from EvoFIT were named almost three times better than PRO-fit, but a benefit emerged under feature encoding, suggesting that recall has a greater role for EvoFIT than was previously thought. In general, an advantage was found for feature encoding, replicating a previous finding in this area, and also for a novel 'holistic' interview.
Parallel computation for biological sequence comparison: comparing a portable model to the native model for the Intel Hypercube.

PubMed

Nadkarni, P M; Miller, P L

1991-01-01

A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations.
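The machine-independent model mentioned above coordinates processes through a shared tuple space, using Linda's `out` (deposit a tuple) and `in` (withdraw a matching tuple) operations. The sketch below imitates that style inside one Python process with threads. It is an illustration only: the `TupleSpace` class, the toy `score` function and the poison-pill shutdown are assumptions made for this example and do not come from the paper's program.

```python
import threading


class TupleSpace:
    """Tiny in-process imitation of a Linda tuple space."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *tup):
        """Deposit a tuple and wake any waiting takers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def in_(self, *pattern):
        """Blocking withdraw; None in the pattern matches any field."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if len(tup) == len(pattern) and all(
                        p is None or p == t for p, t in zip(pattern, tup)
                    ):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()


def score(a, b):
    """Toy similarity: number of positions with identical symbols."""
    return sum(1 for x, y in zip(a, b) if x == y)


def worker(space):
    while True:
        _, idx, query, target = space.in_("task", None, None, None)
        if idx < 0:  # poison pill: no more work
            return
        space.out("result", idx, score(query, target))


def compare_all(query, database, n_workers=3):
    space = TupleSpace()
    threads = [
        threading.Thread(target=worker, args=(space,)) for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for idx, target in enumerate(database):
        space.out("task", idx, query, target)
    results = [space.in_("result", i, None)[2] for i in range(len(database))]
    for _ in threads:
        space.out("task", -1, "", "")
    for t in threads:
        t.join()
    return results


scores = compare_all("ACGT", ["ACGT", "AGGT", "TTTT"])
```

Workers never address each other directly; they only exchange tuples, which is what lets the same program run on different machines when the tuple space implementation changes underneath it.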
Comparison of multihardware parallel implementations for a phase unwrapping algorithm

NASA Astrophysics Data System (ADS)

Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo

2018-04-01

Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and more accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function, achieved by means of a serial Gauss-Seidel kind of algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is of a parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel at the same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel multicore architectures such as CPU-multicore, Xeon Phi coprocessor, and Nvidia graphics processing unit. In all the cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance of the developed parallel versions.
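The chessboard (red-black) update order described above can be shown on a simpler problem. The sketch below applies it to four-neighbour averaging on a small grid with fixed boundary values. It is not the accumulation-of-residual-maps cost function and not a multigrid scheme; the function names and the test grid are assumptions chosen to keep the example short.

```python
def red_black_sweep(grid):
    """One chessboard sweep: update all red cells, then all black cells.

    A cell's four neighbours always have the opposite colour, so every
    update within one colour reads only values of the other colour and
    the updates could run in parallel.
    """
    rows, cols = len(grid), len(grid[0])
    for color in (0, 1):
        updates = {}
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                if (i + j) % 2 == color:
                    updates[(i, j)] = 0.25 * (
                        grid[i - 1][j] + grid[i + 1][j]
                        + grid[i][j - 1] + grid[i][j + 1]
                    )
        # Apply the whole colour at once, as a parallel step would.
        for (i, j), value in updates.items():
            grid[i][j] = value
    return grid


def make_grid(n):
    """Boundary cells hold their column index; interior starts at zero."""
    return [
        [float(j) if i in (0, n - 1) or j in (0, n - 1) else 0.0
         for j in range(n)]
        for i in range(n)
    ]


grid = make_grid(5)
for _ in range(200):
    red_black_sweep(grid)
```

Collecting each colour's updates before applying them gives the same result as updating in place, which is the independence property that lets the red and black half-sweeps be distributed over cores or GPU threads.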
Parallel computation for biological sequence comparison: comparing a portable model to the native model for the Intel Hypercube.

PubMed Central

Nadkarni, P. M.; Miller, P. L.

1991-01-01

A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations. PMID:1807632
Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Rizk, Yehia M.

1999-01-01

The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable for distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of a "Gauss-Seidel" fashion. Programming the _MPI code is more complicated than programming the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones to a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. 
The total volume of grid points in each group is approximately balanced. A proper number of threads is initially allocated to each group, and in subsequent iterations during run time, the number of threads is adjusted to achieve load balancing across the processes. Each process exploits the multitasking directives already established in OVERFLOW.
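
The difference between the explicit "Jacobi" style update and the "Gauss-Seidel" style update mentioned in this abstract can be illustrated on a small linear system. The sketch below is not taken from OVERFLOW; the matrix, the function names, and the stopping rule are assumptions chosen purely for illustration. A Jacobi sweep reads only values from the previous sweep, so its updates are independent of one another, whereas a Gauss-Seidel sweep reuses values already refreshed in the current sweep.

```python
def jacobi_sweep(a, b, x):
    """One Jacobi sweep: every update reads only the previous iterate."""
    n = len(b)
    x_new = [0.0] * n
    for i in range(n):
        s = sum(a[i][j] * x[j] for j in range(n) if j != i)
        x_new[i] = (b[i] - s) / a[i][i]
    return x_new


def gauss_seidel_sweep(a, b, x):
    """One Gauss-Seidel sweep: updates reuse values refreshed earlier in the sweep."""
    n = len(b)
    x = list(x)  # work on a copy so the caller's iterate is untouched
    for i in range(n):
        s = sum(a[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / a[i][i]
    return x


def iterate(sweep, a, b, tol=1e-10, max_sweeps=10_000):
    """Repeat a sweep until successive iterates differ by less than tol."""
    x = [0.0] * len(b)
    for count in range(1, max_sweeps + 1):
        x_next = sweep(a, b, x)
        if max(abs(p - q) for p, q in zip(x_next, x)) < tol:
            return x_next, count
        x = x_next
    return x, max_sweeps


# Illustrative diagonally dominant system (an assumption, not OVERFLOW data).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
B = [2.0, 4.0, 10.0]

for name, sweep in (("Jacobi", jacobi_sweep), ("Gauss-Seidel", gauss_seidel_sweep)):
    solution, sweeps = iterate(sweep, A, B)
    print(name, sweeps, [round(v, 6) for v in solution])
```

On this system both iterations reach the same solution, with the Gauss-Seidel variant needing fewer sweeps. The Jacobi form gives that up in exchange for updates that do not depend on one another, which is what makes it a natural fit for exchanging boundary data between processes once per time step.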
Density-based parallel skin lesion border detection with webCL

PubMed Central

2015-01-01

Background: Dermoscopy is a highly effective and noninvasive imaging technique used in the diagnosis of melanoma and other pigmented skin lesions. Many aspects of the lesion under consideration are defined in relation to the lesion border. This makes border detection one of the most important steps in dermoscopic image analysis. In current practice, dermatologists often delineate borders through a hand-drawn representation based upon visual inspection. Due to the subjective nature of this technique, intra- and inter-observer variations are common. Because of this, the automated assessment of lesion borders in dermoscopic images has become an important area of study. Methods: A fast density-based skin lesion border detection method has been implemented in parallel with a new parallel technology called WebCL. WebCL utilizes client-side computing capabilities to use available hardware resources such as multiple cores and GPUs. The developed WebCL-parallel density-based skin lesion border detection method runs efficiently from internet browsers. Results: Previous research indicates that one of the highest accuracy rates can be achieved using density-based clustering techniques for skin lesion border detection. While these algorithms do have unfavorable time complexities, this effect can be mitigated when they are implemented in parallel. In this study, a density-based clustering technique for skin lesion border detection is parallelized and redesigned to run very efficiently on heterogeneous platforms (e.g. tablets, smartphones, multi-core CPUs, GPUs, and fully integrated Accelerated Processing Units) by transforming the technique into a series of independent concurrent operations. Heterogeneous computing is adopted to support accessibility, portability and multi-device use in clinical settings. For this, we used WebCL, an emerging technology that enables an HTML5 Web browser to execute code in parallel on heterogeneous platforms. We describe WebCL and our parallel algorithm design. 
In addition, we tested the parallel code on 100 dermoscopy images and showed the execution speedups with respect to the serial version. Results indicate that the parallel (WebCL) version and the serial version of the density-based lesion border detection method generate the same accuracy rates for 100 dermoscopy images, in which the mean border error is 6.94%, the mean recall is 76.66%, and the mean precision is 99.29%. Moreover, the WebCL version's speedup factor for lesion border detection on the 100 dermoscopy images averages around 491.2. Conclusions: When the large number of high-resolution dermoscopy images in a usual clinical setting is considered along with the critical importance of early detection and diagnosis of melanoma before metastasis, the importance of fast processing of dermoscopy images becomes obvious. In this paper, we introduce WebCL and its use for biomedical image processing applications. WebCL is a JavaScript binding of OpenCL, which takes advantage of GPU computing from a web browser. Therefore, the WebCL parallel version of density-based skin lesion border detection introduced in this study can supplement expert dermatologists and aid them in the early diagnosis of skin lesions. While WebCL is currently an emerging technology, a full adoption of WebCL into the HTML5 standard would allow this implementation to run on a very large set of hardware and software systems. WebCL takes full advantage of parallel computational resources including multi-cores and GPUs on a local machine, and allows compiled code to run directly from the Web browser. PMID:26423836
Density-based parallel skin lesion border detection with webCL.

PubMed

Lemon, James; Kockara, Sinan; Halic, Tansel; Mete, Mutlu

2015-01-01

Dermoscopy is a highly effective and noninvasive imaging technique used in the diagnosis of melanoma and other pigmented skin lesions. Many aspects of the lesion under consideration are defined in relation to the lesion border. This makes border detection one of the most important steps in dermoscopic image analysis. In current practice, dermatologists often delineate borders through a hand-drawn representation based upon visual inspection. Due to the subjective nature of this technique, intra- and inter-observer variations are common. Because of this, the automated assessment of lesion borders in dermoscopic images has become an important area of study. A fast density-based skin lesion border detection method has been implemented in parallel with a new parallel technology called WebCL. WebCL utilizes client-side computing capabilities to use available hardware resources such as multiple cores and GPUs. The developed WebCL-parallel density-based skin lesion border detection method runs efficiently from internet browsers. Previous research indicates that one of the highest accuracy rates can be achieved using density-based clustering techniques for skin lesion border detection. While these algorithms do have unfavorable time complexities, this effect can be mitigated when they are implemented in parallel. In this study, a density-based clustering technique for skin lesion border detection is parallelized and redesigned to run very efficiently on heterogeneous platforms (e.g. tablets, smartphones, multi-core CPUs, GPUs, and fully integrated Accelerated Processing Units) by transforming the technique into a series of independent concurrent operations. Heterogeneous computing is adopted to support accessibility, portability and multi-device use in clinical settings. For this, we used WebCL, an emerging technology that enables an HTML5 Web browser to execute code in parallel on heterogeneous platforms. We describe WebCL and our parallel algorithm design. 
In addition, we tested the parallel code on 100 dermoscopy images and showed the execution speedups with respect to the serial version. Results indicate that the parallel (WebCL) version and the serial version of the density-based lesion border detection method generate the same accuracy rates for 100 dermoscopy images, in which the mean border error is 6.94%, the mean recall is 76.66%, and the mean precision is 99.29%. Moreover, the WebCL version's speedup factor for lesion border detection on the 100 dermoscopy images averages around 491.2. When the large number of high-resolution dermoscopy images in a usual clinical setting is considered along with the critical importance of early detection and diagnosis of melanoma before metastasis, the importance of fast processing of dermoscopy images becomes obvious. In this paper, we introduce WebCL and its use for biomedical image processing applications. WebCL is a JavaScript binding of OpenCL, which takes advantage of GPU computing from a web browser. Therefore, the WebCL parallel version of density-based skin lesion border detection introduced in this study can supplement expert dermatologists and aid them in the early diagnosis of skin lesions. While WebCL is currently an emerging technology, a full adoption of WebCL into the HTML5 standard would allow this implementation to run on a very large set of hardware and software systems. WebCL takes full advantage of parallel computational resources including multi-cores and GPUs on a local machine, and allows compiled code to run directly from the Web browser.

Pathways to Renewable Hydrogen Video (Text Version) | Hydrogen and Fuel

Science.gov Websites

array of abundant, sugar rich plant-based material. A fermentation process in the lab breaks down the : The photobiological process in a way is a parallel of the fermentation. The only difference is now the
Parallelization of ARC3D with Computer-Aided Tools

NASA Technical Reports Server (NTRS)

Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)

1998-01-01

A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SGI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided tools, CAPTools. Steps of parallelizing this code and requirements of achieving better performance are discussed. The generated parallel version has achieved reasonably good performance, for example, a speedup of 30 on 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving the serial code and performing necessary code transformations are important parts of the automated parallelization process, although user intervention in many of these parts is still necessary. Nevertheless, development and improvement of useful software tools, such as CAPTools, can help trim down many tedious parallelization details and improve the processing efficiency.
Parallel processing on the Livermore VAX 11/780-4 parallel processor system with compatibility to Cray Research, Inc. (CRI) multitasking. Version 1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Werner, N.E.; Van Matre, S.W.

1985-05-01

This manual describes the CRI Subroutine Library and Utility Package. The CRI library provides Cray multitasking functionality on the four-processor shared memory VAX 11/780-4. Additional functionality has been added for more flexibility. A discussion of the library, utilities, error messages, and example programs is provided.
Creating a Parallel Version of VisIt for Microsoft Windows

DOE Office of Scientific and Technical Information (OSTI.GOV)

Whitlock, B J; Biagas, K S; Rawson, P L

2011-12-07

VisIt is a popular, free interactive parallel visualization and analysis tool for scientific data. Users can quickly generate visualizations from their data, animate them through time, manipulate them, and save the resulting images or movies for presentations. VisIt was designed from the ground up to work on many scales of computers from modest desktops up to massively parallel clusters. VisIt is comprised of a set of cooperating programs. All programs can be run locally or in client/server mode in which some run locally and some run remotely on compute clusters. The VisIt program most able to harness today's computing power is the VisIt compute engine. The compute engine is responsible for reading simulation data from disk, processing it, and sending results or images back to the VisIt viewer program. In a parallel environment, the compute engine runs several processes, coordinating using the Message Passing Interface (MPI) library. Each MPI process reads some subset of the scientific data and filters the data in various ways to create useful visualizations. By using MPI, VisIt has been able to scale well into the thousands of processors on large computers such as dawn and graph at LLNL. The advent of multicore CPUs has made parallelism the 'new' way to achieve increasing performance. With today's computers having at least 2 cores and in many cases up to 8 and beyond, it is more important than ever to deploy parallel software that can use that computing power not only on clusters but also on the desktop. We have created a parallel version of VisIt for Windows that uses Microsoft's MPI implementation (MSMPI) to process data in parallel on the Windows desktop as well as on a Windows HPC cluster running Microsoft Windows Server 2008. Initial desktop parallel support for Windows was deployed in VisIt 2.4.0. Windows HPC cluster support has been completed and will appear in the VisIt 2.5.0 release. 
We plan to continue supporting parallel VisIt on Windows so our users will be able to take full advantage of their multicore resources.
A parallel solver for huge dense linear systems

NASA Astrophysics Data System (ADS)

Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.

2011-11-01

HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) to facilitate the parallel solution of very large dense systems to scientists and engineers. The API makes use of parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies to leverage the secondary memory in order to solve huge linear systems O(100.000). The API is based on the parallel linear algebra library PLAPACK, and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to the users, hiding almost all the technical aspects related to the parallel execution of the code and the use of the secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending on the dimension of the system, the number of processes and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOP with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors.

New version program summary
Program title: Huge Dense System Solver (HDSS)
Catalogue identifier: AEHU_v1_1
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 87 062
No. of bytes in distributed program, including test data, etc.: 1 069 110
Distribution format: tar.gz
Programming language: Fortran90, C
Computer: Parallel architectures: multiprocessors, computer clusters
Operating system: Linux/Unix
Has the code been vectorized or parallelized?: Yes, includes MPI primitives.
RAM: Tested for up to 190 GB
Classification: 6.5
External routines: MPI (http://www.mpi-forum.org/), BLAS (http://www.netlib.org/blas/), PLAPACK (http://www.cs.utexas.edu/~plapack/), POOCLAPACK (ftp://ftp.cs.utexas.edu/pub/rvdg/PLAPACK/pooclapack.ps) (code for PLAPACK and POOCLAPACK is included in the distribution).
Catalogue identifier of previous version: AEHU_v1_0
Journal reference of previous version: Comput. Phys. Comm. 182 (2011) 533
Does the new version supersede the previous version?: Yes
Nature of problem: Huge scale dense systems of linear equations, Ax=B, beyond standard LAPACK capabilities.
Solution method: The linear systems are solved by means of parallelized routines based on the LU factorization, using efficient secondary storage algorithms when the available main memory is insufficient.
Reasons for new version: In many applications we need to guarantee a high accuracy in the solution of very large linear systems and we can do it by using double-precision arithmetic.
Summary of revisions: Version 1.1 can be used to solve linear systems using double-precision arithmetic. New version of the initialization routine. The user can choose the kind of arithmetic and the values of several parameters of the environment.
Running time: About 5 hours to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors using double-precision arithmetic on an eight-node commodity cluster with a total of 64 Intel cores.
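
The solution method named in this record, LU factorization applied to Ax=B with many right-hand sides, can be sketched serially and in core as follows. This is a minimal illustration in plain Python, not the HDSS, PLAPACK, or POOCLAPACK interface; the function name `lu_solve` and the list-of-rows matrix layout are assumptions, and the parallel and out-of-core aspects are deliberately left out.

```python
def lu_solve(a, b):
    """Solve A X = B by LU factorization with partial pivoting.

    a: n x n matrix as a list of rows.
    b: n x m matrix whose columns are the right-hand side vectors.
    Returns X as an n x m list of rows; the inputs are not modified.
    """
    n = len(a)
    m = len(b[0])
    lu = [row[:] for row in a]
    x = [row[:] for row in b]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            factor = lu[i][k] / lu[k][k]
            lu[i][k] = factor  # store the multiplier (entry of L)
            for j in range(k + 1, n):
                lu[i][j] -= factor * lu[k][j]
            # Apply the same elimination step to every right-hand side.
            for c in range(m):
                x[i][c] -= factor * x[k][c]
    # Back substitution with the upper triangular factor U.
    for i in range(n - 1, -1, -1):
        for c in range(m):
            s = sum(lu[i][j] * x[j][c] for j in range(i + 1, n))
            x[i][c] = (x[i][c] - s) / lu[i][i]
    return x


# Two right-hand sides solved with a single factorization.
solution = lu_solve([[2.0, 1.0], [1.0, 3.0]], [[3.0, 5.0], [5.0, 10.0]])
print(solution)
```

Factoring once and reusing the factors for every column of B is the reason a large number of right-hand sides adds comparatively little work on top of the factorization itself.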
MPgrafic: A parallel MPI version of Grafic-1

NASA Astrophysics Data System (ADS)

Prunet, Simon; Pichon, Christophe

2013-04-01

MPgrafic is a parallel MPI version of Grafic-1 which can produce large cosmological initial conditions on a cluster without requiring shared memory. The real Fourier transforms are carried out in place using fftw while minimizing the amount of memory used (at the expense of performance), in the spirit of Grafic-1. The output file is also written in parallel. In addition to the technical parallelization, it provides three extensions over Grafic-1: it can produce power spectra with baryon wiggles (DJ Eisenstein and W. Hu, Ap. J. 496); it has the optional ability to load a lower resolution noise map corresponding to the low frequency component, which will fix the larger scale modes of the simulation (extra flag 0/1 at the end of the input process), in the spirit of Grafic-2; and it can be used in conjunction with constrfield, which generates initial conditions phases from a list of local constraints on density, tidal field density gradient and velocity.
Comparison of Origin 2000 and Origin 3000 Using NAS Parallel Benchmarks

NASA Technical Reports Server (NTRS)

Turney, Raymond D.

2001-01-01

This report describes results of benchmark tests on the Origin 3000 system currently being installed at the NASA Ames National Advanced Supercomputing facility. This machine will ultimately contain 1024 R14K processors. The first part of the system, installed in November, 2000 and named mendel, is an Origin 3000 with 128 R12K processors. For comparison purposes, the tests were also run on lomax, an Origin 2000 with R12K processors. The BT, LU, and SP application benchmarks in the NAS Parallel Benchmark Suite and the kernel benchmark FT were chosen to determine system performance and measure the impact of changes on the machine as it evolves. Having been written to measure performance on Computational Fluid Dynamics applications, these benchmarks are assumed appropriate to represent the NAS workload. Since the NAS runs both message passing (MPI) and shared-memory, compiler directive type codes, both MPI and OpenMP versions of the benchmarks were used. The MPI versions used were the latest official release of the NAS Parallel Benchmarks, version 2.3. The OpenMP versions used were PBN 3b2, a beta version that is in the process of being released. NPB 2.3 and PBN 3b2 are technically different benchmarks, and NPB results are not directly comparable to PBN results.
A distributed version of the NASA Engine Performance Program

NASA Technical Reports Server (NTRS)

Cours, Jeffrey T.; Curlett, Brian P.

1993-01-01

Distributed NEPP, a version of the NASA Engine Performance Program, uses the original NEPP code but executes it in a distributed computer environment. Multiple workstations connected by a network increase the program's speed and, more importantly, the complexity of the cases it can handle in a reasonable time. Distributed NEPP uses the public domain software package, called Parallel Virtual Machine, allowing it to execute on clusters of machines containing many different architectures. It includes the capability to link with other computers, allowing them to process NEPP jobs in parallel. This paper discusses the design issues and granularity considerations that entered into programming Distributed NEPP and presents the results of timing runs.
A massively parallel computational approach to coupled thermoelastic/porous gas flow problems

NASA Technical Reports Server (NTRS)

Shia, David; Mcmanus, Hugh L.

1995-01-01

A new computational scheme for coupled thermoelastic/porous gas flow problems is presented. Heat transfer, gas flow, and dynamic thermoelastic governing equations are expressed in fully explicit form, and solved on a massively parallel computer. The transpiration cooling problem is used as an example problem. The numerical solutions have been verified by comparison to available analytical solutions. Transient temperature, pressure, and stress distributions have been obtained. Small spatial oscillations in pressure and stress have been observed, which would be impractical to predict with previously available schemes. Comparisons between serial and massively parallel versions of the scheme have also been made. The results indicate that for small scale problems the serial and parallel versions use practically the same amount of CPU time. However, as the problem size increases the parallel version becomes more efficient than the serial version.
Quantitative Image Feature Engine (QIFE): an Open-Source, Modular Engine for 3D Quantitative Feature Extraction from Volumetric Medical Images.

PubMed

Echegaray, Sebastian; Bakr, Shaimaa; Rubin, Daniel L; Napel, Sandy

2017-10-06

The aim of this study was to develop an open-source, modular, locally run or server-based system for 3D radiomics feature computation that can be used on any computer system and included in existing workflows for understanding associations and building predictive models between image features and clinical data, such as survival. The QIFE exploits various levels of parallelization for use on multiprocessor systems. It consists of a managing framework and four stages: input, pre-processing, feature computation, and output. Each stage contains one or more swappable components, allowing run-time customization. We benchmarked the engine using various levels of parallelization on a cohort of CT scans presenting 108 lung tumors. Two versions of the QIFE have been released: (1) the open-source MATLAB code posted to GitHub, and (2) a compiled version loaded in a Docker container, posted to DockerHub, which can be easily deployed on any computer. The QIFE processed 108 objects (tumors) in 2:12 (h:mm) using one core, and in 1:04 (h:mm) using four cores with object-level parallelization. We developed the Quantitative Image Feature Engine (QIFE), an open-source feature-extraction framework that focuses on modularity, standards, parallelism, provenance, and integration. Researchers can easily integrate it with their existing segmentation and imaging workflows by creating input and output components that implement their existing interfaces. Computational efficiency can be improved by parallelizing execution at the cost of memory usage. Different parallelization levels provide different trade-offs, and the optimal setting will depend on the size and composition of the dataset to be processed.
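
The two timings reported in this abstract can be turned into a speedup and a parallel efficiency with a short calculation. The sketch below assumes the reported values are hours and minutes (2:12 and 1:04); the helper names are hypothetical and not part of the QIFE.

```python
def to_minutes(h_mm):
    """Convert an 'h:mm' string into a number of minutes."""
    hours, minutes = h_mm.split(":")
    return int(hours) * 60 + int(minutes)


def speedup_and_efficiency(serial, parallel, workers):
    """Speedup is serial time over parallel time; efficiency is speedup per worker."""
    speedup = to_minutes(serial) / to_minutes(parallel)
    return speedup, speedup / workers


# Timings quoted in the abstract: one core versus four cores.
speedup, efficiency = speedup_and_efficiency("2:12", "1:04", 4)
print(f"speedup {speedup:.4f}, efficiency {efficiency:.4f}")
```

Under that reading, four cores give a speedup of about 2.06, or roughly half of the ideal fourfold gain, which is consistent with the abstract's remark that different parallelization levels carry different trade-offs.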
Single product lot-sizing on unrelated parallel machines with non-decreasing processing times

NASA Astrophysics Data System (ADS)

Eremeev, A.; Kovalyov, M.; Kuznetsov, P.

2018-01-01

We consider a problem in which at least a given quantity of a single product has to be partitioned into lots, and lots have to be assigned to unrelated parallel machines for processing. In one version of the problem, the maximum machine completion time should be minimized, in another version of the problem, the sum of machine completion times is to be minimized. Machine-dependent lower and upper bounds on the lot size are given. The product is either assumed to be continuously divisible or discrete. The processing time of each machine is defined by an increasing function of the lot volume, given as an oracle. Setup times and costs are assumed to be negligibly small, and therefore, they are not considered. We derive optimal polynomial time algorithms for several special cases of the problem. An NP-hard case is shown to admit a fully polynomial time approximation scheme. An application of the problem in energy efficient processors scheduling is considered.
Computer-Aided Parallelizer and Optimizer

NASA Technical Reports Server (NTRS)

Jin, Haoqiang

2011-01-01

The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

NASA Astrophysics Data System (ADS)

Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

2012-11-01

In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performances and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application version features near to linear speed-up progression when using only the CPU cores. It executes more than 20 times faster when additionally using the GPU.
Comparing an FPGA to a Cell for an Image Processing Application

NASA Astrophysics Data System (ADS)

Rakvic, Ryan N.; Ngo, Hau; Broussard, Randy P.; Ives, Robert W.

2010-12-01

Modern advancements in configurable hardware, most notably Field-Programmable Gate Arrays (FPGAs), have provided an exciting opportunity to discover the parallel nature of modern image processing algorithms. On the other hand, PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high performance. In this research project, our aim is to study the differences in performance of a modern image processing algorithm on these two hardware platforms. In particular, Iris Recognition Systems have recently become an attractive identification method because of their extremely high accuracy. Iris matching, a repeatedly executed portion of a modern iris recognition algorithm, is parallelized on an FPGA system and a Cell processor. We demonstrate a 2.5 times speedup of the parallelized algorithm on the FPGA system when compared to a Cell processor-based version.
Parallel Gaussian elimination of a block tridiagonal matrix using multiple microcomputers

NASA Technical Reports Server (NTRS)

Blech, Richard A.

1989-01-01

The solution of a block tridiagonal matrix using parallel processing is demonstrated. The multiprocessor system on which results were obtained and the software environment used to program that system are described. Theoretical partitioning and resource allocation for the Gaussian elimination method used to solve the matrix are discussed. The results obtained from running 1, 2 and 3 processor versions of the block tridiagonal solver are presented. The PASCAL source code for these solvers is given in the appendix, and may be transportable to other shared memory parallel processors provided that the synchronization outlines are reproduced on the target system.
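
Gaussian elimination on a tridiagonal system reduces to one forward elimination pass and one back substitution pass. The sketch below shows the scalar case, that is, blocks of size 1x1, in plain Python; it is a serial illustration under that assumption and does not reproduce the report's PASCAL code, its block operations, or its partitioning across processors.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Gaussian elimination without pivoting for a tridiagonal system.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    diag[i] multiplies x[i], and upper[i] multiplies x[i+1]
    (upper[-1] is unused). Suitable for diagonally dominant systems.
    """
    n = len(diag)
    d = list(diag)
    r = list(rhs)
    # Forward elimination: remove the sub-diagonal entry of each row.
    for i in range(1, n):
        factor = lower[i] / d[i - 1]
        d[i] -= factor * upper[i - 1]
        r[i] -= factor * r[i - 1]
    # Back substitution from the last row upward.
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x


# Small illustrative system (an assumption, not data from the report).
print(solve_tridiagonal([0.0, -1.0, -1.0], [4.0, 4.0, 4.0],
                        [-1.0, -1.0, 0.0], [2.0, 4.0, 10.0]))
```

In the block version the scalar divisions become solves with small dense blocks and the multiplications become block products, but the two-pass structure stays the same.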
Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous MultiCore Systems

DTIC Science & Technology

2017-04-13

modelling code, a parallel benchmark , and a communication avoiding version of the QR algorithm. Further, several improvements to the OmpSs model were...movement; and a port of the dynamic load balancing library to OmpSs. Finally, several updates to the tools infrastructure were accomplished, including: an...OmpSs: a basic algorithm on image processing applications, a mini application representative of an ocean modelling code, a parallel benchmark , and a
Application of a Scalable, Parallel, Unstructured-Grid-Based Navier-Stokes Solver

NASA Technical Reports Server (NTRS)

Parikh, Paresh

2001-01-01

A parallel version of an unstructured-grid based Navier-Stokes solver, USM3Dns, previously developed for efficient operation on a variety of parallel computers, has been enhanced to incorporate upgrades made to the serial version. The resultant parallel code has been extensively tested on a variety of problems of aerospace interest and on two sets of parallel computers to understand and document its characteristics. An innovative grid renumbering construct and use of non-blocking communication are shown to produce superlinear computing performance. Preliminary results from parallelization of a recently introduced "porous surface" boundary condition are also presented.
Evaluation of a parallel implementation of the learning portion of the backward error propagation neural network: experiments in artifact identification.

PubMed Central

Sittig, D. F.; Orr, J. A.

1991-01-01

Various methods have been proposed in an attempt to solve problems in artifact and/or alarm identification including expert systems, statistical signal processing techniques, and artificial neural networks (ANN). ANNs consist of a large number of simple processing units connected by weighted links. To develop truly robust ANNs, investigators are required to train their networks on huge training data sets, requiring enormous computing power. We implemented a parallel version of the backward error propagation neural network training algorithm in the widely portable parallel programming language C-Linda. A maximum speedup of 4.06 was obtained with six processors. This speedup represents a reduction in total run-time from approximately 6.4 hours to 1.5 hours. We conclude that use of the master-worker model of parallel computation is an excellent method for obtaining speedups in the backward error propagation neural network training algorithm. PMID:1807607
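
The master-worker model mentioned in this abstract can be sketched as follows: the master splits the training set into shards, each worker computes the error gradient on its shard, and the master sums the partial gradients and updates the weights. The example is a deliberately tiny stand-in, a single linear unit with one weight, and it uses a Python thread pool in place of C-Linda's tuple space; the model, the names, and the learning rate are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_gradient(weight, shard):
    """Worker task: squared-error gradient of a linear unit over one data shard."""
    return sum((weight * x - target) * x for x, target in shard)


def train(samples, workers=3, epochs=200, rate=0.01):
    """Master loop: hand shards to workers, sum their gradients, update the weight."""
    shards = [samples[i::workers] for i in range(workers)]
    weight = 0.0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(epochs):
            grads = pool.map(partial_gradient, [weight] * len(shards), shards)
            weight -= rate * sum(grads)
    return weight


# Synthetic data generated from target = 2 * x (illustrative only).
samples = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)]
print(train(samples))
```

Because the gradient is a sum over training samples, the partial sums can be computed independently and combined in any order, which is what makes this algorithm a good fit for the master-worker pattern. Threads in CPython illustrate the structure only; they do not by themselves provide the kind of speedup reported in the abstract.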
Classification of hyperspectral imagery using MapReduce on a NVIDIA graphics processing unit (Conference Presentation)

NASA Astrophysics Data System (ADS)

Ramirez, Andres; Rahnemoonfar, Maryam

2017-04-01

A hyperspectral image is a data-rich, multidimensional image consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such an image with linear and non-linear algorithms results in high computational time. To overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that helps analyze a hyperspectral image through the use of parallel hardware and a parallel programming model, which is simpler to handle than other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy and timing results between the Hadoop and GPU systems and tested them against the following test cases: the CPU and GPU test case, a CPU test case, and a test case where no dimensional reduction was applied.
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors

NASA Technical Reports Server (NTRS)

Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)

1998-01-01

This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.

Modifying the test of understanding graphs in kinematics

NASA Astrophysics Data System (ADS)

Zavala, Genaro; Tejeda, Santa; Barniol, Pablo; Beichner, Robert J.

2017-12-01

In this article, we present several modifications to the Test of Understanding Graphs in Kinematics. The most significant changes are (i) the addition and removal of items to achieve parallelism in the objectives (dimensions) of the test, thus allowing comparisons of students' performance that were not possible with the original version, and (ii) changes to the distractors of some of the original items that represent the most frequent alternative conceptions. The final modified version (after an iterative process involving four administrations of test variations over two years) was administered to 471 students of an introductory university physics course at a large private university in Mexico. When analyzing the final modified version of the test it was found that the added items satisfied the statistical tests of difficulty, discriminatory power, and reliability; also, that the great majority of the modified distractors were effective in terms of their frequency selection and discriminatory power; and, that the final modified version of the test satisfied the reliability and discriminatory power criteria as well as the original test. Here, we also show the use of the new version of the test, presenting a new analysis of students' understanding not possible to do before with the original version of the test, specifically regarding the objectives and items that in the new version meet parallelisms. Finally, in the PhysPort project (physport.org), we present the final modified version of the test. It can be used by teachers and researchers to assess students' understanding of graphs in kinematics, as well as their learning about them.
[Not Available].

PubMed

Brosseau, Lucie; Laroche, Chantal; Sutton, Anne; Guitard, Paulette; King, Judy; Poitras, Stéphane; Casimiro, Lynn; Tremblay, Manon; Cardinal, Dominique; Cavallo, Sabrina; Laferrière, Lucie; Grisé, Isabelle; Marshall, Lisa; Smith, Jacky R; Lagacé, Josée; Pharand, Denyse; Galipeau, Roseline; Toupin-April, Karine; Loew, Laurianne; Demers, Catrine; Sauvé-Schenk, Katrine; Paquet, Nicole; Savard, Jacinthe; Tourigny, Jocelyne; Vaillancourt, Véronique

2015-08-01

To prepare a Canadian French translation of the PEDro Scale under the proposed name l'Échelle PEDro, and to examine the validity of its content. A modified approach of Vallerand's cross-cultural validation methodology was used, beginning with a parallel back-translation of the PEDro scale by both professional translators and clinical researchers. These versions were reviewed by an initial panel of experts (P1), who then created the first experimental version of l'Échelle PEDro. This version was evaluated by a second panel of experts (P2). Finally, 32 clinical researchers evaluated the second experimental version of l'Échelle PEDro, using a 5-point clarity scale, and suggested final modifications. The various items on the final version of l'Échelle PEDro show a high degree of clarity (from 4.0 to 4.7 on the 5-point scale). The four rigorous steps of the translation process have produced a valid Canadian French version of the PEDro scale.
Performance Improvements of the CYCOFOS Flow Model

NASA Astrophysics Data System (ADS)

Radhakrishnan, Hari; Moulitsas, Irene; Syrakos, Alexandros; Zodiatis, George; Nikolaides, Andreas; Hayes, Daniel; Georgiou, Georgios C.

2013-04-01

The CYCOFOS-Cyprus Coastal Ocean Forecasting and Observing System has been operational since early 2002, providing daily sea current, temperature, salinity and sea level forecasting data for the next 4 and 10 days to end-users in the Levantine Basin, necessary for operational application in marine safety, particularly concerning oil spill and floating object predictions. The CYCOFOS flow model, similar to most of the coastal and sub-regional operational hydrodynamic forecasting systems of the MONGOOS-Mediterranean Oceanographic Network for Global Ocean Observing System, is based on the POM-Princeton Ocean Model. CYCOFOS is nested with the MyOcean Mediterranean regional forecasting data and with SKIRON and ECMWF for surface forcing. The increasing demand for ever higher resolution data to meet coastal and offshore downstream applications motivated the parallelization of the CYCOFOS POM model. This development was carried out in the frame of the IPcycofos project, funded by the Cyprus Research Promotion Foundation. Parallel processing provides a viable solution to satisfy these demands without sacrificing accuracy or omitting any physical phenomena. Prior to the IPcycofos project, there have been several attempts to parallelise the POM, for example the MP-POM. The existing parallel code models rely on specific outdated hardware architectures and associated software. The objective of the IPcycofos project is to produce an operational parallel version of the CYCOFOS POM code that can replicate the results of the serial version of the POM code used in CYCOFOS. The parallelization of the CYCOFOS POM model uses the Message Passing Interface (MPI), implemented on commodity computing clusters running open source software and not depending on any specialized vendor hardware. The parallel CYCOFOS POM code is constructed in a modular fashion, allowing a fast re-locatable downscaled implementation.
The MPI implementation takes advantage of the Cartesian nature of the POM mesh and uses the built-in functionality of MPI routines to split the mesh, using a weighting scheme, along longitude and latitude among the processors. Each server processor works on the model based on domain decomposition techniques. The new parallel CYCOFOS POM code has been benchmarked against the serial POM version of CYCOFOS for speed, accuracy, and resolution, and the results are more than satisfactory. With a higher resolution CYCOFOS Levantine model domain, the forecasts need much less time than the coarser serial CYCOFOS POM version, both with identical accuracy.
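The mesh split described above can be illustrated with a minimal sketch. It is not the IPcycofos code: it uses plain Python instead of MPI, the function names `split_axis` and `decompose` are hypothetical, and the weighting scheme mentioned in the abstract is not described there, so an unweighted, near-equal split is assumed.

```python
def split_axis(n_cells, n_parts):
    """Split n_cells into n_parts contiguous (start, stop) index ranges.

    Assumption: an unweighted split; any remainder cells go to the first blocks.
    """
    base, extra = divmod(n_cells, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def decompose(n_lon, n_lat, p_lon, p_lat):
    """Map each rank of a p_lon x p_lat process grid to its sub-domain."""
    lon_ranges = split_axis(n_lon, p_lon)
    lat_ranges = split_axis(n_lat, p_lat)
    layout = {}
    for i, lon in enumerate(lon_ranges):
        for j, lat in enumerate(lat_ranges):
            rank = i * p_lat + j  # row-major rank numbering (an assumption)
            layout[rank] = {"lon": lon, "lat": lat}
    return layout


# Hypothetical 10 x 7 mesh distributed over a 2 x 3 process grid.
layout = decompose(n_lon=10, n_lat=7, p_lon=2, p_lat=3)
for rank, block in sorted(layout.items()):
    print(rank, block)
```

In an actual MPI code each rank would compute only its own block and exchange boundary cells with its neighbors; the sketch shows only the index bookkeeping.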
Parallelization of KENO-Va Monte Carlo code

NASA Astrophysics Data System (ADS)

Ramón, Javier; Peña, Jorge

1995-07-01

KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code have been generated: one for shared memory machines and the other for distributed memory systems using the message-passing interface PVM. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared memory version. An FDDI network of 6 HP9000/735 machines was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.
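One way to read the reproducibility remark above is that each neutron history receives its own predetermined seed, so the result does not depend on which worker tracks it. The sketch below illustrates that idea only; it is not KENO-Va's scheme. The toy physics (a fixed 30% absorption probability per collision), the seed formula, and the names `track_history` and `run_generation` are all assumptions, and Python threads are used purely to show the bookkeeping, not to gain speed.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def track_history(task):
    """Track one toy neutron history and return its number of collisions."""
    generation, index = task
    # Hypothetical seeding rule: one independent generator per history.
    rng = random.Random(generation * 1_000_003 + index)
    collisions = 0
    while rng.random() > 0.3:  # toy model: 30% absorption chance per collision
        collisions += 1
    return collisions


def run_generation(generation, n_histories, n_workers):
    """Track all histories of one generation on n_workers threads."""
    tasks = [(generation, i) for i in range(n_histories)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in task order, whatever the scheduling was.
        return list(pool.map(track_history, tasks))


serial = run_generation(generation=0, n_histories=1000, n_workers=1)
parallel = run_generation(generation=0, n_histories=1000, n_workers=4)
print(serial == parallel)
```

Because every history owns its generator, the per-history results are the same for one worker or four, which is the property the abstract attributes to its seed handling.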
Analysis of parameters for technological equipment of parallel kinematics based on rods of variable length for processing accuracy assurance

NASA Astrophysics Data System (ADS)

Koltsov, A. G.; Shamutdinov, A. H.; Blokhin, D. A.; Krivonos, E. V.

2018-01-01

A new classification of parallel kinematics mechanisms by symmetry coefficient is proposed; this coefficient is proportional to the mechanism stiffness and to the accuracy of the product processed with the technological equipment under study. A new version of the Stewart platform with a high symmetry coefficient is presented for analysis. The workspace of the mechanism under study is described; this space is a complex solid figure. The workspace end points are reached by the center of the mobile platform, which moves parallel to the base plate. Parameters affecting the processing accuracy, namely the static and dynamic stiffness and the natural vibration frequencies, are determined. The capability of the mechanism to operate under various loads was assessed, taking into account resonance phenomena at different points of the workspace. The study showed that the stiffness, and therefore the processing accuracy, of the above-mentioned mechanisms is comparable with the stiffness and accuracy of medium-sized series-produced machines.
Managing Algorithmic Skeleton Nesting Requirements in Realistic Image Processing Applications: The Case of the SKiPPER-II Parallel Programming Environment's Operating Model

NASA Astrophysics Data System (ADS)

Coudarcher, Rémi; Duculty, Florent; Serot, Jocelyn; Jurie, Frédéric; Derutin, Jean-Pierre; Dhome, Michel

2005-12-01

SKiPPER is a SKeleton-based Parallel Programming EnviRonment under development since 1996 at LASMEA Laboratory, the Blaise-Pascal University, France. The main goal of the project was to demonstrate the applicability of skeleton-based parallel programming techniques to the fast prototyping of reactive vision applications. This paper deals with the special features embedded in the latest version of the project: algorithmic skeleton nesting capabilities and a fully dynamic operating model. Through the case study of a complete and realistic image processing application, in which we have pointed out the requirement for skeleton nesting, we present the operating model of this feature. The work described here is one of the few reported experiments showing the application of skeleton nesting facilities for the parallelisation of a realistic application, especially in the area of image processing. The image processing application we have chosen is a 3D face-tracking algorithm from appearance.
Microphone Array Phased Processing System (MAPPS): Version 4.0 Manual

NASA Technical Reports Server (NTRS)

Watts, Michael E.; Mosher, Marianne; Barnes, Michael; Bardina, Jorge

1999-01-01

A processing system has been developed to meet increasing demands for detailed noise measurement of individual model components. The Microphone Array Phased Processing System (MAPPS) uses graphical user interfaces to control all aspects of data processing and visualization. The system uses networked parallel computers to provide noise maps at selected frequencies in a near real-time testing environment. The system has been successfully used in the NASA Ames 7- by 10-Foot Wind Tunnel.
cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing.

PubMed

Takeuchi, Toshiki; Yamada, Atsuo; Aoki, Takashi; Nishimura, Kunihiro

2016-01-01

Next-generation sequencing can determine DNA bases and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format and the compressed binary version (BAM) of it. SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and can execute fast. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. For the accumulation of next-generation sequencing data, a simple parallelization program, which can support cloud and PC cluster environments, is required. We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., Windows, Linux, Mac OS X) with Clojure. Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.
Running ATLAS workloads within massively parallel distributed applications using Athena Multi-Process framework (AthenaMP)

NASA Astrophysics Data System (ADS)

Calafiura, Paolo; Leggett, Charles; Seuster, Rolf; Tsulaia, Vakhtang; Van Gemmeren, Peter

2015-12-01

AthenaMP is a multi-process version of the ATLAS reconstruction, simulation and data analysis framework Athena. By leveraging Linux fork and copy-on-write mechanisms, it allows for sharing of memory pages between event processors running on the same compute node with little to no change in the application code. Originally targeted to optimize the memory footprint of reconstruction jobs, AthenaMP has demonstrated that it can reduce the memory usage of certain configurations of ATLAS production jobs by a factor of 2. AthenaMP has also evolved to become the parallel event-processing core of the recently developed ATLAS infrastructure for fine-grained event processing (Event Service) which allows the running of AthenaMP inside massively parallel distributed applications on hundreds of compute nodes simultaneously. We present the architecture of AthenaMP, various strategies implemented by AthenaMP for scheduling workload to worker processes (for example: Shared Event Queue and Shared Distributor of Event Tokens) and the usage of AthenaMP in the diversity of ATLAS event processing workloads on various computing resources: Grid, opportunistic resources and HPC.
GRAVIDY, a GPU modular, parallel direct-summation N-body integrator: dynamics with softening

NASA Astrophysics Data System (ADS)

Maureira-Fredes, Cristián; Amaro-Seoane, Pau

2018-01-01

A wide variety of outstanding problems in astrophysics involve the motion of a large number of particles under the force of gravity. These include the global evolution of globular clusters, tidal disruptions of stars by a massive black hole, the formation of protoplanets and sources of gravitational radiation. The direct-summation of N gravitational forces is a complex problem with no analytical solution and can only be tackled with approximations and numerical methods. To this end, the Hermite scheme is a widely used integration method. With different numerical techniques and special-purpose hardware, it can be used to speed up the calculations. But these methods tend to be computationally slow and cumbersome to work with. We present a new graphics processing unit (GPU), direct-summation N-body integrator written from scratch and based on this scheme, which includes relativistic corrections for sources of gravitational radiation. GRAVIDY has high modularity, allowing users to readily introduce new physics, it exploits available computational resources and will be maintained by regular updates. GRAVIDY can be used in parallel on multiple CPUs and GPUs, with a considerable speed-up benefit. The single-GPU version is between one and two orders of magnitude faster than the single-CPU version. A test run using four GPUs in parallel shows a speed-up factor of about 3 as compared to the single-GPU version. The conception and design of this first release is aimed at users with access to traditional parallel CPU clusters or computational nodes with one or a few GPU cards.
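The direct-summation step mentioned above can be sketched in a few lines. This is not GRAVIDY code: it computes only the pairwise accelerations, a_i = G Σ_{j≠i} m_j (r_j − r_i) / (|r_j − r_i|² + ε²)^{3/2}, and omits the Hermite integration and relativistic corrections. The Plummer-style softening term ε, the units with G = 1, and the function name `accelerations` are assumptions.

```python
import math


def accelerations(positions, masses, softening=1e-3, G=1.0):
    """Return the softened gravitational acceleration on every particle.

    Cost is O(N^2): every particle sums the pull of every other particle.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    eps2 = softening * softening
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Separation vector pointing from particle i to particle j.
            dx = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + eps2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * masses[j] * dx[k] * inv_r3
    return acc


# Two unit masses one length unit apart attract each other along x.
acc = accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0], softening=0.0)
print(acc)
```

The inner double loop is independent for each `i`, which is what makes the method a natural fit for GPUs: each thread can accumulate the acceleration of one particle.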
Parallel-vector computation for linear structural analysis and non-linear unconstrained optimization problems

NASA Technical Reports Server (NTRS)

Nguyen, D. T.; Al-Nasra, M.; Zhang, Y.; Baddourah, M. A.; Agarwal, T. K.; Storaasli, O. O.; Carmona, E. A.

1991-01-01

Several parallel-vector computational improvements to the unconstrained optimization procedure are described which speed up the structural analysis-synthesis process. A fast parallel-vector Choleski-based equation solver, pvsolve, is incorporated into the well-known SAP-4 general-purpose finite-element code. The new code, denoted PV-SAP, is tested for static structural analysis. Initial results on a four processor CRAY 2 show that using pvsolve reduces the equation solution time by a factor of 14-16 over the original SAP-4 code. In addition, parallel-vector procedures for the Golden Block Search technique and the BFGS method are developed and tested for nonlinear unconstrained optimization. A parallel version of an iterative solver and the pvsolve direct solver are incorporated into the BFGS method. Preliminary results on nonlinear unconstrained optimization test problems, using pvsolve in the analysis, show excellent parallel-vector performance indicating that these parallel-vector algorithms can be used in a new generation of finite-element based structural design/analysis-synthesis codes.
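For readers unfamiliar with Choleski-based equation solving, the following is a minimal serial sketch of the textbook method: factor a symmetric positive-definite matrix as A = L Lᵀ, then solve by forward and back substitution. It is not pvsolve, whose parallel-vector algorithm the abstract does not describe; the function names `cholesky` and `solve` and the small test system are assumptions.

```python
import math


def cholesky(a):
    """Return the lower-triangular factor L with A = L * L^T.

    Assumes `a` is a symmetric positive-definite matrix given as nested lists.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(a[i][i] - s)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def solve(a, b):
    """Solve A x = b using the Cholesky factor of A."""
    lower = cholesky(a)
    n = len(b)
    # Forward substitution: L y = b.
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i]
    # Back substitution: L^T x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x


# Hypothetical 2 x 2 system standing in for a stiffness matrix and load vector.
x = solve([[4.0, 2.0], [2.0, 2.0]], [8.0, 5.0])
print(x)
```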
Time Series Discord Detection in Medical Data using a Parallel Relational Database

DOE Office of Scientific and Technical Information (OSTI.GOV)

Woodbridge, Diane; Rintoul, Mark Daniel; Wilson, Andrew T.

Recent advances in sensor technology have made continuous real-time health monitoring available in both hospital and non-hospital settings. Since high frequency medical sensors produce a huge amount of data, storing and processing continuous medical data is an emerging big data area. In particular, detecting anomalies in real time is important for detecting and preventing patient emergencies. A time series discord is a subsequence that has the maximum difference to the rest of the time series subsequences, meaning that it has abnormal or unusual data trends. In this study, we implemented two versions of time series discord detection algorithms on a high performance parallel database management system (DBMS) and applied them to 240 Hz waveform data collected from 9,723 patients. The initial brute force version of the discord detection algorithm takes each possible subsequence and calculates a distance to the nearest non-self match to find the biggest discords in time series. For the heuristic version of the algorithm, a combination of an array and a trie structure was applied to order time series data for enhancing time efficiency. The study results showed efficient data loading, decoding and discord searches in a large amount of data, benefiting from the time series discord detection algorithm and the architectural characteristics of the parallel DBMS including data compression, data pipe-lining, and task scheduling.
Time Series Discord Detection in Medical Data using a Parallel Relational Database [PowerPoint

DOE Office of Scientific and Technical Information (OSTI.GOV)

Woodbridge, Diane; Wilson, Andrew T.; Rintoul, Mark Daniel

Recent advances in sensor technology have made continuous real-time health monitoring available in both hospital and non-hospital settings. Since high frequency medical sensors produce a huge amount of data, storing and processing continuous medical data is an emerging big data area. In particular, detecting anomalies in real time is important for detecting and preventing patient emergencies. A time series discord is a subsequence that has the maximum difference to the rest of the time series subsequences, meaning that it has abnormal or unusual data trends. In this study, we implemented two versions of time series discord detection algorithms on a high performance parallel database management system (DBMS) and applied them to 240 Hz waveform data collected from 9,723 patients. The initial brute force version of the discord detection algorithm takes each possible subsequence and calculates a distance to the nearest non-self match to find the biggest discords in time series. For the heuristic version of the algorithm, a combination of an array and a trie structure was applied to order time series data for enhancing time efficiency. The study results showed efficient data loading, decoding and discord searches in a large amount of data, benefiting from the time series discord detection algorithm and the architectural characteristics of the parallel DBMS including data compression, data pipe-lining, and task scheduling.
Advanced mathematical on-line analysis in nuclear experiments. Usage of parallel computing CUDA routines in standard root analysis

NASA Astrophysics Data System (ADS)

Grzeszczuk, A.; Kowalski, S.

2015-04-01

Compute Unified Device Architecture (CUDA) is a parallel computing platform developed by Nvidia to increase the speed of graphics by performing calculations in parallel. The success of this solution has opened General-Purpose Graphics Processing Unit (GPGPU) technology to applications not related to graphics. A GPGPU system can be applied as an effective tool for reducing the huge volume of data in pulse shape analysis measurements, either by on-line recalculation or by a very fast compression system. Our poster contribution presents the simplified structure of the CUDA system and its programming model, using an Nvidia GeForce GTX580 card as an example, both in a stand-alone version and as a ROOT application.
A parallel finite-difference method for computational aerodynamics

NASA Technical Reports Server (NTRS)

Swisshelm, Julie M.

1989-01-01

A finite-difference scheme for solving complex three-dimensional aerodynamic flow on parallel-processing supercomputers is presented. The method consists of a basic flow solver with multigrid convergence acceleration, embedded grid refinements, and a zonal equation scheme. Multitasking and vectorization have been incorporated into the algorithm. Results obtained include multiprocessed flow simulations from the Cray X-MP and Cray-2. Speedups as high as 3.3 for the two-dimensional case and 3.5 for segments of the three-dimensional case have been achieved on the Cray-2. The entire solver attained a factor of 2.7 improvement over its unitasked version on the Cray-2. The performance of the parallel algorithm on each machine is analyzed.
Performance Modeling and Measurement of Parallelized Code for Distributed Shared Memory Multiprocessors

NASA Technical Reports Server (NTRS)

Waheed, Abdul; Yan, Jerry

1998-01-01

This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
Accelerating the Pace of Protein Functional Annotation With Intel Xeon Phi Coprocessors.

PubMed

Feinstein, Wei P; Moreno, Juana; Jarrell, Mark; Brylinski, Michal

2015-06-01

Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.
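As a quick consistency check on the figures above, speedup is the serial run time divided by the parallel run time. The sketch below applies that definition to the wall-clock times quoted for the 501-protein benchmark; the function name `speedup` is an assumption, and the ratios are only expected to land near the quoted speedups because the published times are rounded.

```python
def speedup(serial_time, parallel_time):
    """Speedup = serial run time / parallel run time (same units)."""
    return serial_time / parallel_time


serial_hours = 36.8          # serial version, from the abstract
cpu_only_hours = 3.1         # parallel run without the accelerator
cpu_plus_phi_hours = 2.1     # parallel run with the Xeon Phi card

cpu_only = speedup(serial_hours, cpu_only_hours)
cpu_plus_phi = speedup(serial_hours, cpu_plus_phi_hours)
print(f"CPU only:      {cpu_only:.1f}x")
print(f"CPU + Xeon Phi: {cpu_plus_phi:.1f}x")
```

Both ratios come out close to the 11.8 and 17.6 reported in the abstract.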
Implementing parallel spreadsheet models for health policy decisions: The impact of unintentional errors on model projections

PubMed Central

Bailey, Stephanie L.; Bono, Rose S.; Nash, Denis; Kimmel, April D.

2018-01-01

Background Spreadsheet software is increasingly used to implement systems science models informing health policy decisions, both in academia and in practice where technical capacity may be limited. However, spreadsheet models are prone to unintentional errors that may not always be identified using standard error-checking techniques. Our objective was to illustrate, through a methodologic case study analysis, the impact of unintentional errors on model projections by implementing parallel model versions. Methods We leveraged a real-world need to revise an existing spreadsheet model designed to inform HIV policy. We developed three parallel versions of a previously validated spreadsheet-based model; versions differed by the spreadsheet cell-referencing approach (named single cells; column/row references; named matrices). For each version, we implemented three model revisions (re-entry into care; guideline-concordant treatment initiation; immediate treatment initiation). After standard error-checking, we identified unintentional errors by comparing model output across the three versions. Concordant model output across all versions was considered error-free. We calculated the impact of unintentional errors as the percentage difference in model projections between model versions with and without unintentional errors, using +/-5% difference to define a material error. Results We identified 58 original and 4,331 propagated unintentional errors across all model versions and revisions. Over 40% (24/58) of original unintentional errors occurred in the column/row reference model version; most (23/24) were due to incorrect cell references. Overall, >20% of model spreadsheet cells had material unintentional errors. When examining error impact along the HIV care continuum, the percentage difference between versions with and without unintentional errors ranged from +3% to +16% (named single cells), +26% to +76% (column/row reference), and 0% (named matrices). 
Conclusions Standard error-checking techniques may not identify all errors in spreadsheet-based models. Comparing parallel model versions can aid in identifying unintentional errors and promoting reliable model projections, particularly when resources are limited. PMID:29570737
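The error-impact measure described above is a percentage difference between projections from versions with and without unintentional errors, with ±5% marking a material error. A minimal sketch follows; the function names, the choice to treat exactly 5% as material, and the example projection values are assumptions, not figures from the study.

```python
def percent_difference(with_error, reference):
    """Percentage difference of an erroneous projection from the error-free one."""
    return 100.0 * (with_error - reference) / reference


def is_material(with_error, reference, threshold=5.0):
    """Flag a material error. Assumption: a difference of exactly 5% counts."""
    return abs(percent_difference(with_error, reference)) >= threshold


# Hypothetical projections (e.g. people retained in care) from two versions.
reference = 1000.0
for label, value in [("version A", 1030.0), ("version B", 1260.0)]:
    diff = percent_difference(value, reference)
    print(f"{label}: {diff:+.0f}% material={is_material(value, reference)}")
```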
Implementing parallel spreadsheet models for health policy decisions: The impact of unintentional errors on model projections.

PubMed

Bailey, Stephanie L; Bono, Rose S; Nash, Denis; Kimmel, April D

2018-01-01

Spreadsheet software is increasingly used to implement systems science models informing health policy decisions, both in academia and in practice where technical capacity may be limited. However, spreadsheet models are prone to unintentional errors that may not always be identified using standard error-checking techniques. Our objective was to illustrate, through a methodologic case study analysis, the impact of unintentional errors on model projections by implementing parallel model versions. We leveraged a real-world need to revise an existing spreadsheet model designed to inform HIV policy. We developed three parallel versions of a previously validated spreadsheet-based model; versions differed by the spreadsheet cell-referencing approach (named single cells; column/row references; named matrices). For each version, we implemented three model revisions (re-entry into care; guideline-concordant treatment initiation; immediate treatment initiation). After standard error-checking, we identified unintentional errors by comparing model output across the three versions. Concordant model output across all versions was considered error-free. We calculated the impact of unintentional errors as the percentage difference in model projections between model versions with and without unintentional errors, using +/-5% difference to define a material error. We identified 58 original and 4,331 propagated unintentional errors across all model versions and revisions. Over 40% (24/58) of original unintentional errors occurred in the column/row reference model version; most (23/24) were due to incorrect cell references. Overall, >20% of model spreadsheet cells had material unintentional errors. When examining error impact along the HIV care continuum, the percentage difference between versions with and without unintentional errors ranged from +3% to +16% (named single cells), +26% to +76% (column/row reference), and 0% (named matrices). 
Standard error-checking techniques may not identify all errors in spreadsheet-based models. Comparing parallel model versions can aid in identifying unintentional errors and promoting reliable model projections, particularly when resources are limited.
miTRATA: a web-based tool for microRNA Truncation and Tailing Analysis.

PubMed

Patel, Parth; Ramachandruni, S Deepthi; Kakrana, Atul; Nakano, Mayumi; Meyers, Blake C

2016-02-01

We describe miTRATA, the first web-based tool for microRNA Truncation and Tailing Analysis: the analysis of 3' modifications of microRNAs, including the loss or gain of nucleotides relative to the canonical sequence. miTRATA is implemented in Python (version 3) and employs parallel processing modules to enhance its scalability when analyzing multiple small RNA (sRNA) sequencing datasets. It uses miRBase, currently version 21, as a source of known microRNAs for analysis. miTRATA notifies users via email when the results are ready to download and to visualize online. miTRATA's strengths lie in (i) its biologist-focused web interface, (ii) improved scalability via parallel processing and (iii) its uniqueness as a web tool for microRNA truncation and tailing analysis. miTRATA is developed in Python and PHP. It is available as a web-based application from https://wasabi.dbi.udel.edu/~apps/ta/. Contact: meyers@dbi.udel.edu. Supplementary data are available at Bioinformatics online.
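
To make the idea of truncation and tailing concrete, here is a minimal sketch, not miTRATA's actual code. It assumes a read is compared with the canonical sequence from the 5' end: nucleotides missing from the canonical 3' end count as truncation, and any remaining read nucleotides count as the tail. All function names are hypothetical, and a thread pool stands in for whatever parallel modules the tool uses (a process pool would be the more usual choice for CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def truncation_and_tail(read, canonical):
    """Return (number of canonical 3' nucleotides lost, non-templated 3' tail)."""
    shared = 0  # length of the 5' prefix shared by the read and the canonical sequence
    for a, b in zip(read, canonical):
        if a != b:
            break
        shared += 1
    truncated = len(canonical) - shared
    tail = read[shared:]
    return truncated, tail


def analyze_dataset(reads, canonical):
    """Analyze every read of one dataset."""
    return [truncation_and_tail(r, canonical) for r in reads]


def analyze_datasets(datasets, canonical, workers=2):
    """Analyze several datasets concurrently, one task per dataset."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda reads: analyze_dataset(reads, canonical), datasets.values())
        return dict(zip(datasets.keys(), results))


# Illustrative input: a 22-nt canonical sequence and two hypothetical datasets.
canonical = "UGAGGUAGUAGGUUGUAUAGUU"
datasets = {
    "sample_1": ["UGAGGUAGUAGGUUGUAUAGUUU"],  # full length plus a one-nucleotide tail
    "sample_2": ["UGAGGUAGUAGGUUGUAUAAA"],    # truncated by three, then tailed
}
summary = analyze_datasets(datasets, canonical)
```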

Parallel grid library for rapid and flexible simulation development

NASA Astrophysics Data System (ADS)

Honkonen, I.; von Alfthan, S.; Sandroos, A.; Janhunen, P.; Palmroth, M.

2013-04-01

We present an easy to use and flexible grid library for developing highly scalable parallel simulations. The distributed cartesian cell-refinable grid (dccrg) supports adaptive mesh refinement and allows an arbitrary C++ class to be used as cell data. The amount of data in grid cells can vary both in space and time allowing dccrg to be used in very different types of simulations, for example in fluid and particle codes. Dccrg transfers the data between neighboring cells on different processes transparently and asynchronously allowing one to overlap computation and communication. This enables excellent scalability at least up to 32 k cores in magnetohydrodynamic tests depending on the problem and hardware. In the version of dccrg presented here part of the mesh metadata is replicated between MPI processes reducing the scalability of adaptive mesh refinement (AMR) to between 200 and 600 processes. Dccrg is free software that anyone can use, study and modify and is available at https://gitorious.org/dccrg. Users are also kindly requested to cite this work when publishing results obtained with dccrg. Catalogue identifier: AEOM_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOM_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: GNU Lesser General Public License version 3 No. of lines in distributed program, including test data, etc.: 54975 No. of bytes in distributed program, including test data, etc.: 974015 Distribution format: tar.gz Programming language: C++. Computer: PC, cluster, supercomputer. Operating system: POSIX. The code has been parallelized using MPI and tested with 1-32768 processes RAM: 10 MB-10 GB per process Classification: 4.12, 4.14, 6.5, 19.3, 19.10, 20. 
External routines: MPI-2 [1], boost [2], Zoltan [3], sfc++ [4] Nature of problem: Grid library supporting arbitrary data in grid cells, parallel adaptive mesh refinement, transparent remote neighbor data updates and load balancing. Solution method: The simulation grid is represented by an adjacency list (graph) with vertices stored into a hash table and edges into contiguous arrays. Message Passing Interface standard is used for parallelization. Cell data is given as a template parameter when instantiating the grid. Restrictions: Logically cartesian grid. Running time: Running time depends on the hardware, problem and the solution method. Small problems can be solved in under a minute and very large problems can take weeks. The examples and tests provided with the package take less than about one minute using default options. In the version of dccrg presented here the speed of adaptive mesh refinement is at most of the order of 10^6 total created cells per second. References: [1] http://www.mpi-forum.org/. [2] http://www.boost.org/. [3] K. Devine, E. Boman, R. Heaphy, B. Hendrickson, C. Vaughan, Zoltan data management services for parallel dynamic applications, Comput. Sci. Eng. 4 (2002) 90-97. http://dx.doi.org/10.1109/5992.988653. [4] https://gitorious.org/sfc++.
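
The solution method above (vertices in a hash table, edges in contiguous arrays) can be illustrated with a small sketch. This is not dccrg's C++ implementation; the class and method names are hypothetical, and a Python dict and list stand in for the hash table and the contiguous edge array.

```python
class CellGraph:
    """Adjacency-list grid: hash table of cells, one contiguous edge array."""

    def __init__(self):
        self.index = {}  # hash table: cell id -> (offset, count) into self.edges
        self.edges = []  # contiguous array holding every cell's neighbor ids
        self.data = {}   # arbitrary user data per cell

    def add_cell(self, cell_id, neighbors, data=None):
        if cell_id in self.index:
            raise KeyError(f"cell {cell_id} already exists")
        # Record where this cell's neighbors start and how many there are.
        self.index[cell_id] = (len(self.edges), len(neighbors))
        self.edges.extend(neighbors)
        self.data[cell_id] = data

    def neighbors_of(self, cell_id):
        offset, count = self.index[cell_id]
        return self.edges[offset:offset + count]


# A hypothetical one-dimensional grid of four cells with arbitrary cell data.
grid = CellGraph()
grid.add_cell(1, [2], data={"density": 1.0})
grid.add_cell(2, [1, 3], data={"density": 0.9})
grid.add_cell(3, [2, 4], data={"density": 0.8})
grid.add_cell(4, [3], data={"density": 0.7})
```

Keeping all edges in one array means a cell lookup costs one hash probe plus one slice, which is the appeal of this layout.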
Tough2_MP: A parallel version of TOUGH2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Keni; Wu, Yu-Shu; Ding, Chris

2003-04-09

TOUGH2_MP is a massively parallel version of TOUGH2. It was developed for running on distributed-memory parallel computers to simulate large problems that may not be solvable by the standard, single-CPU TOUGH2 code. The new code implements an efficient massively parallel scheme, while preserving the full capacity and flexibility of the original TOUGH2 code. The new software uses the METIS software package for grid partitioning and the AZTEC software package for linear-equation solving. The standard message-passing interface is adopted for communication among processors. Numerical performance of the current version of the code has been tested on CRAY-T3E and IBM RS/6000 SP platforms. In addition, the parallel code has been successfully applied to real field problems of multi-million-cell simulations for three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport. In this paper, we review the development of TOUGH2_MP and discuss its basic features, modules, and applications.
Improvement and speed optimization of numerical tsunami modelling program using OpenMP technology

NASA Astrophysics Data System (ADS)

Chernov, A.; Zaytsev, A.; Yalciner, A.; Kurkin, A.

2009-04-01

Currently, the basic problem of tsunami modeling is the low speed of calculations, which is unacceptable for operational notification services. Existing algorithms for numerical modeling of the hydrodynamic processes of tsunami waves were developed without taking advantage of modern computing facilities. Calculations can be accelerated considerably by using parallel algorithms. We discuss a new approach to parallelizing tsunami modeling code using OpenMP technology (for multiprocessing systems with shared memory). Multiprocessing systems are now easily accessible to everyone, and the cost of using them is much lower than the cost of clusters. This also allows programmers to apply multithreading algorithms on researchers' desktop computers. Another important advantage of this approach is the shared-memory mechanism: there is no need to send data over slow networks (for example, Ethernet). All memory is common to all computing processes, which results in almost linear scalability of the program and processes. In the new version of NAMI DANCE, OpenMP technology and a multithreading algorithm provide an 80% gain in speed over the single-thread version on a dual-processor unit. The gain increased to 320% on a four-core processor unit of a PC. Thus, it was possible to considerably reduce calculation time on scientific workstations (desktops) without a complete change of the program and user interfaces. Further modernization of the algorithms for preparing initial data and processing results using OpenMP looks reasonable. The final version of NAMI DANCE, with its increased computational speed, can be used not only for research purposes but also in real-time tsunami warning systems.
PCLIPS: Parallel CLIPS

NASA Technical Reports Server (NTRS)

Hall, Lawrence O.; Bennett, Bonnie H.; Tello, Ivan

1994-01-01

A parallel version of CLIPS 5.1 has been developed to run on Intel Hypercubes. The user interface is the same as that for CLIPS with some added commands to allow for parallel calls. A complete version of CLIPS runs on each node of the hypercube. The system has been instrumented to display the time spent in the match, recognize, and act cycles on each node. Only rule-level parallelism is supported. Parallel commands enable the assertion and retraction of facts to/from remote nodes' working memory. Parallel CLIPS was used to implement a knowledge-based command, control, communications, and intelligence (C(sup 3)I) system to demonstrate the fusion of high-level, disparate sources. We discuss the nature of the information fusion problem, our approach, and implementation. Parallel CLIPS has also been used to run several benchmark parallel knowledge bases, such as one to set up a cafeteria. Results from running Parallel CLIPS with parallel knowledge base partitions indicate that significant speed increases, including superlinear in some cases, are possible.
Use Computer-Aided Tools to Parallelize Large CFD Applications

NASA Technical Reports Server (NTRS)

Jin, H.; Frumkin, M.; Yan, J.

2000-01-01

Porting applications to high performance parallel computers is always a challenging task. It is time consuming and costly. With rapid progress in hardware architectures and the increasing complexity of real applications in recent years, the problem has become even more severe. Today, scalability and high performance mostly involve handwritten parallel programs using message-passing libraries (e.g. MPI). However, this process is very difficult and often error-prone. The recent reemergence of shared memory parallel (SMP) architectures, such as the cache coherent Non-Uniform Memory Access (ccNUMA) architecture used in the SGI Origin 2000, shows good prospects for scaling beyond hundreds of processors. Programming on an SMP is simplified by working in a globally accessible address space. The user can supply compiler directives, such as OpenMP, to parallelize the code. As an industry standard for portable implementation of parallel programs for SMPs, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared memory parallelism. It promises an incremental path for parallel conversion of existing software, as well as scalability and performance for a complete rewrite or an entirely new development. Perhaps the main disadvantage of programming with directives is that inserted directives may not necessarily enhance performance. In the worst cases, they can create erroneous results. While vendors have provided tools to perform error-checking and profiling, automation in directive insertion is very limited and often fails on large programs, primarily due to the lack of a thorough enough data dependence analysis. To overcome this deficiency, we have developed a toolkit, CAPO, to automatically insert OpenMP directives in Fortran programs and apply certain degrees of optimization.
CAPO is aimed at taking advantage of detailed inter-procedural dependence analysis provided by CAPTools, developed by the University of Greenwich, to reduce potential errors made by users. Earlier tests on NAS Benchmarks and ARC3D have demonstrated good success of this tool. In this study, we have applied CAPO to parallelize three large applications in the area of computational fluid dynamics (CFD): OVERFLOW, TLNS3D and INS3D. These codes are widely used for solving Navier-Stokes equations with complicated boundary conditions and turbulence models in multiple zones. Each one comprises from 50K to 1,00k lines of FORTRAN77. As an example, CAPO took 77 hours to complete the data dependence analysis of OVERFLOW on a workstation (SGI, 175MHz, R10K processor). A fair amount of effort was spent on correcting false dependencies due to lack of necessary knowledge during the analysis. Even so, CAPO provides an easy way for the user to interact with the parallelization process. The OpenMP version was generated within a day after the analysis was completed. Due to the sequential algorithms involved, code sections in TLNS3D and INS3D need to be restructured by hand to produce more efficient parallel codes. An included figure shows preliminary test results of the generated OVERFLOW with several test cases in a single zone. The MPI data points for the small test case were taken from a handcoded MPI version. As we can see, CAPO's version has achieved an 18-fold speedup on 32 nodes of the SGI O2K. For the small test case, it outperformed the MPI version. These results are very encouraging, but further work is needed. For example, although CAPO attempts to place directives on the outermost parallel loops in an interprocedural framework, it does not insert directives based on the best manual strategy. In particular, it lacks support for parallelization at the multi-zone level.
Future work will emphasize the development of a methodology that works at the multi-zone level and with a hybrid approach. Tools that perform more complicated code transformations also need to be developed.
Connectionist Models: Proceedings of the Summer School Held in San Diego, California on 1990

DTIC Science & Technology

1990-01-01

modes: control network continues activation spreading based There is the sequential version and the parallel version on the actual inputs instead of...ent). 2. Execute all motoric actions based on activations of r a ent.The parallel version of the algorithm is local in time, units in A. Update the...a- movements that help o recognize an entering person.) tions like ’move focus left’, ’rotate focus’ are based on the activations of the C’s output
Cognitive and affective control in a flanker word task: common and dissociable brain mechanisms.

PubMed

Alguacil, Sonia; Tudela, Pío; Ruz, María

2013-08-01

In the present study we compared the nature of cognitive and affective conflict modulations at different stages of information processing using electroencephalographic recordings. Participants performed a flanker task in which they had to focus on a central word target and indicate its semantic category (cognitive version) or its valence (affective version). Targets were flanked by congruent or incongruent words in both versions. Although tasks were equivalent at the behavioral level, event-related potentials (ERPs) showed common and dissociable cognitive and emotional conflict modulations. At early stages of information processing, both tasks generated parallel sequential conflict effects in the P1 and N170 potentials. Later, the N2 and the first part of the P3 wave were exclusively modulated by cognitive conflict, whereas the last section of the P3 deflection/Late Positive Component (LPC) was only involved in affective current conflict processing. Therefore, the whole data set suggests the existence of early common mechanisms that are equivalent for cognitive and affective materials and later task-specific conflict processing.
Partial Overhaul and Initial Parallel Optimization of KINETICS, a Coupled Dynamics and Chemistry Atmosphere Model

NASA Technical Reports Server (NTRS)

Nguyen, Howard; Willacy, Karen; Allen, Mark

2012-01-01

KINETICS is a coupled dynamics and chemistry atmosphere model that is data intensive and computationally demanding. The potential performance gain from using a supercomputer motivates the adaptation from a serial version to a parallelized one. Although the initial parallelization had been done, bottlenecks caused by an abundance of communication calls between processors led to an unfavorable drop in performance. Before starting on the parallel optimization process, a partial overhaul was required because a large emphasis was placed on streamlining the code for user convenience and revising the program to accommodate the new supercomputers at Caltech and JPL. After the first round of optimizations, the partial runtime was reduced by a factor of 23; however, performance gains are dependent on the size of the data, the number of processors requested, and the computer used.
Software Issues in High-Performance Computing and a Framework for the Development of HPC Applications

DTIC Science & Technology

1995-01-01

possible to determine communication points. For this version, a C program spawning Posix threads and using semaphores to synchronize would have to...performance such as the time required for network communication and synchronization as well as issues of asynchrony and memory hierarchy. For example...enhances reusability. Process (or task) parallel computations can also be succinctly expressed with a small set of process creation and synchronization
Acceleration of Semiempirical QM/MM Methods through Message Passage Interface (MPI), Hybrid MPI/Open Multiprocessing, and Self-Consistent Field Accelerator Implementations.

PubMed

Ojeda-May, Pedro; Nam, Kwangho

2017-08-08

The strategy and implementation of scalable and efficient semiempirical (SE) QM/MM methods in CHARMM are described. The serial version of the code was first profiled to identify routines that required parallelization. Afterward, the code was parallelized and accelerated with three approaches. The first approach was the parallelization of the entire QM/MM routines, including the Fock matrix diagonalization routines, using the CHARMM message passage interface (MPI) machinery. In the second approach, two different self-consistent field (SCF) energy convergence accelerators were implemented using density and Fock matrices as targets for their extrapolations in the SCF procedure. In the third approach, the entire QM/MM and MM energy routines were accelerated by implementing the hybrid MPI/open multiprocessing (OpenMP) model in which both the task- and loop-level parallelization strategies were adopted to balance loads between different OpenMP threads. The present implementation was tested on two solvated enzyme systems (including <100 QM atoms) and an SN2 symmetric reaction in water. The MPI version exceeded existing SE QM methods in CHARMM, which include the SCC-DFTB and SQUANTUM methods, by at least 4-fold. The use of SCF convergence accelerators further accelerated the code by ∼12-35% depending on the size of the QM region and the number of CPU cores used. Although the MPI version displayed good scalability, the performance was diminished for large numbers of MPI processes due to the overhead associated with MPI communications between nodes. This issue was partially overcome by the hybrid MPI/OpenMP approach which displayed a better scalability for a larger number of CPU cores (up to 64 CPUs in the tested systems).
DInSAR time series generation within a cloud computing environment: from ERS to Sentinel-1 scenario

NASA Astrophysics Data System (ADS)

Casu, Francesco; Elefante, Stefano; Imperatore, Pasquale; Lanari, Riccardo; Manunta, Michele; Zinno, Ivana; Mathot, Emmanuel; Brito, Fabrice; Farres, Jordi; Lengert, Wolfgang

2013-04-01

One of the techniques that will strongly benefit from the advent of the Sentinel-1 system is Differential SAR Interferometry (DInSAR), which has proven to be an effective tool to detect and monitor ground displacements with centimetre accuracy. The geoscience communities (volcanology, seismicity, …), as well as those related to hazard monitoring and risk mitigation, make extensive use of the DInSAR technique and will take advantage of the huge amount of SAR data acquired by Sentinel-1. Indeed, such information will permit the generation of Earth's surface displacement maps and time series both over large areas and long time spans. However, managing, processing and analysing the large Sentinel data stream is envisaged by the scientific community to be a major bottleneck, particularly during crisis phases. The emerging need to create a common ecosystem in which data, results and processing tools are shared is envisaged to be a successful way to address this problem and to contribute to the spreading of information and knowledge. The Supersites initiative as well as the ESA SuperSites Exploitation Platform (SSEP) and the ESA Cloud Computing Operational Pilot (CIOP) projects provide effective answers to this need and are pushing towards the development of such an ecosystem. It is clear that all existing tools for querying, processing and analysing SAR data need to be not only updated to manage the large data stream of the Sentinel-1 satellite, but also reorganized to respond quickly to simultaneous and highly demanding user requests, mainly during emergency situations. This translates into the automatic and unsupervised processing of large amounts of data as well as the availability of scalable, widely accessible and high performance computing capabilities.
The cloud computing environment makes it possible to achieve all of these objectives, particularly in case of spike and peak requests for processing resources linked to disaster events. This work presents a parallel computational model for the widely used DInSAR algorithm named Small BAseline Subset (SBAS), which has been implemented within the cloud computing environment provided by the ESA-CIOP platform. This activity has resulted in a scalable, unsupervised, portable, and widely accessible (through a web portal) parallel DInSAR computational tool. The activity has rewritten and developed the SBAS application algorithm within a parallel system environment, i.e., in a form that allows us to benefit from multiple processing units. This requires devising a parallel version of the SBAS algorithm and subsequently implementing it, which implies additional complexity in algorithm design and efficient multi-processor programming, with the final aim of optimizing parallel performance. Although the presented algorithm has been designed to work with Sentinel-1 data, it can also process other satellite SAR data (ERS, ENVISAT, CSK, TSX, ALOS). The performance of the implemented SBAS parallel version has been tested on the full ASAR archive (64 acquisitions) acquired over the Napoli Bay, a volcanic and densely urbanized area in Southern Italy. The full processing, from the raw data download to the generation of DInSAR time series, has been carried out by engaging 4 nodes, each one with 2 cores and 16 GB of RAM, and has taken about 36 hours, compared with about 135 hours for the sequential version. Extensive analysis on other test areas significant from the DInSAR and geophysical viewpoints will be presented. Finally, a preliminary performance evaluation of the presented approach within the Sentinel-1 scenario will be provided.
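
The reported run times allow a rough speedup and parallel-efficiency estimate. The arithmetic below is only a sketch: both times are approximate in the abstract, and it assumes all 8 cores (4 nodes with 2 cores each) were busy for the whole run, which the abstract does not state.

```python
def speedup(t_sequential, t_parallel):
    """Ratio of sequential to parallel run time."""
    return t_sequential / t_parallel


def parallel_efficiency(t_sequential, t_parallel, n_units):
    """Speedup divided by the number of processing units used."""
    return speedup(t_sequential, t_parallel) / n_units


# Figures quoted in the abstract (both stated as approximate).
t_seq_hours = 135.0
t_par_hours = 36.0
nodes, cores_per_node = 4, 2
n_cores = nodes * cores_per_node  # assumption: every core was used

s = speedup(t_seq_hours, t_par_hours)
e = parallel_efficiency(t_seq_hours, t_par_hours, n_cores)
print(f"speedup = {s:.2f}, efficiency on {n_cores} cores = {e:.2f}")
```

Under these assumptions the speedup is 3.75 and the efficiency is a little under one half. That is plausible for a pipeline that includes data download and other steps that are hard to parallelize.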
Parallelization of a Fully-Distributed Hydrologic Model using Sub-basin Partitioning

NASA Astrophysics Data System (ADS)

Vivoni, E. R.; Mniszewski, S.; Fasel, P.; Springer, E.; Ivanov, V. Y.; Bras, R. L.

2005-12-01

A primary obstacle towards advances in watershed simulations has been the limited computational capacity available to most models. The growing trend of model complexity, data availability and physical representation has not been matched by adequate developments in computational efficiency. This situation has created a serious bottleneck which limits existing distributed hydrologic models to small domains and short simulations. In this study, we present novel developments in the parallelization of a fully-distributed hydrologic model. Our work is based on the TIN-based Real-time Integrated Basin Simulator (tRIBS), which provides continuous hydrologic simulation using a multiple resolution representation of complex terrain based on a triangulated irregular network (TIN). While the use of TINs reduces computational demand, the sequential version of the model is currently limited over large basins (>10,000 km²) and long simulation periods (>1 year). To address this, a parallel MPI-based version of the tRIBS model has been implemented and tested using high performance computing resources at Los Alamos National Laboratory. Our approach utilizes domain decomposition based on sub-basin partitioning of the watershed. A stream reach graph based on the channel network structure is used to guide the sub-basin partitioning. Individual sub-basins or sub-graphs of sub-basins are assigned to separate processors to carry out internal hydrologic computations (e.g. rainfall-runoff transformation). Routed streamflow from each sub-basin forms the major hydrologic data exchange along the stream reach graph. Individual sub-basins also share subsurface hydrologic fluxes across adjacent boundaries. We demonstrate how the sub-basin partitioning provides computational feasibility and efficiency for a set of test watersheds in northeastern Oklahoma. We compare the performance of the sequential and parallelized versions to highlight the efficiency gained as the number of processors increases.
We also discuss how the coupled use of TINs and parallel processing can lead to feasible long-term simulations in regional watersheds while preserving basin properties at high-resolution.
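
The sub-basin decomposition can be illustrated with a toy example. This is not the tRIBS partitioner: it swaps in a simple greedy load balancer that ignores the reach graph when placing sub-basins, whereas the abstract says the graph guides the partitioning. The sub-basin names, costs, and function names are hypothetical. The reach graph is used here only to list the streamflow exchanges that would cross processor boundaries.

```python
def assign_subbasins(costs, n_procs):
    """Greedy balancing: place the costliest sub-basin on the least-loaded processor."""
    loads = [0.0] * n_procs
    owner = {}
    for basin in sorted(costs, key=lambda b: (-costs[b], b)):
        proc = loads.index(min(loads))
        owner[basin] = proc
        loads[proc] += costs[basin]
    return owner, loads


def cross_processor_flows(downstream, owner):
    """Reach-graph edges whose routed streamflow must be sent between processors."""
    return sorted(
        (up, down)
        for up, down in downstream.items()
        if down is not None and owner[up] != owner[down]
    )


# Hypothetical stream reach graph: each sub-basin drains to one downstream sub-basin.
downstream = {"A": "C", "B": "C", "C": "E", "D": "E", "E": None}
# Hypothetical relative computational cost of each sub-basin.
costs = {"A": 4, "B": 3, "C": 2, "D": 5, "E": 1}

owner, loads = assign_subbasins(costs, n_procs=2)
exchanges = cross_processor_flows(downstream, owner)
```

A graph-aware partitioner would try to keep connected sub-basins on the same processor so that fewer edges end up in `exchanges`.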
A distributed Clips implementation: dClips

NASA Technical Reports Server (NTRS)

Li, Y. Philip

1993-01-01

A distributed version of the Clips language, dClips, was implemented on top of two existing generic distributed messaging systems to show that: (1) it is easy to create a coarse-grained parallel programming environment out of an existing language if a high level messaging system is used; and (2) the computing model of a parallel programming environment can be changed easily if we change the underlying messaging system. dClips processes were first connected with a simple master-slave model. A client-server model with intercommunicating agents was later implemented. The concept of service broker is being investigated.
PEGASUS 5: An Automated Pre-Processor for Overset-Grid CFD

NASA Technical Reports Server (NTRS)

Suhs, Norman E.; Rogers, Stuart E.; Dietz, William E.; Kwak, Dochan (Technical Monitor)

2002-01-01

An all new, automated version of the PEGASUS software has been developed and tested. PEGASUS provides the hole-cutting and connectivity information between overlapping grids, and is used as the final part of the grid generation process for overset-grid computational fluid dynamics approaches. The new PEGASUS code (Version 5) has many new features: automated hole cutting; a projection scheme for fixing gaps in overset surfaces; more efficient interpolation search methods using an alternating digital tree; hole-size optimization based on adding additional layers of fringe points; and an automatic restart capability. The new code has also been parallelized using the Message Passing Interface standard. The parallelization performance provides efficient speed-up of the execution time by an order of magnitude, and up to a factor of 30 for very large problems. The results of three example cases are presented: a three-element high-lift airfoil, a generic business jet configuration, and a complete Boeing 777-200 aircraft in a high-lift landing configuration. Comparisons of the computed flow fields for the airfoil and 777 test cases between the old and new versions of the PEGASUS codes show excellent agreement with each other and with experimental results.
TOUGH3: A new efficient version of the TOUGH suite of multiphase flow and transport simulators

NASA Astrophysics Data System (ADS)

Jung, Yoojin; Pau, George Shu Heng; Finsterle, Stefan; Pollyea, Ryan M.

2017-11-01

The TOUGH suite of nonisothermal multiphase flow and transport simulators has been updated by various developers over many years to address a vast range of challenging subsurface problems. The increasing complexity of the simulated processes as well as the growing size of model domains that need to be handled call for an improvement in the simulator's computational robustness and efficiency. Moreover, modifications have frequently been introduced independently, resulting in multiple versions of TOUGH that (1) led to inconsistencies in feature implementation and usage, (2) made code maintenance and development inefficient, and (3) caused confusion to users and developers. TOUGH3, a new base version of TOUGH, addresses these issues. It consolidates both the serial (TOUGH2 V2.1) and parallel (TOUGH2-MP V2.0) implementations, enabling simulations to be performed on desktop computers and supercomputers using a single code. New PETSc parallel linear solvers are added to the existing serial solvers of TOUGH2 and the Aztec solver used in TOUGH2-MP. The PETSc solvers generally perform better than the Aztec solvers in parallel and the internal TOUGH3 linear solver in serial. TOUGH3 also incorporates many new features, addresses bugs, and improves the flexibility of data handling. Due to the improved capabilities and usability, TOUGH3 is more robust and efficient for solving tough and computationally demanding problems in diverse scientific and practical applications related to subsurface flow modeling.
A Family of ACO Routing Protocols for Mobile Ad Hoc Networks.

PubMed

Rupérez Cañas, Delfín; Sandoval Orozco, Ana Lucila; García Villalba, Luis Javier; Kim, Tai-Hoon

2017-05-22

In this work, an ACO routing protocol for mobile ad hoc networks based on AntHocNet is specified. Like its predecessor, this new protocol, called AntOR, is hybrid in the sense that it contains elements from both reactive and proactive routing. Specifically, it combines a reactive route setup process with a proactive route maintenance and improvement process. Key aspects of the AntOR protocol are the disjoint-link and disjoint-node routes, the separation between the regular pheromone and the virtual pheromone in the diffusion process, and the exploration of routes, taking into consideration the number of hops in the best routes. A family of ACO routing protocols based on AntOR is also specified. These protocols are successive refinements of the base protocol. We also present a parallelized version of AntOR that we call PAntOR. Programmed for shared-memory multiprocessor architectures, PAntOR runs tasks in parallel using threads. This parallelization is applicable in the route setup phase, the route local repair process and link failure notification. In addition, a variant of PAntOR with more than one interface, which we call PAntOR-MI (PAntOR-Multiple Interface), is specified. This approach uses threads to parallelize the sending of broadcast messages, one per interface.
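
The per-interface broadcast described for PAntOR-MI can be sketched with one thread per interface. This is an illustration under stated assumptions, not the protocol's implementation: the interface names, the message, and the `fake_send` function are hypothetical stand-ins for real network sends.

```python
import threading


def broadcast_on_all_interfaces(interfaces, message, send):
    """Send `message` on every interface, using one thread per interface."""
    threads = [
        threading.Thread(target=send, args=(iface, message))
        for iface in interfaces
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every interface has finished sending


# Hypothetical stand-in for a real network send: it only records what was sent.
sent = []
sent_lock = threading.Lock()


def fake_send(iface, message):
    with sent_lock:  # shared memory between threads, so guard the shared list
        sent.append((iface, message))


broadcast_on_all_interfaces(["wlan0", "wlan1"], "LINK_FAILURE", fake_send)
```

The order in which the interfaces finish is not deterministic, so any check on `sent` should sort it first.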
Pluto Haze

NASA Image and Video Library

2015-09-10

Two different versions of an image of Pluto's haze layers, taken by New Horizons as it looked back at Pluto's dark side nearly 16 hours after close approach, from a distance of 480,000 miles (770,000 kilometers), at a phase angle of 166 degrees. Pluto's north is at the top, and the sun illuminates Pluto from the upper right. These images are much higher quality than the digitally compressed images of Pluto's haze downlinked and released shortly after the July 14 encounter, and allow many new details to be seen. The left version has had only minor processing, while the right version has been specially processed to reveal a large number of discrete haze layers in the atmosphere. In the left version, faint surface details on the narrow sunlit crescent are seen through the haze in the upper right of Pluto's disk, and subtle parallel streaks in the haze may be crepuscular rays: shadows cast on the haze by topography such as mountain ranges on Pluto, similar to the rays sometimes seen in the sky after the sun sets behind mountains on Earth. http://photojournal.jpl.nasa.gov/catalog/PIA19880
Algorithms and programming tools for image processing on the MPP, part 2

NASA Technical Reports Server (NTRS)

Reeves, Anthony P.

1986-01-01

A number of algorithms were developed for image warping and pyramid image filtering. Techniques were investigated for the parallel processing of a large number of independent irregular shaped regions on the MPP. In addition some utilities for dealing with very long vectors and for sorting were developed. Documentation pages for the algorithms which are available for distribution are given. The performance of the MPP for a number of basic data manipulations was determined. From these results it is possible to predict the efficiency of the MPP for a number of algorithms and applications. The Parallel Pascal development system, which is a portable programming environment for the MPP, was improved and better documentation including a tutorial was written. This environment allows programs for the MPP to be developed on any conventional computer system; it consists of a set of system programs and a library of general purpose Parallel Pascal functions. The algorithms were tested on the MPP and a presentation on the development system was made to the MPP users group. The UNIX version of the Parallel Pascal System was distributed to a number of new sites.
CRUNCH_PARALLEL

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shumaker, Dana E.; Steefel, Carl I.

The code CRUNCH_PARALLEL is a parallel version of the CRUNCH code. CRUNCH code version 2.0 was previously released by LLNL (UCRL-CODE-200063). CRUNCH is a general purpose reactive transport code developed by Carl Steefel and Yabusaki (Steefel and Yabusaki, 1996). The code handles non-isothermal transport and reaction in one, two, and three dimensions. The reaction algorithm is generic in form, handling an arbitrary number of aqueous and surface complexation reactions as well as mineral dissolution/precipitation. A standardized database is used containing thermodynamic and kinetic data. The code includes advective, dispersive, and diffusive transport.
Hyperswitch communication network

NASA Technical Reports Server (NTRS)

Peterson, J.; Pniel, M.; Upchurch, E.

1991-01-01

The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus based multiprocessors. The design of the HCN operating system is a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state of the art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed.

Distributed File System Utilities to Manage Large Datasets Version 0.5

DOE Office of Scientific and Technical Information (OSTI.GOV)

2014-05-21

FileUtils provides a suite of tools to manage large datasets typically created by large parallel MPI applications. They are written in C and use standard POSIX I/O calls. The current suite consists of tools to copy, compare, remove, and list. The tools provide dramatic speedup over existing Linux tools, which often run as a single process.
Automated Generation of Message-Passing Programs: An Evaluation Using CAPTools

NASA Technical Reports Server (NTRS)

Hribar, Michelle R.; Jin, Haoqiang; Yan, Jerry C.; Saini, Subhash (Technical Monitor)

1998-01-01

Scientists at NASA Ames Research Center have been developing computational aeroscience applications on highly parallel architectures over the past ten years. During that same time period, a steady transition of hardware and system software also occurred, forcing us to expend great effort on migrating and re-coding our applications. As applications and machine architectures become increasingly complex, the cost and time required for this process will become prohibitive. In this paper, we present the first set of results in our evaluation of interactive parallelization tools. In particular, we evaluate CAPTools' ability to parallelize computational aeroscience applications. CAPTools was tested on serial versions of the NAS Parallel Benchmarks and ARC3D, a computational fluid dynamics application, on two platforms: the SGI Origin 2000 and the Cray T3E. This evaluation includes performance, amount of user interaction required, limitations and portability. Based on these results, a discussion on the feasibility of computer aided parallelization of aerospace applications is presented along with suggestions for future work.
Parallel transformation of K-SVD solar image denoising algorithm

NASA Astrophysics Data System (ADS)

Liang, Youwen; Tian, Yu; Li, Mei

2017-02-01

Images obtained by observing the Sun through a large telescope always suffer from noise due to the low SNR. The K-SVD denoising algorithm can effectively remove Gaussian white noise. Training dictionaries for sparse representations is a time-consuming task, due to the large size of the data involved and the complexity of the training algorithms. In this paper, OpenMP is used to transform the serial algorithm into a parallel version, following a data-parallel model. The biggest change is that multiple atoms, rather than a single atom, are updated simultaneously. The denoising effect and acceleration performance are tested after completion of the parallel algorithm. The speedup of the program is 13.563 when using 16 cores. This parallel version can fully utilize multi-core CPU hardware resources, greatly reduces running time, and is easy to port to other multi-core platforms.
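As a quick check on the reported figure, a speedup of 13.563 on 16 cores corresponds to a parallel efficiency of roughly 85%. The short Python sketch below works through that arithmetic. It also estimates the implied serial fraction (about 1.2%) under the assumption that Amdahl's law describes the program; that assumption and the function names are illustrative and do not come from the abstract.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def amdahl_serial_fraction(speedup, cores):
    """Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / p).

    Solving for f gives f = (p / S - 1) / (p - 1). Applying Amdahl's law
    here is an assumption of this sketch, not a claim made in the abstract.
    """
    if cores < 2:
        raise ValueError("need at least two cores to infer a serial fraction")
    return (cores / speedup - 1.0) / (cores - 1.0)


if __name__ == "__main__":
    speedup, cores = 13.563, 16  # figures reported in the abstract
    print(f"efficiency      = {parallel_efficiency(speedup, cores):.3f}")
    print(f"serial fraction = {amdahl_serial_fraction(speedup, cores):.4f}")
```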
Distributed and parallel Ada and the Ada 9X recommendations

NASA Technical Reports Server (NTRS)

Volz, Richard A.; Goldsack, Stephen J.; Theriault, R.; Waldrop, Raymond S.; Holzbacher-Valero, A. A.

1992-01-01

Recently, the DoD has sponsored work towards a new version of Ada, intended to support the construction of distributed systems. The revised version, often called Ada 9X, will become the new standard sometime in the 1990s. It is intended that Ada 9X should provide language features giving limited support for distributed system construction. The requirements for such features are given. Many of the most advanced computer applications involve embedded systems that are composed of parallel processors or networks of distributed computers. If Ada is to become the widely adopted language envisioned by many, it is essential that suitable compilers and tools be available to facilitate the creation of distributed and parallel Ada programs for these applications. The major language issues impacting distributed and parallel programming are reviewed, and some principles upon which distributed/parallel language systems should be built are suggested. Based upon these, alternative language concepts for distributed/parallel programming are analyzed.
GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

NASA Astrophysics Data System (ADS)

Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

2017-08-01

The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. 
With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.
The GBS code for tokamak scrape-off layer simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Halpern, F.D., E-mail: federico.halpern@epfl.ch; Ricci, P.; Jolliet, S.

2016-06-15

We describe a new version of GBS, a 3D global, flux-driven plasma turbulence code to simulate the turbulent dynamics in the tokamak scrape-off layer (SOL), superseding the code presented by Ricci et al. (2012) [14]. The present work is driven by the objective of studying SOL turbulent dynamics in medium size tokamaks and beyond with a high-fidelity physics model. We emphasize an intertwining framework of improved physics models and the computational improvements that allow them. The model extensions include neutral atom physics, finite ion temperature, the addition of a closed field line region, and a non-Boussinesq treatment of the polarization drift. GBS has been completely refactored with the introduction of a 3-D Cartesian communicator and a scalable parallel multigrid solver. We report dramatically enhanced parallel scalability, with the possibility of treating electromagnetic fluctuations very efficiently. The method of manufactured solutions as a verification process has been carried out for this new code version, demonstrating the correct implementation of the physical model.
A Stream Tilling Approach to Surface Area Estimation for Large Scale Spatial Data in a Shared Memory System

NASA Astrophysics Data System (ADS)

Liu, Jiping; Kang, Xiaochen; Dong, Chun; Xu, Shenghua

2017-12-01

Surface area estimation is a widely used tool for resource evaluation in the physical world. When processing large scale spatial data, the input/output (I/O) can easily become the bottleneck in parallelizing the algorithm due to the limited physical memory resources and the very slow disk transfer rate. In this paper, we propose a stream tilling approach to surface area estimation that first decomposes a spatial data set into tiles with topological expansions. With these tiles, the one-to-one mapping relationship between the input and the computing process is broken. Then, we realize a streaming framework for scheduling the I/O processes and computing units. Each computing unit encapsulates an identical copy of the estimation algorithm, and multiple asynchronous computing units can work independently in parallel. Finally, the experiment demonstrates that our stream tilling estimation can efficiently alleviate the heavy pressure from the I/O-bound work, and that the measured speedup after optimization greatly outperforms that of the directly parallelized versions in shared memory systems with multi-core processors.
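The idea of tiles with expansions can be illustrated with a small Python sketch. It is not the authors' implementation: the unit grid spacing, the two-triangles-per-cell area formula, the row-band tiling, and all function names are assumptions made for illustration. Each tile carries one extra row of elevations so that cells on a tile boundary are counted exactly once, and independent workers process tiles in any order.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def triangle_area(p, q, r):
    """Area of a 3D triangle from the cross product of two edge vectors."""
    ux, uy, uz = (q[i] - p[i] for i in range(3))
    vx, vy, vz = (r[i] - p[i] for i in range(3))
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def tile_area(rows):
    """Surface area of one tile; every grid cell is split into two triangles."""
    total = 0.0
    for r in range(len(rows) - 1):
        for c in range(len(rows[r]) - 1):
            a = (c, r, rows[r][c])
            b = (c + 1, r, rows[r][c + 1])
            d = (c, r + 1, rows[r + 1][c])
            e = (c + 1, r + 1, rows[r + 1][c + 1])
            total += triangle_area(a, b, d) + triangle_area(b, e, d)
    return total


def stream_tiles(dem, rows_per_tile):
    """Yield row bands; each band is expanded by one row shared with the next."""
    for start in range(0, len(dem) - 1, rows_per_tile):
        yield dem[start:start + rows_per_tile + 1]


def estimate_area(dem, rows_per_tile, workers=4):
    """Sum tile areas computed by independent workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Note: Executor.map submits every tile up front; a real streaming
        # framework would bound the number of tiles held in memory.
        return sum(pool.map(tile_area, stream_tiles(dem, rows_per_tile)))
```

Because of the one-row expansion, the tiled sum matches the area computed over the whole grid at once, whatever band height is chosen.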
Graphics Processing Unit (GPU) implementation of image processing algorithms to improve system performance of the Control, Acquisition, Processing, and Image Display System (CAPIDS) of the Micro-Angiographic Fluoroscope (MAF).

PubMed

Vasan, S N Swetadri; Ionita, Ciprian N; Titus, A H; Cartwright, A N; Bednarek, D R; Rudin, S

2012-02-23

We present the image processing upgrades implemented on a Graphics Processing Unit (GPU) in the Control, Acquisition, Processing, and Image Display System (CAPIDS) for the custom Micro-Angiographic Fluoroscope (MAF) detector. Most of the image processing currently implemented in the CAPIDS system is pixel independent; that is, the operation on each pixel is the same and the operation on one does not depend upon the result from the operation on the other, allowing the entire image to be processed in parallel. GPU hardware was developed for this kind of massive parallel processing implementation. Thus for an algorithm which has a high amount of parallelism, a GPU implementation is much faster than a CPU implementation. The image processing algorithm upgrades implemented on the CAPIDS system include flat field correction, temporal filtering, image subtraction, roadmap mask generation and display window and leveling. A comparison between the previous and the upgraded version of CAPIDS has been presented, to demonstrate how the improvement is achieved. By performing the image processing on a GPU, significant improvements (with respect to timing or frame rate) have been achieved, including stable operation of the system at 30 fps during a fluoroscopy run, a DSA run, a roadmap procedure and automatic image windowing and leveling during each frame.
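To make the notion of a pixel-independent operation concrete, here is a minimal Python sketch of flat field correction. The abstract does not give the formula used in CAPIDS; the common form `gain * (raw - dark) / (flat - dark)` is assumed here, and the function names are hypothetical. Each output pixel depends only on the same pixel in the inputs, so rows (or, on a GPU, individual pixels) can be handled by separate workers.

```python
from concurrent.futures import ThreadPoolExecutor


def correct_row(job):
    """Correct one image row; no pixel depends on any other pixel's result."""
    raw_row, dark_row, flat_row, gain = job
    return [gain * (r - d) / (f - d)
            for r, d, f in zip(raw_row, dark_row, flat_row)]


def flat_field_correct(raw, dark, flat, workers=4):
    """Flat field correction over a 2D image stored as a list of rows.

    Assumes flat != dark at every pixel. The gain is the mean of
    (flat - dark), which keeps the corrected image at the original scale.
    """
    diffs = [f - d for f_row, d_row in zip(flat, dark)
             for f, d in zip(f_row, d_row)]
    gain = sum(diffs) / len(diffs)
    jobs = [(r, d, f, gain) for r, d, f in zip(raw, dark, flat)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(correct_row, jobs))
```

Threads in CPython will not speed up this arithmetic; the sketch only shows the independence structure that a GPU exploits by assigning one hardware thread per pixel.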
An efficient parallel-processing method for transposing large matrices in place.

PubMed

Portnoff, M R

1999-01-01

We have developed an efficient algorithm for transposing large matrices in place. The algorithm is efficient because data are accessed either sequentially in blocks or randomly within blocks small enough to fit in cache, and because the same indexing calculations are shared among identical procedures operating on independent subsets of the data. This inherent parallelism makes the method well suited for a multiprocessor computing environment. The algorithm is easy to implement because the same two procedures are applied to the data in various groupings to carry out the complete transpose operation. Using only a single processor, we have demonstrated nearly an order of magnitude increase in speed over the previously published algorithm by Cate and Twigg for transposing a large rectangular matrix in place. With multiple processors operating in parallel, the processing speed increases almost linearly with the number of processors. A simplified version of the algorithm for square matrices is presented as well as an extension for matrices large enough to require virtual memory.
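For readers unfamiliar with the problem, the following Python sketch shows the classic cycle-following approach to transposing a rectangular matrix in place. It is a baseline illustration only, not the block-based method of this paper, and the function name is hypothetical. For a row-major matrix with `rows * cols = n` elements, the element at flat index `i` (for `0 < i < n - 1`) moves to index `(i * rows) % (n - 1)`; the first and last elements stay put.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a row-major rows x cols matrix stored in the flat list `a`.

    Uses O(1) extra storage: a cycle is rotated only when visited from its
    smallest index, at the cost of re-walking cycles to find that index.
    """
    n = rows * cols
    if len(a) != n:
        raise ValueError("list length must equal rows * cols")
    last = n - 1
    for start in range(1, last):
        # Walk the cycle until we return to `start` or meet a smaller index.
        j = (start * rows) % last
        while j > start:
            j = (j * rows) % last
        if j < start:
            continue  # this cycle was already rotated from a smaller index
        # Rotate the cycle: each element is carried to its destination.
        carried = a[start]
        j = (start * rows) % last
        while j != start:
            a[j], carried = carried, a[j]
            j = (j * rows) % last
        a[start] = carried
```

The scattered, data-dependent accesses of this baseline are exactly what makes it slow on large matrices, which is the motivation for accessing data in cache-sized blocks as the abstract describes.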
Vector processing efficiency of plasma MHD codes by use of the FACOM 230-75 APU

NASA Astrophysics Data System (ADS)

Matsuura, T.; Tanaka, Y.; Naraoka, K.; Takizuka, T.; Tsunematsu, T.; Tokuda, S.; Azumi, M.; Kurita, G.; Takeda, T.

1982-06-01

In the framework of pipelined vector architecture, the efficiency of vector processing is assessed with respect to plasma MHD codes in nuclear fusion research. By using a vector processor, the FACOM 230-75 APU, the limit of the enhancement factor due to parallelism of current vector machines is examined for three numerical codes based on a fluid model. Reasonable speed-up factors of approximately 6, 6, and 4 times faster than the highly optimized scalar version are obtained for ERATO (linear stability code), AEOLUS-R1 (nonlinear stability code) and APOLLO (1-1/2D transport code), respectively. Problems of the pipelined vector processors are discussed from the viewpoint of restructuring, optimization and choice of algorithms. In conclusion, the important concept of "concurrency within pipelined parallelism" is emphasized.
2nd-Order CESE Results For C1.4: Vortex Transport by Uniform Flow

NASA Technical Reports Server (NTRS)

Friedlander, David J.

2015-01-01

The Conservation Element and Solution Element (CESE) method was used as implemented in the NASA research code ez4d. The CESE method is a time accurate formulation with flux-conservation in both space and time. The method treats the discretized derivatives of space and time identically, and while high-order versions exist, the 2nd-order accurate version was used. The ez4d code is an unstructured Navier-Stokes solver coded in C++ with serial and parallel versions available. As part of its architecture, ez4d can use multi-threading and the Message Passing Interface (MPI) for parallel runs.
Flood predictions using the parallel version of distributed numerical physical rainfall-runoff model TOPKAPI

NASA Astrophysics Data System (ADS)

Boyko, Oleksiy; Zheleznyak, Mark

2015-04-01

The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI (Todini et al., 1996-2014) was developed and implemented in Ukraine. The parallel version of the code has been developed recently for use on multiprocessor systems: multicore/multiprocessor PCs and clusters. The algorithm is based on binary-tree decomposition of the watershed to balance the amount of computation across all processors/cores. The Message Passing Interface (MPI) is used as the parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated in case studies of flood predictions for the mountain watersheds of the Ukrainian Carpathian region. The modeling results are compared with predictions based on lumped-parameter models.
Parallel Ada benchmarks for the SVMS

NASA Technical Reports Server (NTRS)

Collard, Philippe E.

1990-01-01

The use of the parallel processing paradigm to design and develop faster and more reliable computers appears to clearly mark the future of information processing. NASA started the development of such an architecture: the Spaceborne VHSIC Multi-processor System (SVMS). Ada will be one of the languages used to program the SVMS. One of the unique characteristics of Ada is that it supports parallel processing at the language level through the tasking constructs. It is important for the SVMS project team to assess how efficiently the SVMS architecture will be implemented, as well as how efficiently the Ada environment will be ported to the SVMS. AUTOCLASS II, a Bayesian classifier written in Common Lisp, was selected as one of the benchmarks for SVMS configurations. The purpose of the R and D effort was to provide the SVMS project team with a version of AUTOCLASS II, written in Ada, that would make use of Ada tasking constructs as much as possible so as to constitute a suitable benchmark. Additionally, a set of programs was developed that would measure Ada tasking efficiency on parallel architectures as well as determine the critical parameters influencing tasking efficiency. All this was designed to provide the SVMS project team with a set of suitable tools in the development of the SVMS architecture.
A Family of ACO Routing Protocols for Mobile Ad Hoc Networks

PubMed Central

Rupérez Cañas, Delfín; Sandoval Orozco, Ana Lucila; García Villalba, Luis Javier; Kim, Tai-hoon

2017-01-01

In this work, an ACO routing protocol for mobile ad hoc networks based on AntHocNet is specified. Like its predecessor, this new protocol, called AntOR, is hybrid in the sense that it contains elements from both reactive and proactive routing. Specifically, it combines a reactive route setup process with a proactive route maintenance and improvement process. Key aspects of the AntOR protocol are the disjoint-link and disjoint-node routes, the separation between the regular pheromone and the virtual pheromone in the diffusion process, and the exploration of routes that takes into consideration the number of hops in the best routes. In this work, a family of ACO routing protocols based on AntOR is also specified. These protocols are obtained by successive refinements of the protocol. We also present a parallelized version of AntOR that we call PAntOR. Targeting multiprocessor architectures based on shared memory, PAntOR runs tasks in parallel using threads. This parallelization is applicable in the route setup phase, the route local repair process and link failure notification. In addition, a variant of PAntOR that uses more than one interface, which we call PAntOR-MI (PAntOR-Multiple Interface), is specified. This approach parallelizes the sending of broadcast messages by interface through threads. PMID:28531159
Telephone word-list recall tested in the rural aging and memory study: two parallel versions for the TICS-M.

PubMed

Hogervorst, Eva; Bandelow, Stephan; Hart, John; Henderson, Victor W

2004-09-01

Parallel versions of memory tasks are useful in clinical and research settings to reduce practice effects engendered by multiple administrations. We aimed to investigate the usefulness of three parallel versions of ten-item word list recall tasks administered by telephone. A population based telephone survey of middle-aged and elderly residents of Bradley County, Arkansas was carried out as part of the Rural Aging and Memory Study (RAMS). Participants in the study were 1845 persons aged 40 to 95 years. Word lists included that used in the telephone interview of cognitive status (TICS) as a criterion standard and two newly developed lists. The mean age of participants was 61.05 (SD 12.44) years; 39.5% were over age 65. 78% of the participants had completed high school, 66% were women and 21% were African-American. There was no difference in demographic characteristics between groups receiving different word list versions, and performances on the three versions were equivalent for both immediate (mean 4.22, SD 1.53) and delayed (mean 2.35 SD 1.75) recall trials. The total memory score (immediate+delayed recall) was negatively associated with older age (beta = -0.41, 95%CI=-0.11 to -0.04), lower education (beta = 0.24, 95%CI = 0.36 to 0.51), male gender (beta = -0.18, 95%CI = -1.39 to -0.90) and African-American race (beta = -0.15, 95%CI = -1.41 to -0.82). The two RAMS word recall lists and the TICS word recall list can be used interchangeably in telephone assessment of memory of middle-aged and elderly persons. This finding is important for future studies where parallel versions of a word-list memory task are needed. (250 words).
Parallel community climate model: Description and user's guide

DOE Office of Scientific and Technical Information (OSTI.GOV)

Drake, J.B.; Flanery, R.E.; Semeraro, B.D.

This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user's guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
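The patch decomposition described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the PCCM2 code: the grid dimensions, the processor layout, the one-patch-per-processor assignment, and the function names are all hypothetical.

```python
def block_ranges(n, parts):
    """Split range(n) into `parts` contiguous ranges differing by at most one."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def decompose(nlat, nlon, plat, plon):
    """Map each processor rank to its (latitude range, longitude range) patch.

    Assumes a plat x plon logical processor grid with one patch per rank.
    """
    patches = {}
    for i, lat in enumerate(block_ranges(nlat, plat)):
        for j, lon in enumerate(block_ranges(nlon, plon)):
            patches[i * plon + j] = (lat, lon)
    return patches


if __name__ == "__main__":
    # Hypothetical 64 x 128 latitude-longitude grid on 4 x 8 = 32 processors.
    for rank, (lat, lon) in sorted(decompose(64, 128, 4, 8).items())[:3]:
        print(rank, lat, lon)
```

With such a layout, column physics at a grid point touches only data inside the owning patch, which is why the abstract can state that the physics calculations are local to a processor.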
Crashworthiness simulations with DYNA3D

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schauer, D.A.; Hoover, C.G.; Kay, G.J.

1996-04-01

Current progress in parallel algorithm research and applications in vehicle crash simulation is described for the explicit, finite element algorithms in DYNA3D. Problem partitioning methods and parallel algorithms for contact at material interfaces are the two challenging algorithm research problems that are addressed. Two prototype parallel contact algorithms have been developed for treating the cases of local and arbitrary contact. Demonstration problems for local contact are crashworthiness simulations with 222 locally defined contact surfaces and a vehicle/barrier collision modeled with arbitrary contact. A simulation of crash tests conducted for a vehicle impacting a U-channel small sign post embedded in soil has been run on both the serial and parallel versions of DYNA3D. A significant reduction in computational time has been observed when running these problems on the parallel version. However, to achieve maximum efficiency, complex problems must be appropriately partitioned, especially when contact dominates the computation.
The Snow Data System at NASA JPL

NASA Astrophysics Data System (ADS)

Laidlaw, R.; Painter, T. H.; Mattmann, C. A.; Ramirez, P.; Brodzik, M. J.; Rittger, K.; Bormann, K. J.; Burgess, A. B.; Zimdars, P.; McGibbney, L. J.; Goodale, C. E.; Joyce, M.

2015-12-01

The Snow Data System at NASA JPL includes a data processing pipeline built with open source software, Apache 'Object Oriented Data Technology' (OODT). It produces a variety of data products using inputs from satellites such as MODIS, VIIRS and Landsat. Processing is carried out in parallel across a high-powered computing cluster. Algorithms such as 'Snow Covered Area and Grain-size' (SCAG) and 'Dust Radiative Forcing in Snow' (DRFS) are applied to satellite inputs to produce output images that are used by many scientists and institutions around the world. This poster will describe the Snow Data System, its outputs and their uses and applications, along with recent advancements to the system and plans for the future. Advancements for 2015 include automated daily processing of historic MODIS data for SCAG (MODSCAG) and DRFS (MODDRFS), automation of SCAG processing for VIIRS satellite inputs (VIIRSCAG) and an updated version of SCAG for Landsat Thematic Mapper inputs (TMSCAG) that takes advantage of Graphics Processing Units (GPUs) for faster processing speeds. The pipeline has been upgraded to use the latest version of OODT and its workflows have been streamlined to enable computer operators to process data on demand. Additional products have been added, such as rolling 8-day composites of MODSCAG data, a new version of the MODSCAG 'annual minimum ice and snow extent' (MODICE) product, and recoded MODSCAG data for the 'Satellite Snow Product Intercomparison and Evaluation Experiment' (SnowPEx) project.
A compositional reservoir simulator on distributed memory parallel computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rame, M.; Delshad, M.

1995-12-31

This paper presents the application of distributed memory parallel computers to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/860 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes porting to new parallel platforms straightforward. Results of the distributed memory computing performance of the parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for the same problems on a vector supercomputer is also presented.
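The extended subdomains mentioned above are commonly called halo or ghost cells. The following Python sketch shows the pattern in one dimension; it is a simplified stand-in, not UTCHEM code. The 3-point smoothing stencil, the fixed boundary value, and the function names are assumptions, and the "exchange" is a plain list copy where a distributed code would pass messages.

```python
def split_with_halo(field, parts):
    """Split a 1D field into contiguous subdomains, each padded with one
    ghost cell per side (assumes len(field) >= parts)."""
    base, extra = divmod(len(field), parts)
    subs, start = [], 0
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        subs.append([0.0] + list(field[start:start + size]) + [0.0])
        start += size
    return subs


def exchange_halos(subs, boundary):
    """Fill ghost cells from neighbouring interiors (stands in for messages)."""
    for k, sub in enumerate(subs):
        sub[0] = subs[k - 1][-2] if k > 0 else boundary
        sub[-1] = subs[k + 1][1] if k < len(subs) - 1 else boundary


def smooth_local(sub):
    """Apply a 3-point stencil to the interior cells of one subdomain."""
    sub[1:-1] = [0.25 * sub[i - 1] + 0.5 * sub[i] + 0.25 * sub[i + 1]
                 for i in range(1, len(sub) - 1)]


def smooth_decomposed(field, parts, steps, boundary=0.0):
    """Run `steps` stencil sweeps on the decomposed field and gather the result."""
    subs = split_with_halo(field, parts)
    for _ in range(steps):
        exchange_halos(subs, boundary)
        for sub in subs:  # each subdomain would live on its own processor
            smooth_local(sub)
    return [x for sub in subs for x in sub[1:-1]]


def smooth_serial(field, steps, boundary=0.0):
    """Reference implementation on the undivided field."""
    cur = list(field)
    for _ in range(steps):
        padded = [boundary] + cur + [boundary]
        cur = [0.25 * padded[i - 1] + 0.5 * padded[i] + 0.25 * padded[i + 1]
               for i in range(1, len(padded) - 1)]
    return cur
```

Because every stencil evaluation sees the same neighbour values as in the serial sweep, the decomposed result matches the serial one for any number of subdomains.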
Parallelization and Visual Analysis of Multidimensional Fields: Application to Ozone Production, Destruction, and Transport in Three Dimensions

NASA Technical Reports Server (NTRS)

Schwan, Karsten; Alyea, Fred; Ribarsky, M. William; Trauner, Mary; Eisenhauer, Greg; Jean, Yves; Gu, Weiming; Wang, Ray; Waldrop, Jeffrey; Schroeder, Beth;

1996-01-01

The three-dimensional, spectral transport model used in the current project was first successfully integrated over climatological time scales by Dr. Guang Ping Lou for the simulation of atmospheric N2O using the United Kingdom Meteorological Office (UKMO) 4-dimensional, assimilated wind and temperature data set. A non-parallel, FORTRAN version of this integration using a fairly simple N2O chemistry package containing only photo-chemical reactions was used to verify our initial parallel model results. The integrations reproduced the gross features of the observed stratospheric climatological N2O distributions but also simulated the structure of the stratospheric Antarctic vortex and its evolution. Subsequently, Dr. Thomas Kindler, who produced much of the parallel version of our model, enlarged the N2O model chemistry package to include N2O reactions involving O(¹D) and also introduced assimilated wind data from NASA as well as UKMO. Initially, transport calculations without chemistry were run using Carbon-14 as a non-reactive tracer gas with the result that large differences in the transport properties of the two assimilated wind data sets were apparent from the resultant Carbon-14 distributions. Subsequent calculations for N2O, including its chemistry, with the two input wind data sets with verification from UARS satellite observations have refined the transport differences between the two such that the model's steering capabilities could be used to infer the correct climatological vertical velocity fields required to support the N2O observations. During this process, it was also discovered that both the NASA and the UKMO data contained spurious values in some of the higher frequency wave components, leading to incorrect local transport calculations and ultimately affecting the large scale properties of the model's N2O distributions, particularly at tropical latitudes.
Subsequent model runs with wind data that had been filtered to remove some of the high frequency components produced much more realistic N2O distributions. During the past few months, the UKMO wind data base for a complete two-year period was processed into spectral form for model use. This new version of the input transport data base now includes complete temperature fields as well as the necessary wind data. This was done to facilitate advanced chemical calculations in the parallel model which often depend upon temperature. Additional UKMO data is being added as it becomes available.

Time Warp Operating System, Version 2.5.1

NASA Technical Reports Server (NTRS)

Bellenot, Steven F.; Gieselman, John S.; Hawley, Lawrence R.; Peterson, Judy; Presley, Matthew T.; Reiher, Peter L.; Springer, Paul L.; Tupman, John R.; Wedel, John J., Jr.; Wieland, Frederick P.;

1993-01-01

Time Warp Operating System, TWOS, is special purpose computer program designed to support parallel simulation of discrete events. Complete implementation of Time Warp software mechanism, which implements distributed protocol for virtual synchronization based on rollback of processes and annihilation of messages. Supports simulations and other computations in which both virtual time and dynamic load balancing used. Program utilizes underlying resources of operating system. Written in C programming language.
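The rollback idea behind Time Warp can be shown with a deliberately small Python sketch. It is not TWOS code, and it models a single logical process only: the class, its fields, and the toy state update are hypothetical. Anti-messages (the annihilation mechanism) and global virtual time are omitted. Events are executed optimistically as they arrive; when a straggler with an earlier timestamp shows up, saved state is restored and the later events are re-executed in timestamp order.

```python
class LogicalProcess:
    """One optimistic simulation process with state saving and rollback."""

    def __init__(self):
        self.lvt = 0         # local virtual time
        self.state = 0       # toy model state
        self.processed = []  # (timestamp, value, state before the event)
        self.rollbacks = 0

    def _execute(self, timestamp, value):
        self.processed.append((timestamp, value, self.state))
        # An order-dependent update, so wrong event order changes the result.
        self.state = self.state * 2 + value
        self.lvt = timestamp

    def receive(self, timestamp, value):
        """Handle an incoming event, rolling back if it is a straggler."""
        redo = []
        if timestamp < self.lvt:
            self.rollbacks += 1
            # Undo every event that was executed with a later timestamp.
            while self.processed and self.processed[-1][0] > timestamp:
                ts, val, before = self.processed.pop()
                self.state = before
                redo.append((ts, val))
        self._execute(timestamp, value)
        for ts, val in reversed(redo):  # re-execute in timestamp order
            self._execute(ts, val)
```

In a full Time Warp system, rolling back would also cancel messages the undone events had sent to other processes, which is the role of annihilation described in the abstract.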

Accelerated Adaptive MGS Phase Retrieval

NASA Technical Reports Server (NTRS)

Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang

2011-01-01

The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm to allow parallel processing of certain applications that apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited for this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time. The exploit involves performing matrix calculations in nVidia graphic cards. The graphical processor unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model is used, called CUDA, to transparently scale multicore parallelism in hardware. This technology gives computationally intensive applications access to the processing power of the nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies, to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphic cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.
Design of a real-time wind turbine simulator using a custom parallel architecture

NASA Technical Reports Server (NTRS)

Hoffman, John A.; Gluck, R.; Sridhar, S.

1995-01-01

The design of a new parallel-processing digital simulator is described. The new simulator has been developed specifically for analysis of wind energy systems in real time. The new processor has been named the Wind Energy System Time-domain simulator, version 3 (WEST-3). Like previous WEST versions, WEST-3 performs many computations in parallel. The modules in WEST-3 are pure digital processors, however. These digital processors can be programmed individually and operated in concert to achieve real-time simulation of wind turbine systems. Because of this programmability, WEST-3 is much more flexible and general than its two predecessors. The design features of WEST-3 are described to show how the system produces high-speed solutions of nonlinear time-domain equations. WEST-3 has two very fast Computational Units (CU's) that use minicomputer technology plus special architectural features that make them many times faster than a microcomputer. These CU's are needed to perform the complex computations associated with the wind turbine rotor system in real time. The parallel architecture of the CU causes several tasks to be done in each cycle, including an IO operation and the combination of a multiply, add, and store. The WEST-3 simulator can be expanded at any time for additional computational power. This is possible because the CU's are interfaced to each other and to other portions of the simulation using special serial buses. These buses can be 'patched' together in essentially any configuration (in a manner very similar to the programming methods used in analog computation) to balance the input/output requirements. CU's can be added in any number to share a given computational load. This flexible bus feature is very different from many other parallel processors, which usually have a throughput limit because of rigid bus architecture.
Parallel Fortran-MPI software for numerical inversion of the Laplace transform and its application to oscillatory water levels in groundwater environments

USGS Publications Warehouse

Zhan, X.

2005-01-01

Parallel Fortran-MPI (Message Passing Interface) software for numerical inversion of the Laplace transform based on a Fourier series method is developed to meet the need of solving computationally intensive problems involving the response of oscillatory water levels to hydraulic tests in a groundwater environment. The software is a parallel version of ACM (The Association for Computing Machinery) Transactions on Mathematical Software (TOMS) Algorithm 796. Running 38 test examples indicated that implementation of MPI techniques with distributed memory architecture speeds up the processing and improves the efficiency. Applications to oscillatory water levels in a well during aquifer tests are presented to illustrate how this package can be applied to solve complicated environmental problems involving differential and integral equations. The package is free and is easy to use for people with little or no previous experience in using MPI but who wish to get off to a quick start in parallel computing. © 2004 Elsevier Ltd. All rights reserved.
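
To make the Fourier series approach concrete, here is a minimal sketch of the plain, unaccelerated series f(t) ≈ (e^{at}/T) [F(a)/2 + Σ_k Re(F(a + ikπ/T) e^{ikπt/T})], valid for 0 < t < 2T. It is not Algorithm 796 or the package described above: the function names and the parameters `a`, `T`, `n_terms` and `n_workers` are illustrative assumptions, and the "workers" are plain loop chunks standing in for MPI ranks whose partial sums would be combined by a reduction.

```python
import cmath
import math


def partial_sum(F, t, a, T, k_start, k_stop):
    """Sum the series terms k_start..k_stop-1; one chunk per (hypothetical) rank."""
    total = 0.0
    for k in range(k_start, k_stop):
        s = complex(a, k * math.pi / T)
        total += (F(s) * cmath.exp(1j * k * math.pi * t / T)).real
    return total


def invert_laplace(F, t, a=2.0, T=2.0, n_terms=2000, n_workers=4):
    """Approximate f(t) from its transform F(s) for 0 < t < 2T."""
    chunk = math.ceil(n_terms / n_workers)
    partials = []
    for w in range(n_workers):
        lo = 1 + w * chunk
        hi = min(lo + chunk, n_terms + 1)
        # In an MPI code each rank would evaluate only its own chunk.
        partials.append(partial_sum(F, t, a, T, lo, hi))
    # Combining the partial sums plays the role of a reduction across ranks.
    series = 0.5 * F(complex(a, 0.0)).real + sum(partials)
    return math.exp(a * t) / T * series


if __name__ == "__main__":
    # F(s) = 1/(s + 1) is the transform of f(t) = exp(-t).
    approx = invert_laplace(lambda s: 1.0 / (s + 1.0), 1.0)
    print(approx, math.exp(-1.0))
```

The terms are independent of one another, which is why the summation splits cleanly across processes. The plain series converges slowly, so production codes typically add a convergence accelerator.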
Overcoming rule-based rigidity and connectionist limitations through massively-parallel case-based reasoning

NASA Technical Reports Server (NTRS)

Barnden, John; Srinivas, Kankanahalli

1990-01-01

Symbol manipulation as used in traditional Artificial Intelligence has been criticized by neural net researchers for being excessively inflexible and sequential. On the other hand, the application of neural net techniques to the types of high-level cognitive processing studied in traditional artificial intelligence presents major problems as well. A promising way out of this impasse is to build neural net models that accomplish massively parallel case-based reasoning. Case-based reasoning, which has received much attention recently, is essentially the same as analogy-based reasoning, and avoids many of the problems leveled at traditional artificial intelligence. Further problems are avoided by doing many strands of case-based reasoning in parallel, and by implementing the whole system as a neural net. In addition, such a system provides an approach to some aspects of the problems of noise, uncertainty and novelty in reasoning systems. The current neural net system (Conposit), which performs standard rule-based reasoning, is being modified into a massively parallel case-based reasoning version.
OpenGeoSys-GEMS: Hybrid parallelization of a reactive transport code with MPI and threads

NASA Astrophysics Data System (ADS)

Kosakowski, G.; Kulik, D. A.; Shao, H.

2012-04-01

OpenGeoSys-GEMS is a generic purpose reactive transport code based on the operator splitting approach. The code couples the Finite-Element groundwater flow and multi-species transport modules of the OpenGeoSys (OGS) project (http://www.ufz.de/index.php?en=18345) with the GEM-Selektor research package to model thermodynamic equilibrium of aquatic (geo)chemical systems utilizing the Gibbs Energy Minimization approach (http://gems.web.psi.ch/). The combination of OGS and the GEM-Selektor kernel (GEMS3K) is highly flexible due to the object-oriented modular code structures and the well defined (memory based) data exchange modules. Like other reactive transport codes, the practical applicability of OGS-GEMS is often hampered by the long calculation time and large memory requirements.
• For realistic geochemical systems which might include dozens of mineral phases and several (non-ideal) solid solutions the time needed to solve the chemical system with GEMS3K may increase exceptionally.
• The codes are coupled in a sequential non-iterative loop. In order to keep the accuracy, the time step size is restricted. In combination with a fine spatial discretization the time step size may become very small which increases calculation times drastically even for small 1D problems.
• The current version of OGS is not optimized for memory use and the MPI version of OGS does not distribute data between nodes. Even for moderately small 2D problems the number of MPI processes that fit into memory of up-to-date workstations or HPC hardware is limited.
One strategy to overcome the above mentioned restrictions of OGS-GEMS is to parallelize the coupled code. For OGS a parallelized version already exists. It is based on a domain decomposition method implemented with MPI and provides a parallel solver for fluid and mass transport processes.
In the coupled code, after solving fluid flow and solute transport, geochemical calculations are done in form of a central loop over all finite element nodes with calls to GEMS3K and consecutive calculations of changed material parameters. In a first step the existing MPI implementation was utilized to parallelize this loop. Calculations were split between the MPI processes and afterwards data was synchronized by using MPI communication routines. Furthermore, multi-threaded calculation of the loop was implemented with help of the boost thread library (http://www.boost.org). This implementation provides a flexible environment to distribute calculations between several threads. For each MPI process at least one and up to several dozens of worker threads are spawned. These threads do not replicate the complete OGS-GEM data structure and use only a limited amount of memory. Calculation of the central geochemical loop is shared between all threads. Synchronization between the threads is done by barrier commands. The overall number of local threads times MPI processes should match the number of available computing nodes. The combination of multi-threading and MPI provides an effective and flexible environment to speed up OGS-GEMS calculations while limiting the required memory use. Test calculations on different hardware show that for certain types of applications tremendous speedups are possible.
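
The thread-level part of this scheme, a node loop shared between worker threads that meet at a barrier, can be sketched as follows. The sketch is in Python rather than C++ with boost threads, and `solve_chemistry` is a hypothetical stand-in for a per-node GEMS3K call, so it shows the structure only and says nothing about the real OGS-GEMS code. Standard CPython threads also do not run Python bytecode concurrently, so no speedup should be expected from it.

```python
import threading


def solve_chemistry(concentration):
    """Hypothetical stand-in for a per-node equilibrium calculation."""
    return concentration * 0.5


def run_geochemical_loop(node_values, n_threads=4):
    results = [None] * len(node_values)
    normalized = [None] * len(node_values)
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        # Phase 1: each thread handles every n_threads-th node of the loop.
        for i in range(tid, len(node_values), n_threads):
            results[i] = solve_chemistry(node_values[i])
        # Wait until all threads have finished their share of the loop.
        barrier.wait()
        # Phase 2: a step that needs the complete set of phase-1 results.
        total = sum(results)
        for i in range(tid, len(node_values), n_threads):
            normalized[i] = results[i] / total

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results, normalized


if __name__ == "__main__":
    res, norm = run_geochemical_loop([2.0, 4.0, 6.0, 8.0, 10.0], n_threads=2)
    print(res)
    print(norm)
```

The threads write into shared lists instead of copying the whole data set, which mirrors the abstract's point that worker threads do not replicate the complete data structure.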
Time-dependent density-functional theory in massively parallel computer architectures: the octopus project

NASA Astrophysics Data System (ADS)

Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A.; Oliveira, Micael J. T.; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G.; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A. L.

2012-06-01

Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.
mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OPENMP

NASA Astrophysics Data System (ADS)

Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo

2012-02-01

We have revised a general purpose parallel molecular dynamics simulation program mm_par using the object-oriented programming. We parallelized the revised version using a hierarchical scheme in order to utilize more processors for a given system size. The benchmark result will be presented here.
New version program summary
Program title: mm_par2.0
Catalogue identifier: ADXP_v2_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADXP_v2_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 2 390 858
No. of bytes in distributed program, including test data, etc.: 25 068 310
Distribution format: tar.gz
Programming language: C++
Computer: Any system operated by Linux or Unix
Operating system: Linux
Classification: 7.7
External routines: We provide wrappers for FFTW [1], Intel MKL library [2] FFT routine, and Numerical recipes [3] FFT, random number generator, and eigenvalue solver routines, SPRNG [4] random number generator, Mersenne Twister [5] random number generator, space filling curve routine.
Catalogue identifier of previous version: ADXP_v1_0
Journal reference of previous version: Comput. Phys. Comm. 174 (2006) 560
Does the new version supersede the previous version?: Yes
Nature of problem: Structural, thermodynamic, and dynamical properties of fluids and solids from microscopic scales to mesoscopic scales.
Solution method: Molecular dynamics simulation in NVE, NVT, and NPT ensemble, Langevin dynamics simulation, dissipative particle dynamics simulation.
Reasons for new version: First, object-oriented programming has been used, which is known to be open for extension and closed for modification. It is also known to be better for maintenance. Second, version 1.0 was based on atom decomposition and domain decomposition scheme [6] for parallelization. However, atom decomposition is not popular due to its poor scalability. On the other hand, domain decomposition scheme is better for scalability. It still has a limitation in utilizing a large number of cores on recent petascale computers due to the requirement that the domain size is larger than the potential cutoff distance. To go beyond such a limitation, a hierarchical parallelization scheme has been adopted in this new version and implemented using MPI [7] and OPENMP [8].
Summary of revisions: (1) Object-oriented programming has been used. (2) A hierarchical parallelization scheme has been adopted. (3) SPME routine has been fully parallelized with parallel 3D FFT using volumetric decomposition scheme [9].
K.J.O. thanks Mr. Seung Min Lee for useful discussion on programming and debugging.
Running time: Running time depends on system size and methods used. For test system containing a protein (PDB id: 5DHFR) with CHARMM22 force field [10] and 7023 TIP3P [11] waters in simulation box having dimension 62.23 Å×62.23 Å×62.23 Å, the benchmark results are given in Fig. 1. Here the potential cutoff distance was set to 12 Å and the switching function was applied from 10 Å for the force calculation in real space. For the SPME [12] calculation, K, K, and K were set to 64 and the interpolation order was set to 4. To do the fast Fourier transform, we used Intel MKL library. All bonds including hydrogen atoms were constrained using SHAKE/RATTLE algorithms [13,14]. The code was compiled using Intel compiler version 11.1 and mvapich2 version 1.5. Fig. 2 shows performance gains from using CUDA-enabled version [15] of mm_par for 5DHFR simulation in water on Intel Core2Quad 2.83 GHz and GeForce GTX 580. Even though mm_par2.0 is not ported yet for GPU, its performance data would be useful to expect mm_par2.0 performance on GPU.
Fig. 1 caption: Timing results for 1000 MD steps. 1, 2, 4, and 8 in the figure mean the number of OPENMP threads.
Fig. 2 caption: Timing results for 1000 MD steps from double precision simulation on CPU, single precision simulation on GPU, and double precision simulation on GPU.
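
The scaling limit described in this summary can be checked with a short worked calculation using the quoted box size (62.23 Å) and cutoff (12 Å). The sketch assumes cubic domains whose edge must be at least the cutoff, one MPI process per domain, and one core per OpenMP thread. These are simplifying assumptions for illustration, not statements about how mm_par2.0 actually lays out its domains, and the function names are hypothetical.

```python
import math


def max_domains_per_axis(box_length, cutoff):
    """Largest number of equal slabs along one axis whose width is >= cutoff."""
    return max(1, math.floor(box_length / cutoff))


def max_cores(box_length, cutoff, threads_per_domain):
    """Upper bound on usable cores for a cubic box under the stated assumptions."""
    per_axis = max_domains_per_axis(box_length, cutoff)
    return per_axis ** 3 * threads_per_domain


if __name__ == "__main__":
    # Box edge and cutoff are taken from the benchmark description above.
    for threads in (1, 2, 4, 8):
        print(threads, max_cores(62.23, 12.0, threads))
```

Under these assumptions pure domain decomposition stops at 5 × 5 × 5 = 125 domains, and adding 8 threads per domain raises the bound to 1000 cores.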
Xyce release and distribution management : version 1.2.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hutchinson, Scott Alan; Williamson, Charles Michael

2003-10-01

This document presents a high-level description of the Xyce™ Parallel Electronic Simulator Release and Distribution Management Process. The purpose of this process is to standardize the manner in which all Xyce software products progress toward release and how releases are made available to customers. Rigorous Release Management will assure that Xyce releases are created in such a way that the elements comprising the release are traceable and the release itself is reproducible. Distribution Management describes what is to be done with a Xyce release that is eligible for distribution.
Scalable Unix commands for parallel processors : a high-performance implementation.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ong, E.; Lusk, E.; Gropp, W.

2001-06-22

We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
Multi-language translation and cross-cultural adaptation of the OARSI/OMERACT measure of intermittent and constant osteoarthritis pain (ICOAP).

PubMed

Maillefert, J F; Kloppenburg, M; Fernandes, L; Punzi, L; Günther, K-P; Martin Mola, E; Lohmander, L S; Pavelka, K; Lopez-Olivo, M A; Dougados, M; Hawker, G A

2009-10-01

To conduct a multi-language translation and cross-cultural adaptation of the Intermittent and Constant OsteoArthritis Pain (ICOAP) questionnaire for hip and knee osteoarthritis (OA). The questionnaires were translated and cross-culturally adapted in parallel, using a common protocol, into the following languages: Czech, Dutch, French (France), German, Italian, Norwegian, Spanish (Castillan), North and Central American Spanish, Swedish. The process was conducted following five steps: (1)--independent translation into the target language by two or three persons; (2)--consensus meeting to obtain a single preliminary translated version; (3)--backward translation by an independent bilingual native English speaker, blinded to the English original version; (4)--final version produced by a multidisciplinary consensus committee; (5)--pre-testing of the final version with 10-20 target-language-native hip and knee OA patients. The process could be followed and completed in all countries. Only slight differences were identified in the structure of the sentences between the original and the translated versions. A large majority of the patients felt that the questionnaire was easy to understand and complete. Only a few minor criticisms were expressed. Moreover, a majority of patients found the concepts of constant pain and pain that comes and goes to be of a great pertinence and were very happy with the distinction. The ICOAP questionnaire is now available for multi-center international studies.
Accelerating Fibre Orientation Estimation from Diffusion Weighted Magnetic Resonance Imaging Using GPUs

PubMed Central

Hernández, Moisés; Guerrero, Ginés D.; Cecilia, José M.; García, José M.; Inuggi, Alberto; Jbabdi, Saad; Behrens, Timothy E. J.; Sotiropoulos, Stamatios N.

2013-01-01

With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. Modern graphics processing units (GPUs) are massively parallel processors that can execute simultaneously thousands of light-weight processes. In this study, we propose and implement a parallel GPU-based design of a popular method that is used for the analysis of brain magnetic resonance imaging (MRI). More specifically, we are concerned with a model-based approach for extracting tissue structural information from diffusion-weighted (DW) MRI data. DW-MRI offers, through tractography approaches, the only way to study brain structural connectivity, non-invasively and in-vivo. We parallelise the Bayesian inference framework for the ball & stick model, as it is implemented in the tractography toolbox of the popular FSL software package (University of Oxford). For our implementation, we utilise the Compute Unified Device Architecture (CUDA) programming model. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude, when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation. PMID:23658616
A biconjugate gradient type algorithm on massively parallel architectures

NASA Technical Reports Server (NTRS)

Freund, Roland W.; Hochbruck, Marlis

1991-01-01

The biconjugate gradient (BCG) method is the natural generalization of the classical conjugate gradient algorithm for Hermitian positive definite matrices to general non-Hermitian linear systems. Unfortunately, the original BCG algorithm is susceptible to possible breakdowns and numerical instabilities. Recently, Freund and Nachtigal have proposed a novel BCG type approach, the quasi-minimal residual method (QMR), which overcomes the problems of BCG. Here, an implementation is presented of QMR based on an s-step version of the nonsymmetric look-ahead Lanczos algorithm. The main feature of the s-step Lanczos algorithm is that, in general, all inner products, except for one, can be computed in parallel at the end of each block; this is unlike the other standard Lanczos process where inner products are generated sequentially. The resulting implementation of QMR is particularly attractive on massively parallel SIMD architectures, such as the Connection Machine.
Aztec user's guide. Version 1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hutchinson, S.A.; Shadid, J.N.; Tuminaro, R.S.

1995-10-01

Aztec is an iterative library that greatly simplifies the parallelization process when solving the linear systems of equations Ax = b where A is a user supplied n x n sparse matrix, b is a user supplied vector of length n and x is a vector of length n to be computed. Aztec is intended as a software tool for users who want to avoid cumbersome parallel programming details but who have large sparse linear systems which require an efficiently utilized parallel processing system. A collection of data transformation tools is provided that allows for easy creation of distributed sparse unstructured matrices for parallel solution. Once the distributed matrix is created, computation can be performed on any of the parallel machines running Aztec: nCUBE 2, IBM SP2 and Intel Paragon, MPI platforms as well as standard serial and vector platforms. Aztec includes a number of Krylov iterative methods such as conjugate gradient (CG), generalized minimum residual (GMRES) and stabilized biconjugate gradient (BICGSTAB) to solve systems of equations. These Krylov methods are used in conjunction with various preconditioners such as polynomial or domain decomposition methods using LU or incomplete LU factorizations within subdomains. Although the matrix A can be general, the package has been designed for matrices arising from the approximation of partial differential equations (PDEs). In particular, the Aztec package is oriented toward systems arising from PDE applications.
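
For readers unfamiliar with the Krylov methods named here, the following is a minimal, unpreconditioned conjugate gradient sketch for a small symmetric positive definite system. It does not use or imitate the Aztec API: the function names are illustrative, the matrix is stored as dense Python lists rather than a distributed sparse format, and everything runs serially.

```python
def dot(u, v):
    # In a distributed solver this inner product is a global reduction.
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    # Dense matrix-vector product; a parallel library would use sparse storage.
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the zero initial guess
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    print(conjugate_gradient(A, b))   # exact solution is [1/11, 7/11]
```

The matrix-vector product and the inner products are the steps that need communication in a parallel setting.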
Optimized R functions for analysis of ecological community data using the R virtual laboratory (RvLab)

PubMed Central

Varsos, Constantinos; Patkos, Theodore; Pavloudi, Christina; Gougousis, Alexandros; Ijaz, Umer Zeeshan; Filiopoulou, Irene; Pattakos, Nikolaos; Vanden Berghe, Edward; Fernández-Guerra, Antonio; Faulwetter, Sarah; Chatzinikolaou, Eva; Pafilis, Evangelos; Bekiari, Chryssoula; Doerr, Martin; Arvanitidis, Christos

2016-01-01

Background: Parallel data manipulation using R has previously been addressed by members of the R community, however most of these studies produce ad hoc solutions that are not readily available to the average R user. Our targeted users, ranging from the expert ecologist/microbiologists to computational biologists, often experience difficulties in finding optimal ways to exploit the full capacity of their computational resources. In addition, improving performance of commonly used R scripts becomes increasingly difficult especially with large datasets. Furthermore, the implementations described here can be of significant interest to expert bioinformaticians or R developers. Therefore, our goals can be summarized as: (i) description of a complete methodology for the analysis of large datasets by combining capabilities of diverse R packages, (ii) presentation of their application through a virtual R laboratory (RvLab) that makes execution of complex functions and visualization of results easy and readily available to the end-user.
New information: In this paper, the novelty stems from implementations of parallel methodologies which rely on the processing of data on different levels of abstraction and the availability of these processes through an integrated portal. Parallel implementation R packages, such as the pbdMPI (Programming with Big Data – Interface to MPI) package, are used to implement Single Program Multiple Data (SPMD) parallelization on primitive mathematical operations, allowing for interplay with functions of the vegan package. The dplyr and RPostgreSQL R packages are further integrated offering connections to dataframe like objects (databases) as secondary storage solutions whenever memory demands exceed available RAM resources.
The RvLab is running on a PC cluster, using R version 3.1.2 (2014-10-31) on an x86_64-pc-linux-gnu (64-bit) platform, and offers an intuitive virtual environment interface enabling users to perform analysis of ecological and microbial communities based on optimized vegan functions. A beta version of the RvLab is available after registration at: https://portal.lifewatchgreece.eu/ PMID:27932907
Optimized R functions for analysis of ecological community data using the R virtual laboratory (RvLab).

PubMed

Varsos, Constantinos; Patkos, Theodore; Oulas, Anastasis; Pavloudi, Christina; Gougousis, Alexandros; Ijaz, Umer Zeeshan; Filiopoulou, Irene; Pattakos, Nikolaos; Vanden Berghe, Edward; Fernández-Guerra, Antonio; Faulwetter, Sarah; Chatzinikolaou, Eva; Pafilis, Evangelos; Bekiari, Chryssoula; Doerr, Martin; Arvanitidis, Christos

2016-01-01

Parallel data manipulation using R has previously been addressed by members of the R community, however most of these studies produce ad hoc solutions that are not readily available to the average R user. Our targeted users, ranging from the expert ecologist/microbiologists to computational biologists, often experience difficulties in finding optimal ways to exploit the full capacity of their computational resources. In addition, improving performance of commonly used R scripts becomes increasingly difficult especially with large datasets. Furthermore, the implementations described here can be of significant interest to expert bioinformaticians or R developers. Therefore, our goals can be summarized as: (i) description of a complete methodology for the analysis of large datasets by combining capabilities of diverse R packages, (ii) presentation of their application through a virtual R laboratory (RvLab) that makes execution of complex functions and visualization of results easy and readily available to the end-user. In this paper, the novelty stems from implementations of parallel methodologies which rely on the processing of data on different levels of abstraction and the availability of these processes through an integrated portal. Parallel implementation R packages, such as the pbdMPI (Programming with Big Data - Interface to MPI) package, are used to implement Single Program Multiple Data (SPMD) parallelization on primitive mathematical operations, allowing for interplay with functions of the vegan package. The dplyr and RPostgreSQL R packages are further integrated offering connections to dataframe like objects (databases) as secondary storage solutions whenever memory demands exceed available RAM resources. 
The RvLab is running on a PC cluster, using R version 3.1.2 (2014-10-31) on an x86_64-pc-linux-gnu (64-bit) platform, and offers an intuitive virtual environment interface enabling users to perform analysis of ecological and microbial communities based on optimized vegan functions. A beta version of the RvLab is available after registration at: https://portal.lifewatchgreece.eu/.
Free-electron laser simulations on the MPP

NASA Technical Reports Server (NTRS)

Vonlaven, Scott A.; Liebrock, Lorie M.

1987-01-01

Free electron lasers (FELs) are of interest because they provide high power, high efficiency, and broad tunability. FEL simulations can make efficient use of computers of the Massively Parallel Processor (MPP) class because most of the processing consists of applying a simple equation to a set of identical particles. A test version of the KMS Fusion FEL simulation, which resides mainly in the MPP's host computer and only partially in the MPP, has run successfully.
A highly parallel multigrid-like method for the solution of the Euler equations

NASA Technical Reports Server (NTRS)

Tuminaro, Ray S.

1989-01-01

We consider a highly parallel multigrid-like method for the solution of the two dimensional steady Euler equations. The new method, introduced as filtering multigrid, is similar to a standard multigrid scheme in that convergence on the finest grid is accelerated by iterations on coarser grids. In the filtering method, however, additional fine grid subproblems are processed concurrently with coarse grid computations to further accelerate convergence. These additional problems are obtained by splitting the residual into a smooth and an oscillatory component. The smooth component is then used to form a coarse grid problem (similar to standard multigrid) while the oscillatory component is used for a fine grid subproblem. The primary advantage in the filtering approach is that fewer iterations are required and that most of the additional work per iteration can be performed in parallel with the standard coarse grid computations. We generalize the filtering algorithm to a version suitable for nonlinear problems. We emphasize that this generalization is conceptually straight-forward and relatively easy to implement. In particular, no explicit linearization (e.g., formation of Jacobians) needs to be performed (similar to the FAS multigrid approach). We illustrate the nonlinear version by applying it to the Euler equations, and presenting numerical results. Finally, a performance evaluation is made based on execution time models and convergence information obtained from numerical experiments.
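
The residual splitting at the heart of this idea can be illustrated in one dimension. The abstract does not specify the filter, so the three-point weighted average with periodic boundaries used below is purely an assumption chosen for simplicity, as is the function name. The sketch is not the filter from the paper, which treats the two-dimensional Euler equations.

```python
def split_residual(residual):
    """Split a periodic 1D residual into smooth and oscillatory parts.

    The smooth part comes from an assumed (1/4, 1/2, 1/4) averaging filter;
    the oscillatory part is whatever the filter removes.
    """
    n = len(residual)
    smooth = [
        0.25 * residual[(i - 1) % n] + 0.5 * residual[i] + 0.25 * residual[(i + 1) % n]
        for i in range(n)
    ]
    oscillatory = [r - s for r, s in zip(residual, smooth)]
    return smooth, oscillatory


if __name__ == "__main__":
    # A constant (smooth) mode plus the highest-frequency (+1, -1, ...) mode.
    residual = [2.0, 0.0, 2.0, 0.0]
    smooth, oscillatory = split_residual(residual)
    print(smooth)
    print(oscillatory)
```

In the scheme described above, the smooth part would feed the coarse grid problem and the oscillatory part a fine grid subproblem. The two are independent once the split is made, which is what allows them to proceed concurrently.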
Quantum realization of the bilinear interpolation method for NEQR.

PubMed

Zhou, Ri-Gui; Hu, Wenwen; Fan, Ping; Ian, Hou

2017-05-31

In recent years, quantum image processing has become one of the most active fields in quantum computation and quantum information. Image scaling, a kind of geometric image transformation, has been widely studied and applied in classical image processing; however, a quantum version of it does not yet exist. This paper is concerned with the feasibility of classical bilinear interpolation based on the novel enhanced quantum image representation (NEQR). First, the feasibility of bilinear interpolation for NEQR is proven. Then the concrete quantum circuits of bilinear interpolation for NEQR, including scaling up and scaling down, are given using the multiply Control-Not operation, a special add-one operation, the reverse parallel adder, the parallel subtractor, and multiplier and division operations. Finally, the complexity of the quantum network circuit is analyzed in terms of basic quantum gates. Simulation results show that the image scaled up using bilinear interpolation is clearer and less distorted than with nearest-neighbor interpolation.
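For orientation, the classical bilinear interpolation that the paper maps onto NEQR can be sketched as follows. This is a purely classical illustration, not the quantum circuit; the function name `bilinear_resize` and the corner-aligned coordinate mapping are assumptions, since the abstract does not specify them.

```python
def bilinear_resize(img, new_h, new_w):
    """Resize a 2-D grayscale image (a list of rows) by bilinear interpolation.

    Assumption: corner-aligned mapping, so the corner pixels of the output
    coincide with the corner pixels of the input.
    """
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        # Map the output row back to a fractional source row.
        y = i * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        dy = y - y0
        row = []
        for j in range(new_w):
            x = j * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            dx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out
```

Scaling the 2x2 image [[0, 10], [20, 30]] up to 3x3 places the average of the four corners at the centre, which is the behaviour that makes the result smoother than nearest-neighbor interpolation.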

Tempest: Accelerated MS/MS Database Search Software for Heterogeneous Computing Platforms.

PubMed

Adamo, Mark E; Gerber, Scott A

2016-09-07

MS/MS database search algorithms derive a set of candidate peptide sequences from in silico digest of a protein sequence database, and compute theoretical fragmentation patterns to match these candidates against observed MS/MS spectra. The original Tempest publication described these operations mapped to a CPU-GPU model, in which the CPU (central processing unit) generates peptide candidates that are asynchronously sent to a discrete GPU (graphics processing unit) to be scored against experimental spectra in parallel. The current version of Tempest expands this model, incorporating OpenCL to offer seamless parallelization across multicore CPUs, GPUs, integrated graphics chips, and general-purpose coprocessors. Three protocols describe how to configure and run a Tempest search, including discussion of how to leverage Tempest's unique feature set to produce optimal results. © 2016 by John Wiley & Sons, Inc.
Asynchronous multilevel adaptive methods for solving partial differential equations on multiprocessors - Performance results

NASA Technical Reports Server (NTRS)

Mccormick, S.; Quinlan, D.

1989-01-01

The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids (global and local) to provide adaptive resolution and fast solution of PDEs. Like all such methods, it offers parallelism by using possibly many disconnected patches per level, but is hindered by the need to handle these levels sequentially. The finest levels must therefore wait for processing to be essentially completed on all the coarser ones. A recently developed asynchronous version of FAC, called AFAC, completely eliminates this bottleneck to parallelism. This paper describes timing results for AFAC, coupled with a simple load balancing scheme, applied to the solution of elliptic PDEs on an Intel iPSC hypercube. These tests include performance of certain processes necessary in adaptive methods, including moving grids and changing refinement. A companion paper reports on numerical and analytical results for estimating convergence factors of AFAC applied to very large scale examples.
Proactive action preparation: seeing action preparation as a continuous and proactive process.

PubMed

Pezzulo, Giovanni; Ognibene, Dimitri

2012-07-01

In this paper, we aim to elucidate the processes that occur during action preparation from both a conceptual and a computational point of view. We first introduce the traditional, serial model of goal-directed action and discuss from a computational viewpoint its subprocesses occurring during the two phases of covert action preparation and overt motor control. Then, we discuss recent evidence indicating that these subprocesses are highly intertwined at representational and neural levels, which undermines the validity of the serial model and points instead to a parallel model of action specification and selection. Within the parallel view, we analyze the case of delayed choice, arguing that action preparation can be proactive, and preparatory processes can take place even before decisions are made. Specifically, we discuss how prior knowledge and prospective abilities can be used to maximize utility even before deciding what to do. To support our view, we present a computational implementation of (an approximated version of) proactive action preparation, showing its advantages in a simulated tennis-like scenario.
Scalable descriptive and correlative statistics with Titan.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Thompson, David C.; Pebay, Philippe Pierre

This report summarizes the existing statistical engines in VTK/Titan and presents the parallel versions thereof which have already been implemented. The ease of use of these parallel engines is illustrated by the means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; then, this theoretical property is verified with test runs that demonstrate optimal parallel speed-up with up to 200 processors.
Parallel/Vector Integration Methods for Dynamical Astronomy

NASA Astrophysics Data System (ADS)

Fukushima, T.

Progress in parallel/vector computers has driven us to develop suitable numerical integrators that utilize their computational power to the full extent while being independent of the size of the system to be integrated. Unfortunately, parallel versions of Runge-Kutta type integrators are known to be not very efficient. Recently we developed a parallel version of the extrapolation method (Ito and Fukushima 1997), which allows variable timesteps and still gives an acceleration factor of 3-4 for general problems. Meanwhile, the vector-mode usage of the Picard-Chebyshev method (Fukushima 1997a, 1997b) will lead to an acceleration factor of order 1000 for smooth problems such as planetary/satellite orbit integration. The success of the multiple-correction PECE mode of the time-symmetric implicit Hermitian integrator (Kokubo 1998) seems to shed new light on Milankar's so-called "pipelined predictor corrector method", which is expected to lead to an acceleration factor of 3-4. We will review these directions and discuss future prospects.
Parallel compression of data chunks of a shared data object using a log-structured file system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bent, John M.; Faibish, Sorin; Grider, Gary

2016-10-25

Techniques are provided for parallel compression of data chunks being written to a shared object. A client executing on a compute node or a burst buffer node in a parallel computing system stores a data chunk generated by the parallel computing system to a shared data object on a storage node by compressing the data chunk and providing the compressed data chunk to the storage node that stores the shared object. The client and storage node may employ Log-Structured File techniques. The compressed data chunk can be decompressed by the client when the data chunk is read. A storage node stores a data chunk as part of a shared object by receiving a compressed version of the data chunk from a compute node and storing the compressed version of the data chunk to the shared data object on the storage node.
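The write and read paths described above can be sketched in a few lines. This is a toy, single-process illustration under stated assumptions: the class `LogStructuredStore` and the functions `write_chunks` and `read_chunk` are hypothetical names, zlib stands in for whatever compressor is used, and a thread pool stands in for the separate compute nodes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class LogStructuredStore:
    """Toy storage node: an append-only log plus an index of chunk locations."""

    def __init__(self):
        self.log = bytearray()
        self.index = {}  # chunk_id -> (offset, length) within the log

    def append(self, chunk_id, compressed):
        # Log-structured write: data is only ever appended, never overwritten.
        self.index[chunk_id] = (len(self.log), len(compressed))
        self.log.extend(compressed)

    def read(self, chunk_id):
        offset, length = self.index[chunk_id]
        return bytes(self.log[offset:offset + length])


def write_chunks(store, chunks, workers=4):
    """Compress chunks concurrently on the client side, then append them."""
    ids = list(chunks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(lambda cid: zlib.compress(chunks[cid]), ids))
    for cid, blob in zip(ids, compressed):
        store.append(cid, blob)


def read_chunk(store, chunk_id):
    """Client-side read: fetch the compressed chunk and decompress it."""
    return zlib.decompress(store.read(chunk_id))
```

The storage node in this sketch never sees uncompressed data; compression and decompression both happen on the client, as in the description above.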
Parallel algorithms for modeling flow in permeable media. Annual report, February 15, 1995 - February 14, 1996

DOE Office of Scientific and Technical Information (OSTI.GOV)

G.A. Pope; K. Sephernoori; D.C. McKinney

1996-03-15

This report describes the application of distributed-memory parallel programming techniques to a compositional simulator called UTCHEM. The University of Texas Chemical Flooding reservoir simulator (UTCHEM) is a general-purpose vectorized chemical flooding simulator that models the transport of chemical species in three-dimensional, multiphase flow through permeable media. The parallel version of UTCHEM addresses solving large-scale problems by reducing the amount of time that is required to obtain the solution as well as providing a flexible and portable programming environment. In this work, the original parallel version of UTCHEM was modified and ported to CRAY T3D and CRAY T3E, distributed-memory, multiprocessor computers using CRAY-PVM as the interprocessor communication library. Also, the data communication routines were modified such that the portability of the original code across different computer architectures was made possible.
Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

NASA Technical Reports Server (NTRS)

Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Jost, Gabriele

2004-01-01

In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms, and discuss OpenMP implementation issues that affect the performance of multi-level parallel applications.
Parallel PAB3D: Experiences with a Prototype in MPI

NASA Technical Reports Server (NTRS)

Guerinoni, Fabio; Abdol-Hamid, Khaled S.; Pao, S. Paul

1998-01-01

PAB3D is a three-dimensional Navier Stokes solver that has gained acceptance in the research and industrial communities. It takes as its computational domain a set of disjoint blocks covering the physical domain. This is the first report on the implementation of PAB3D using the Message Passing Interface (MPI), a standard for parallel processing. We discuss briefly the characteristics of the code and define a prototype for testing. The principal data structure used for communication is derived from preprocessing "patching". We describe a simple interface (COMMSYS) for MPI communication, and some general techniques likely to be encountered when working on problems of this nature. Last, we identify levels of improvement from the current version and outline future work.
Concurrent Cuba

NASA Astrophysics Data System (ADS)

Hahn, T.

2016-10-01

The parallel version of the multidimensional numerical integration package Cuba is presented and achievable speed-ups discussed. The parallelization is based on the fork/wait POSIX functions, needs no extra software installed, imposes almost no constraints on the integrand function, and works largely automatically.
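The fork/wait pattern mentioned above can be sketched as follows. This is not Cuba's API or one of its integration algorithms: the function `parallel_integrate` is a hypothetical name, a plain one-dimensional midpoint rule stands in for the real integrators, and the code assumes a POSIX system where `os.fork` is available.

```python
import os
import struct


def parallel_integrate(f, a, b, n=100_000, workers=4):
    """Midpoint-rule integral of f over [a, b], split across forked children."""
    width = (b - a) / n
    children = []
    for k in range(workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: sum every `workers`-th sample, report it, and exit.
            status = 1
            try:
                os.close(read_fd)
                partial = sum(f(a + (i + 0.5) * width)
                              for i in range(k, n, workers))
                os.write(write_fd, struct.pack("d", partial))
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's loop
        # Parent: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    total = 0.0
    for pid, read_fd in children:
        total += struct.unpack("d", os.read(read_fd, 8))[0]
        os.close(read_fd)
        os.waitpid(pid, 0)  # reap the child process
    return total * width
```

Because each child is a forked copy of the parent, the integrand needs no special preparation (it is not serialized or sent anywhere), which is in the spirit of the "almost no constraints on the integrand function" remark above.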
Knowledge Support and Automation for Performance Analysis with PerfExplorer 2.0

DOE PAGES

Huck, Kevin A.; Malony, Allen D.; Shende, Sameer; ...

2008-01-01

The integration of scalable performance analysis in parallel development tools is difficult. The potential size of data sets and the need to compare results from multiple experiments present a challenge to manage and process the information. Simply to characterize the performance of parallel applications running on potentially hundreds of thousands of processor cores requires new scalable analysis techniques. Furthermore, many exploratory analysis processes are repeatable and could be automated, but are now implemented as manual procedures. In this paper, we will discuss the current version of PerfExplorer, a performance analysis framework which provides dimension reduction, clustering and correlation analysis of individual trials of large dimensions, and can perform relative performance analysis between multiple application executions. PerfExplorer analysis processes can be captured in the form of Python scripts, automating what would otherwise be time-consuming tasks. We will give examples of large-scale analysis results, and discuss the future development of the framework, including the encoding and processing of expert performance rules, and the increasing use of performance metadata.
Multiprocessing the Sieve of Eratosthenes

NASA Technical Reports Server (NTRS)

Bokhari, S.

1986-01-01

The Sieve of Eratosthenes for finding prime numbers in recent years has seen much use as a benchmark algorithm for serial computers while its intrinsically parallel nature has gone largely unnoticed. The implementation of a parallel version of this algorithm for a real parallel computer, the Flex/32, is described and its performance discussed. It is shown that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of the algorithm, it is impossible to get any speedup beyond 4 or 5 processors unless some form of dynamic load balancing is employed. We describe the performance of our algorithm with and without load balancing and compare it with theoretical lower bounds and simulated results. It is straightforward to understand this algorithm and to check the final results. However, its efficient implementation on a real parallel machine requires thoughtful design, especially if dynamic load balancing is desired. The fundamental operations required by the algorithm are very simple: this means that the slightest overhead appears prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor but is also an interesting challenge for the programmer.
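A segmented form of the sieve shows where the parallelism and the load-balancing question come from. The sketch below is not the Flex/32 implementation; the function names and the segment size are assumptions. It uses a thread pool for simplicity, so in CPython it illustrates the decomposition rather than delivering a real speedup; a `ProcessPoolExecutor` could be substituted for CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from math import isqrt


def base_primes(limit):
    """Serial sieve for the primes up to `limit` (inclusive)."""
    if limit < 2:
        return []
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, isqrt(limit) + 1):
        if flags[p]:
            # Clear every multiple of p starting at p*p.
            flags[p * p::p] = bytes(len(range(p * p, limit + 1, p)))
    return [i for i, is_prime in enumerate(flags) if is_prime]


def primes_in_segment(task):
    """Worker: return the primes in [lo, hi) using the shared base primes."""
    lo, hi, primes = task
    flags = bytearray([1]) * (hi - lo)
    for p in primes:
        if p * p >= hi:
            break
        # First multiple of p inside the segment (at least p*p).
        start = max(p * p, (lo + p - 1) // p * p)
        for m in range(start, hi, p):
            flags[m - lo] = 0
    return [lo + i for i, is_prime in enumerate(flags) if is_prime]


def parallel_sieve(n, segment_size=10_000, workers=4):
    """Return all primes <= n, sieving fixed-size segments concurrently."""
    if n < 2:
        return []
    primes = base_primes(isqrt(n))
    first = isqrt(n) + 1
    tasks = [(lo, min(lo + segment_size, n + 1), primes)
             for lo in range(first, n + 1, segment_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Segments are taken from a queue as workers become free, which is
        # a simple form of dynamic load balancing.
        found = [p for segment in pool.map(primes_in_segment, tasks)
                 for p in segment]
    return primes + found
```

The per-segment work is only a handful of simple marking operations, which is why, as the abstract notes, even small scheduling or signaling overheads show up prominently in timing data.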
Integrating Cache Performance Modeling and Tuning Support in Parallelization Tools

NASA Technical Reports Server (NTRS)

Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)

1998-01-01

With the resurgence of distributed shared memory (DSM) systems based on cache-coherent Non Uniform Memory Access (ccNUMA) architectures and increasing disparity between memory and processor speeds, data locality overheads are becoming the greatest bottlenecks in the way of realizing potential high performance of these systems. While parallelization tools and compilers help users port their sequential applications to a DSM system, a lot of time and effort is needed to tune the memory performance of these applications to achieve reasonable speedup. In this paper, we show that integrating cache performance modeling and tuning support within a parallelization environment can alleviate this problem. The Cache Performance Modeling and Prediction Tool (CPMP) employs trace-driven simulation techniques without the overhead of generating and managing detailed address traces. CPMP predicts the cache performance impact of source code level "what-if" modifications in a program to assist a user in the tuning process. CPMP is built on top of a customized version of the Computer Aided Parallelization Tools (CAPTools) environment. Finally, we demonstrate how CPMP can be applied to tune a real Computational Fluid Dynamics (CFD) application.
P-HARP: A parallel dynamic spectral partitioner

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sohn, A.; Biswas, R.; Simon, H.D.

1997-05-01

Partitioning unstructured graphs is central to the parallel solution of problems in computational science and engineering. The authors have introduced earlier the sequential version of an inertial spectral partitioner called HARP which maintains the quality of recursive spectral bisection (RSB) while forming the partitions an order of magnitude faster than RSB. The serial HARP is known to be the fastest spectral partitioner to date, three to four times faster than similar partitioners on a variety of meshes. This paper presents a parallel version of HARP, called P-HARP. Two types of parallelism have been exploited: loop level parallelism and recursive parallelism. P-HARP has been implemented in MPI on the SGI/Cray T3E and the IBM SP2. Experimental results demonstrate that P-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.25 seconds on a 64-processor T3E. Experimental results further show that P-HARP can give nearly a 20-fold speedup on 64 processors. These results indicate that graph partitioning is no longer a major bottleneck that hinders the advancement of computational science and engineering for dynamically-changing real-world applications.
Multiprocessor smalltalk: Implementation, performance, and analysis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pallas, J.I.

1990-01-01

Multiprocessor Smalltalk demonstrates the value of object-oriented programming on a multiprocessor. Its implementation and analysis shed light on three areas: concurrent programming in an object-oriented language without special extensions, implementation techniques for adapting to multiprocessors, and performance factors in the resulting system. Adding parallelism to Smalltalk code is easy, because programs already use control abstractions like iterators. Smalltalk's basic control and concurrency primitives (lambda expressions, processes and semaphores) can be used to build parallel control abstractions, including parallel iterators, parallel objects, atomic objects, and futures. Language extensions for concurrency are not required. This implementation demonstrates that it is possible to build an efficient parallel object-oriented programming system and illustrates techniques for doing so. Three modification tools (serialization, replication, and reorganization) adapted the Berkeley Smalltalk interpreter to the Firefly multiprocessor. Multiprocessor Smalltalk's performance shows that the combination of multiprocessing and object-oriented programming can be effective: speedups (relative to the original serial version) exceed 2.0 for five processors on all the benchmarks; the median efficiency is 48%. Analysis shows both where performance is lost and how to improve and generalize the experimental results. Changes in the interpreter to support concurrency add at most 12% overhead; better access to per-process variables could eliminate much of that. Changes in the user code to express concurrency add as much as 70% overhead; this overhead could be reduced to 54% if blocks (lambda expressions) were reentrant. Performance is also lost when the program cannot keep all five processors busy.
GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing

NASA Astrophysics Data System (ADS)

Johl, John T.; Baker, Nick C.

1988-10-01

The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.
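The peak figures quoted above can be checked with a short calculation. The sketch assumes that each processor completes one operation per clock cycle and that "BOPS" means billions of operations per second; neither is stated explicitly in the abstract, and the function names are illustrative.

```python
CLOCK_HZ = 200e6        # 200 MHz clock rate, from the abstract
WORD_PERIOD_S = 10e-9   # each bus moves one word every 10 ns, from the abstract


def peak_ops_per_second(processors, ops_per_cycle=1):
    """Peak rate, assuming each processor retires `ops_per_cycle` per clock."""
    return processors * CLOCK_HZ * ops_per_cycle


def bus_words_per_second(buses=3):
    """Aggregate vector-memory bandwidth in words per second."""
    return buses / WORD_PERIOD_S
```

Under these assumptions eight processors give 8 x 200 MHz = 1.6 billion operations per second, matching the quoted peak, and four processors give half that. The three buses together move about 300 million words per second, which is the memory bandwidth that has to be balanced against the computation rate.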
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors

NASA Astrophysics Data System (ADS)

Linderman, R.; Spetka, S.; Fitzgerald, D.; Emeny, S.

The Physically-Constrained Iterative Deconvolution (PCID) image deblurring code is being ported to heterogeneous networks of multi-core systems, including Intel Xeons and IBM Cell Broadband Engines. This paper reports results from experiments using the JAWS supercomputer at MHPCC (60 TFLOPS of dual-dual Xeon nodes linked with Infiniband) and the Cell Cluster at AFRL in Rome, NY. The Cell Cluster has 52 TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes Infiniband, 10 Gigabit Ethernet and 1 Gigabit Ethernet to each of the 336 PS3s. The results compare approaches to parallelizing FFT executions across the Xeons and the Cell's Synergistic Processing Elements (SPEs) for frame-level image processing. The experiments included Intel's Performance Primitives and Math Kernel Library, FFTW3.2, and Carnegie Mellon's SPIRAL. Optimization of FFTs in the PCID code led to a decrease in relative processing time for FFTs. Profiling PCID version 6.2, about one year ago, showed the 13 functions that accounted for the highest percentage of processing were all FFT processing functions. They accounted for over 88% of processing time in one run on Xeons. FFT optimizations led to improvement in the current PCID version 8.0. A recent profile showed that only two of the 19 functions with the highest processing time were FFT processing functions. Timing measurements showed that FFT processing for PCID version 8.0 has been reduced to less than 19% of overall processing time. We are working toward a goal of scaling to 200-400 cores per job (1-2 imagery frames/core). Running a pair of cores on each set of frames reduces latency by implementing parallel FFT processing. Our current results show scaling well out to 100 pairs of cores. 
These results support the next higher level of parallelism in PCID, where groups of several hundred frames, each producing one resolved image, are sent to cliques of several hundred cores in a round robin fashion. Current efforts toward further performance enhancement for PCID are shifting toward using the Playstations in conjunction with the Xeons to take advantage of outstanding price/performance as well as the Flops/Watt cost advantage. We are fine-tuning the PCID parallelization strategy to balance processing over Xeons and Cell BEs to find an optimal partitioning of PCID over the heterogeneous processors. A high performance information management system that exploits native Infiniband multicast is used to improve latency among the head nodes. Using a publication/subscription oriented information management system to implement a unified communications platform makes runs on large HPCs with thousands of intercommunicating cores more flexible and more fault tolerant. It features a loose coupling of publishers to subscribers through intervening brokers. We are also working on enhancing performance for both Xeons and Cell BEs by moving selected operations to single precision. Techniques for adapting the code to single precision and performance results are reported.
User's guide of TOUGH2-EGS-MP: A Massively Parallel Simulator with Coupled Geomechanics for Fluid and Heat Flow in Enhanced Geothermal Systems VERSION 1.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Xiong, Yi; Fakcharoenphol, Perapon; Wang, Shihao

2013-12-01

TOUGH2-EGS-MP is a parallel numerical simulation program coupling geomechanics with fluid and heat flow in fractured and porous media, and is applicable for simulation of enhanced geothermal systems (EGS). TOUGH2-EGS-MP is based on the TOUGH2-MP code, the massively parallel version of TOUGH2. In TOUGH2-EGS-MP, the fully-coupled flow-geomechanics model is developed from linear elastic theory for thermo-poro-elastic systems and is formulated in terms of mean normal stress as well as pore pressure and temperature. Reservoir rock properties such as porosity and permeability depend on rock deformation, and the relationships between these two, obtained from poro-elasticity theories and empirical correlations, are incorporated into the simulation. This report provides the user with detailed information on the TOUGH2-EGS-MP mathematical model and instructions for using it for Thermal-Hydrological-Mechanical (THM) simulations. The mathematical model includes the fluid and heat flow equations, geomechanical equation, and discretization of those equations. In addition, the parallel aspects of the code, such as domain partitioning and communication between processors, are also included. Although TOUGH2-EGS-MP has the capability for simulating fluid and heat flows coupled with geomechanical effects, it is up to the user to select the specific coupling process, such as THM or only TH, in a simulation. There are several example problems illustrating applications of this program. These example problems are described in detail and their input data are presented. Their results demonstrate that this program can be used for field-scale geothermal reservoir simulation in porous and fractured media with fluid and heat flow coupled with geomechanical effects.
First Applications of the New Parallel Krylov Solver for MODFLOW on a National and Global Scale

NASA Astrophysics Data System (ADS)

Verkaik, J.; Hughes, J. D.; Sutanudjaja, E.; van Walsum, P.

2016-12-01

Integrated high-resolution hydrologic models are increasingly being used for evaluating water management measures at field scale. Their drawbacks are large memory requirements and long run times. Examples of such models are The Netherlands Hydrological Instrument (NHI) model and the PCRaster Global Water Balance (PCR-GLOBWB) model. Typical simulation periods are 30-100 years with daily timesteps. The NHI model predicts water demands in periods of drought, supporting operational and long-term water-supply decisions. The NHI is a state-of-the-art coupling of several models: a 7-layer MODFLOW groundwater model (6.5M 250m cells), a MetaSWAP model for the unsaturated zone (Richards emulator of 0.5M cells), and a surface water model (MOZART-DM). The PCR-GLOBWB model provides a grid-based representation of global terrestrial hydrology and this work uses the version that includes a 2-layer MODFLOW groundwater model (4.5M 10km cells). The Parallel Krylov Solver (PKS) speeds up computation by both distributed memory parallelization (Message Passing Interface) and shared memory parallelization (Open Multi-Processing). PKS includes conjugate gradient, bi-conjugate gradient stabilized, and generalized minimal residual linear accelerators that use an overlapping additive Schwarz domain decomposition preconditioner. PKS can be used for both structured and unstructured grids and has been fully integrated in MODFLOW-USG using METIS partitioning and in iMODFLOW using RCB partitioning. iMODFLOW is an accelerated version of MODFLOW-2005 that is implicitly and online coupled to MetaSWAP. Results for benchmarks carried out on the Cartesius Dutch supercomputer (https://userinfo.surfsara.nl/systems/cartesius) for the PCR-GLOBWB model and on a 2x16 core Windows machine for the NHI model show speedups up to 10-20 and 5-10, respectively.
A Tutorial on Parallel and Concurrent Programming in Haskell

NASA Astrophysics Data System (ADS)

Peyton Jones, Simon; Singh, Satnam

This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.

gpuPOM: a GPU-based Princeton Ocean Model

NASA Astrophysics Data System (ADS)

Xu, S.; Huang, X.; Zhang, Y.; Fu, H.; Oey, L.-Y.; Xu, F.; Yang, G.

2014-11-01

Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration works for climate models are doing partial code porting for certain hot spots, and can only achieve limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point, design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of the state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We take several accelerating methods to further improve the performance of gpuPOM, including optimizing memory access in a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the hybrid Central Processing Unit (CPU) and the GPU. Our experimental results indicate that the performance of the gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores and it reduces the energy consumption by 6.8 times.
A Comparison of Automatic Parallelization Tools/Compilers on the SGI Origin 2000 Using the NAS Benchmarks

NASA Technical Reports Server (NTRS)

Saini, Subhash; Frumkin, Michael; Hribar, Michelle; Jin, Hao-Qiang; Waheed, Abdul; Yan, Jerry

1998-01-01

Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is extremely time consuming and costly, porting codes would ideally be automated by using some parallelization tools and compilers. In this paper, we compare the performance of the hand-written NAS Parallel Benchmarks against three parallel versions generated with the help of tools and compilers: 1) CAPTools, an interactive computer-aided parallelization tool that generates message passing code; 2) the Portland Group's HPF compiler; and 3) compiler directives with the native FORTRAN77 compiler on the SGI Origin2000.
Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm

NASA Astrophysics Data System (ADS)

Backes, Werner; Wetzel, Susanne

In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schnorr-Euchner algorithm for today’s multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread version and about 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.
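The reported speed-ups can be put in context with two standard measures: parallel efficiency (speed-up divided by thread count) and the Karp-Flatt experimentally determined serial fraction. Neither measure is mentioned in the abstract; the sketch below simply applies them to the quoted figures, and the function names are illustrative.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / workers


def karp_flatt_serial_fraction(speedup, workers):
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p), for p > 1."""
    return (1.0 / speedup - 1.0 / workers) / (1.0 - 1.0 / workers)
```

For the figures above this gives an efficiency of roughly 0.9 with two threads and 0.8 with four, and a serial fraction of roughly 0.11 and 0.08 respectively.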
Parallel and Efficient Sensitivity Analysis of Microscopy Image Segmentation Workflows in Hybrid Systems

PubMed Central

Barreiros, Willian; Teodoro, George; Kurc, Tahsin; Kong, Jun; Melo, Alba C. M. A.; Saltz, Joel

2017-01-01

We investigate efficient sensitivity analysis (SA) of algorithms that segment and classify image features in a large dataset of high-resolution images. Algorithm SA is the process of evaluating variations of methods and parameter values to quantify differences in the output. An SA can be very compute demanding because it requires re-processing the input dataset several times with different parameters to assess variations in output. In this work, we introduce strategies to efficiently speed up SA via runtime optimizations targeting distributed hybrid systems and reuse of computations from runs with different parameters. We evaluate our approach using a cancer image analysis workflow on a hybrid cluster with 256 nodes, each with an Intel Phi and a dual socket CPU. The SA attained a parallel efficiency of over 90% on 256 nodes. The cooperative execution using the CPUs and the Phi available in each node with smart task assignment strategies resulted in an additional speedup of about 2×. Finally, multi-level computation reuse led to an additional speedup of up to 2.46× on the parallel version. The level of performance attained with the proposed optimizations will allow the use of SA in large-scale studies. PMID:29081725
LAURA Users Manual: 5.3-48528

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Christopher O.; Kleb, Bil

2010-01-01

This users manual provides in-depth information concerning installation and execution of LAURA, version 5. LAURA is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 LAURA code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintainability by eliminating the requirement for problem-dependent re-compilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multi-physics coupling. As a result, LAURA now shares gas-physics modules, MPI modules, and other low-level modules with the FUN3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU-standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.
LAURA Users Manual: 5.5-64987

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Christopher O.; Kleb, William L.

2013-01-01

This users manual provides in-depth information concerning installation and execution of LAURA, version 5. LAURA is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 LAURA code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintainability by eliminating the requirement for problem-dependent recompilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multi-physics coupling. As a result, LAURA now shares gas-physics modules, MPI modules, and other low-level modules with the Fun3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU-standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.
LAURA Users Manual: 5.4-54166

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Christopher O.; Kleb, Bil

2011-01-01

This users manual provides in-depth information concerning installation and execution of Laura, version 5. Laura is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 Laura code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintainability by eliminating the requirement for problem dependent re-compilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multi-physics coupling. As a result, Laura now shares gas-physics modules, MPI modules, and other low-level modules with the Fun3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU-standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.
LAURA Users Manual: 5.2-43231

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Christopher O.; Kleb, Bil

2009-01-01

This users manual provides in-depth information concerning installation and execution of LAURA, version 5. LAURA is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 LAURA code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintainability by eliminating the requirement for problem-dependent re-compilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multiphysics coupling. As a result, LAURA now shares gas-physics modules, MPI modules, and other low-level modules with the FUN3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU-standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.
Laura Users Manual: 5.1-41601

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Christopher O.; Kleb, Bil

2009-01-01

This users manual provides in-depth information concerning installation and execution of LAURA, version 5. LAURA is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 LAURA code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintainability by eliminating the requirement for problem-dependent re-compilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multiphysics coupling. As a result, LAURA now shares gas-physics modules, MPI modules, and other low-level modules with the FUN3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU-standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.
Tool for Rapid Analysis of Monte Carlo Simulations

NASA Technical Reports Server (NTRS)

Restrepo, Carolina; McCall, Kurt E.; Hurtado, John E.

2013-01-01

Designing a spacecraft, or any other complex engineering system, requires extensive simulation and analysis work. Oftentimes, the large amounts of simulation data generated are very difficult and time consuming to analyze, with the added risk of overlooking potentially critical problems in the design. The authors have developed a generic data analysis tool that can quickly sort through large data sets and point an analyst to the areas in the data set that cause specific types of failures. The first version of this tool was a serial code and the current version is a parallel code, which has greatly increased the analysis capabilities. This paper describes the new implementation of this analysis tool on a graphical processing unit, and presents analysis results for NASA's Orion Monte Carlo data to demonstrate its capabilities.
A numerical differentiation library exploiting parallel architectures

NASA Astrophysics Data System (ADS)

Voglis, C.; Hadjidoukas, P. E.; Lagaris, I. E.; Papageorgiou, D. G.

2009-08-01

We present a software library for numerically estimating first and second order partial derivatives of a function by finite differencing. Various truncation schemes are offered resulting in corresponding formulas that are accurate to order O(h), O(h^2), and O(h^4), h being the differencing step. The derivatives are calculated via forward, backward and central differences. Care has been taken that only feasible points are used in the case where bound constraints are imposed on the variables. The Hessian may be approximated either from function or from gradient values. There are three versions of the software: a sequential version, an OpenMP version for shared memory architectures and an MPI version for distributed systems (clusters). The parallel versions exploit the multiprocessing capability offered by computer clusters, as well as modern multi-core systems, and, due to the independent character of the derivative computation, the speedup scales almost linearly with the number of available processors/cores.
Program summary
Program title: NDL (Numerical Differentiation Library)
Catalogue identifier: AEDG_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDG_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 73 030
No. of bytes in distributed program, including test data, etc.: 630 876
Distribution format: tar.gz
Programming language: ANSI FORTRAN-77, ANSI C, MPI, OPENMP
Computer: Distributed systems (clusters), shared memory systems
Operating system: Linux, Solaris
Has the code been vectorised or parallelized?: Yes
RAM: The library uses O(N) internal storage, N being the dimension of the problem
Classification: 4.9, 4.14, 6.5
Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, etc. The parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems.
Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries.
Restrictions: The library uses only double precision arithmetic.
Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed.
Running time: Running time depends on the function's complexity. The test run took 15 ms for the serial distribution, 0.6 s for the OpenMP and 4.2 s for the MPI parallel distribution on 2 processors.
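To make the difference between forward and central differencing concrete, here is a small sketch that estimates a first derivative both ways and prints the error for shrinking step sizes. It does not use NDL; the function names and the test function (sine) are assumptions chosen for illustration. The forward formula is first-order accurate in h and the central formula second-order.

```python
import math


def forward_diff(f, x, h):
    """First-order accurate estimate of f'(x) from f(x) and f(x + h)."""
    return (f(x + h) - f(x)) / h


def central_diff(f, x, h):
    """Second-order accurate estimate of f'(x) from f(x - h) and f(x + h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x0 = 1.0
exact = math.cos(x0)  # derivative of sin at x0

errors = {}
for h in (1e-1, 1e-2, 1e-3):
    err_fwd = abs(forward_diff(math.sin, x0, h) - exact)
    err_cen = abs(central_diff(math.sin, x0, h) - exact)
    errors[h] = (err_fwd, err_cen)
    print(f"h={h:g}  forward error={err_fwd:.2e}  central error={err_cen:.2e}")
```

Reducing h by a factor of ten cuts the forward-difference error by about ten and the central-difference error by about a hundred, until round-off error starts to dominate at much smaller steps.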
Research in computer science

NASA Technical Reports Server (NTRS)

Ortega, J. M.

1986-01-01

Various graduate research activities in the field of computer science are reported. Among the topics discussed are: (1) failure probabilities in multi-version software; (2) Gaussian elimination on parallel computers; (3) three-dimensional Poisson solvers on parallel/vector computers; (4) automated task decomposition for multiple robot arms; (5) multi-color incomplete Cholesky conjugate gradient methods on the Cyber 205; and (6) parallel implementation of iterative methods for solving linear equations.
The openGL visualization of the 2D parallel FDTD algorithm

NASA Astrophysics Data System (ADS)

Walendziuk, Wojciech

2005-02-01

This paper presents a way of visualizing a two-dimensional version of a parallel algorithm for the FDTD method. The visualization module was created on the basis of the OpenGL graphics standard with the use of the GLUT interface. In addition, the work reports the efficiency of the parallel algorithm in the form of speedup charts.
Hybrid parallelization of the XTOR-2F code for the simulation of two-fluid MHD instabilities in tokamaks

NASA Astrophysics Data System (ADS)

Marx, Alain; Lütjens, Hinrich

2017-03-01

A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than that of the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it allows simulations to be performed at higher resolutions that were previously out of reach because of memory limitations.
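A matrix-free Newton-Krylov method never assembles the Jacobian; the Krylov solver only needs Jacobian-vector products, which can be approximated by differencing the residual. The sketch below shows that single building block on a toy two-equation system. The residual, the step size `eps`, and all names are assumptions for illustration and have nothing to do with the XTOR-2F equations or its preconditioner.

```python
def residual(u):
    """Toy nonlinear residual F(u); a hypothetical stand-in, not the MHD system."""
    x, y = u
    return [x * x + y - 3.0, x + y * y - 5.0]


def jacobian_vector_product(F, u, v, eps=1e-7):
    """Approximate J(u) @ v by a forward difference of F; J is never formed."""
    base = F(u)
    shifted = F([ui + eps * vi for ui, vi in zip(u, v)])
    return [(s - b) / eps for s, b in zip(shifted, base)]


u = [1.0, 2.0]
v = [0.5, -1.0]

approx = jacobian_vector_product(residual, u, v)

# Analytic check: J = [[2x, 1], [1, 2y]] for this toy residual.
exact = [2.0 * u[0] * v[0] + v[1], v[0] + 2.0 * u[1] * v[1]]
print(approx, exact)
```

Each product costs one extra residual evaluation, which is why the approach suits codes where forming or storing the full Jacobian would be impractical.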
Building a Snow Data Management System using Open Source Software (and IDL)

NASA Astrophysics Data System (ADS)

Goodale, C. E.; Mattmann, C. A.; Ramirez, P.; Hart, A. F.; Painter, T.; Zimdars, P. A.; Bryant, A.; Brodzik, M.; Skiles, M.; Seidel, F. C.; Rittger, K. E.

2012-12-01

At NASA's Jet Propulsion Laboratory, free and open source software is used every day to support a wide range of projects, from planetary to climate to research and development. In this abstract I will discuss the key role that open source software has played in building a robust science data processing pipeline for snow hydrology research, and how the system is also able to leverage programs written in IDL, making JPL's Snow Data System a hybrid of open source and proprietary software.
Main Points:
- The design of the Snow Data System (illustrating how the collection of sub-systems is combined to create a complete data processing pipeline)
- The challenges of moving from a single algorithm on a laptop to running hundreds of parallel algorithms on a cluster of servers (lessons learned)
- Code changes
- Software-license-related challenges
- Storage requirements
- System evolution (from data archiving, to data processing, to data on a map, to near-real-time products and maps)
- Road map for the next 6 months (including how easily we re-used the snowDS code base to support the Airborne Snow Observatory Mission)
Software in Use and their Software Licenses:
IDL - Used for pre- and post-processing of data. Licensed under a proprietary software license held by Exelis.
Apache OODT - Used for data management and workflow processing. Licensed under the Apache License Version 2.
GDAL - Geospatial data processing library currently used for data re-projection. Licensed under the X/MIT license.
GeoServer - WMS server. Licensed under the General Public License Version 2.0.
Leaflet.js - JavaScript web mapping library. Licensed under the Berkeley Software Distribution License.
Python - Glue code and miscellaneous data processing support. Licensed under the Python Software Foundation License.
Perl - Script wrapper for running the SCAG algorithm. Licensed under the General Public License Version 3.
PHP - Front-end web application programming. Licensed under the PHP License Version 3.01.
Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi

Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

DOE PAGES

Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; ...

2016-06-01

Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
Automation of a Wave-Optics Simulation and Image Post-Processing Package on Riptide

NASA Astrophysics Data System (ADS)

Werth, M.; Lucas, J.; Thompson, D.; Abercrombie, M.; Holmes, R.; Roggemann, M.

Detailed wave-optics simulations and image post-processing algorithms are computationally expensive and benefit from the massively parallel hardware available at supercomputing facilities. We created an automated system that interfaces with the Maui High Performance Computing Center (MHPCC) Distributed MATLAB® Portal interface to submit massively parallel wave-optics simulations to the IBM iDataPlex (Riptide) supercomputer. This system subsequently post-processes the output images with an improved version of physically constrained iterative deconvolution (PCID) and analyzes the results using a series of modular algorithms written in Python. With this architecture, a single person can simulate thousands of unique scenarios and produce analyzed, archived, and briefing-compatible output products with very little effort. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
Parallelization of a Monte Carlo particle transport simulation code

NASA Astrophysics Data System (ADS)

Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

2010-05-01

We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language to improve code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem sizes, which are limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow the study of higher particle energies with the use of more accurate physical models, and improve statistics because more particle tracks can be simulated with a short response time.
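A common way to keep a parallel Monte Carlo run comparable with a serial one is to give each batch of particles its own random stream, keyed by the batch rather than by the worker. The sketch below shows that structure only. It is not MC4: the exponential path-length model, the seeds, and all names are assumptions, and seeding Python's generator with consecutive integers does not provide the stream-independence guarantees of libraries such as SPRNG or DCMT.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_batch(stream_id: int, n_particles: int, mean_free_path: float) -> float:
    """Sum of exponentially distributed path lengths for one batch of particles."""
    rng = random.Random(1000 + stream_id)  # private generator, keyed by batch
    return sum(rng.expovariate(1.0 / mean_free_path) for _ in range(n_particles))


def mean_path(n_batches: int, n_per_batch: int, mean_free_path: float, workers: int) -> float:
    """Average path length over all batches, computed with a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in batch order regardless of completion order.
        totals = list(
            pool.map(
                lambda i: simulate_batch(i, n_per_batch, mean_free_path),
                range(n_batches),
            )
        )
    return sum(totals) / (n_batches * n_per_batch)


serial = mean_path(8, 5000, 2.0, workers=1)
parallel = mean_path(8, 5000, 2.0, workers=4)
print(serial, parallel)
```

Because the streams belong to batches, the estimate does not depend on how many workers are used. Python threads are used here only to show the decomposition; they would not give the near-linear speedup reported for the MPI code.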
Architectures for reasoning in parallel

NASA Technical Reports Server (NTRS)

Hall, Lawrence O.

1989-01-01

The research conducted has dealt with rule-based expert systems. The algorithms that may lead to effective parallelization of them were investigated. Both the forward and backward chained control paradigms were investigated in the course of this work. The best computer architecture for the developed and investigated algorithms has been researched. Two experimental vehicles were developed to facilitate this research. They are Backpac, a parallel backward chained rule-based reasoning system, and Datapac, a parallel forward chained rule-based reasoning system. Both systems have been written in Multilisp, a version of Lisp which contains the parallel construct, future. Applying the future function to a function causes the function to become a task parallel to the spawning task. Additionally, Backpac and Datapac have been run on several disparate parallel processors. The machines are an Encore Multimax with 10 processors, the Concert Multiprocessor with 64 processors, and a 32 processor BBN GP1000. Both the Concert and the GP1000 are switch-based machines. The Multimax has all its processors hung off a common bus. All are shared memory machines, but have different schemes for sharing the memory and different locales for the shared memory. The main results of the investigations come from experiments on the 10 processor Encore and the Concert with partitions of 32 or fewer processors. Additionally, experiments have been run with a stripped down version of EMYCIN.
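The future construct described above has a rough analogue in Python's `concurrent.futures`: submitting a call returns a placeholder at once while the call runs as a separate task. The sketch below checks the premises of one rule in parallel. The facts, the rule, and the helper `holds` are hypothetical and are not taken from Backpac or Datapac; also note that Python requires an explicit `result()` call, whereas a Multilisp future is resolved implicitly when its value is used.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical working memory and a single rule: premises -> conclusion.
facts = {"fever", "cough"}
premises, conclusion = ["fever", "cough"], "flu_suspected"


def holds(premise: str) -> bool:
    """Stand-in for a potentially expensive premise check."""
    return premise in facts


with ThreadPoolExecutor(max_workers=2) as pool:
    # Each submit() returns immediately with a Future for the spawned task.
    pending = [pool.submit(holds, p) for p in premises]
    # Asking for a result blocks until that particular task has finished.
    fired = all(f.result() for f in pending)

if fired:
    facts.add(conclusion)
print(sorted(facts))
```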

The Tera Multithreaded Architecture and Unstructured Meshes

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.; Mavriplis, Dimitri J.

1998-01-01

The Tera Multithreaded Architecture (MTA) is a new parallel supercomputer currently being installed at San Diego Supercomputing Center (SDSC). This machine has an architecture quite different from contemporary parallel machines. The computational processor is a custom design and the machine uses hardware to support very fine grained multithreading. The main memory is shared, hardware randomized and flat. These features make the machine highly suited to the execution of unstructured mesh problems, which are difficult to parallelize on other architectures. We report the results of a study carried out during July-August 1998 to evaluate the execution of EUL3D, a code that solves the Euler equations on an unstructured mesh, on the 2 processor Tera MTA at SDSC. Our investigation shows that parallelization of an unstructured code is extremely easy on the Tera. We were able to get an existing parallel code (designed for a shared memory machine), running on the Tera by changing only the compiler directives. Furthermore, a serial version of this code was compiled to run in parallel on the Tera by judicious use of directives to invoke the "full/empty" tag bits of the machine to obtain synchronization. This version achieves 212 and 406 Mflop/s on one and two processors respectively, and requires no attention to partitioning or placement of data issues that would be of paramount importance in other parallel architectures.
Hyperswitch Communication Network Computer

NASA Technical Reports Server (NTRS)

Peterson, John C.; Chow, Edward T.; Priel, Moshe; Upchurch, Edwin T.

1993-01-01

Hyperswitch Communications Network (HCN) computer is prototype multiple-processor computer being developed. Incorporates improved version of hyperswitch communication network described in "Hyperswitch Network For Hypercube Computer" (NPO-16905). Designed to support high-level software and expansion of itself. HCN computer is message-passing, multiple-instruction/multiple-data computer offering significant advantages over older single-processor and bus-based multiple-processor computers, with respect to price/performance ratio, reliability, availability, and manufacturing. Design of HCN operating-system software provides flexible computing environment accommodating both parallel and distributed processing. Also achieves balance among following competing factors: performance in processing and communications, ease of use, and tolerance of (and recovery from) faults.
Massively Parallel Assimilation of TOGA/TAO and Topex/Poseidon Measurements into a Quasi Isopycnal Ocean General Circulation Model Using an Ensemble Kalman Filter

NASA Technical Reports Server (NTRS)

Keppenne, Christian L.; Rienecker, Michele; Borovikov, Anna Y.; Suarez, Max

1999-01-01

A massively parallel ensemble Kalman filter (EnKF) is used to assimilate temperature data from the TOGA/TAO array and altimetry from TOPEX/POSEIDON into a Pacific basin version of the NASA Seasonal to Interannual Prediction Project (NSIPP)'s quasi-isopycnal ocean general circulation model. The EnKF is an approximate Kalman filter in which the error-covariance propagation step is modeled by the integration of multiple instances of a numerical model. An estimate of the true error covariances is then inferred from the distribution of the ensemble of model state vectors. This implementation of the filter takes advantage of the inherent parallelism in the EnKF algorithm by running all the model instances concurrently. The Kalman filter update step also occurs in parallel by having each processor process the observations that occur in the region of physical space for which it is responsible. The massively parallel data assimilation system is validated by withholding some of the data and then quantifying the extent to which the withheld information can be inferred from the assimilation of the remaining data. The distributions of the forecast and analysis error covariances predicted by the EnKF are also examined.
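The update step of an ensemble Kalman filter can be shown in a few lines for the simplest possible case: a single scalar state that is observed directly. This is a minimal sketch of the widely used perturbed-observation form, assuming made-up numbers for the forecast ensemble, the observation, and its error variance. It is not the ocean-model implementation described above, which works on full model state vectors distributed over processors.

```python
import random
from statistics import mean, variance


def enkf_update(forecast, observation, obs_variance, rng):
    """Perturbed-observation EnKF analysis for a directly observed scalar state."""
    p_forecast = variance(forecast)  # forecast error variance from the ensemble
    gain = p_forecast / (p_forecast + obs_variance)  # Kalman gain, observation operator = 1
    obs_std = obs_variance ** 0.5
    # Each member assimilates its own noisy copy of the observation.
    return [
        x + gain * (observation + rng.gauss(0.0, obs_std) - x)
        for x in forecast
    ]


rng = random.Random(42)
# Hypothetical forecast ensemble: 200 members scattered around 20.0.
forecast = [rng.gauss(20.0, 1.0) for _ in range(200)]
analysis = enkf_update(forecast, observation=25.0, obs_variance=0.25, rng=rng)

print(f"forecast mean {mean(forecast):.2f}, variance {variance(forecast):.2f}")
print(f"analysis mean {mean(analysis):.2f}, variance {variance(analysis):.2f}")
```

The analysis ensemble is pulled toward the observation and its spread shrinks. Since every member is updated independently once the gain is known, the loop over members is the part that parallelizes naturally.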
Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms

NASA Astrophysics Data System (ADS)

Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel

2016-04-01

Advances in graphics processing units' technology towards encompassing parallel architectures [1], comprised of thousands of cores and multiples of parallel threads, provide the foundation in terms of hardware for the rapid processing of various parallel applications regarding seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are being established and decades of data are being compiled together [2]. Yet, many processes regarding seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, independent of one another, can be performed in parallel, narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using Cuda C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel for comparison, are: fuzzy k-means clustering with expert knowledge [7] in assigning the overall number of clusters; density-based clustering [8]; and a self-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation.
Indexing terms: GPU parallel programming, Cuda C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering
References
[1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufmann Publishers, 2013
[2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008
[3] Papadakis, S. and Diamantaras, K.: 'Programming and architecture of parallel processing systems', 1st Edition, Eds. Kleidarithmos, 2011
[4] NVIDIA: 'NVidia CUDA C Programming Guide', version 5.0, NVidia (reference book)
[5] Konstantaras, A.: 'Classification of Distinct Seismic Regions and Regional Temporal Modelling of Seismicity in the Vicinity of the Hellenic Seismic Arc', IEEE Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6 (4), pp. 1857-1863, 2013
[6] Konstantaras, A., Varley, M.R., Valianatos, F., Collins, G. and Holifield, P.: 'Recognition of electric earthquake precursors using neuro-fuzzy models: methodology and simulation results', Proc. IASTED International Conference on Signal Processing Pattern Recognition and Applications (SPPRA 2002), Crete, Greece, pp. 303-308, 2002
[7] Konstantaras, A., Katsifarakis, E., Maravelakis, E., Skounakis, E., Kokkinos, E. and Karapidakis, E.: 'Intelligent Spatial-Clustering of Seismicity in the Vicinity of the Hellenic Seismic Arc', Earth Science Research, vol. 1 (2), pp. 1-10, 2012
[8] Georgoulas, G., Konstantaras, A., Katsifarakis, E., Stylios, C.D., Maravelakis, E. and Vachtsevanos, G.: '"Seismic-Mass" Density-based Algorithm for Spatio-Temporal Clustering', Expert Systems with Applications, vol. 40 (10), pp. 4183-4189, 2013
[9] Konstantaras, A. J.: 'Expert knowledge-based algorithm for the dynamic discrimination of interactive natural clusters', Earth Science Informatics, 2015 (In Press, see: www.scopus.com)
[10] Drakatos, G. and Latoussakis, J.: 'A catalog of aftershock sequences in Greece (1971-1997): Their spatial and temporal characteristics', Journal of Seismology, vol. 5, pp. 137-145, 2001
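In fuzzy clustering each event receives a degree of membership in every cluster, and that computation is independent from one event to the next, which is what makes it a natural fit for one-thread-per-event execution. The sketch below computes memberships with the standard fuzzy c-means formula in plain Python rather than Cuda C. The coordinates, the two cluster centres, and the fuzzifier `m = 2` are assumptions for illustration, and the expert-knowledge step the authors use to choose the number of clusters is not modelled.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster centre."""
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a centre belongs fully to that centre.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]


# Hypothetical cluster centres and event location in arbitrary planar units.
centers = [(0.0, 0.0), (4.0, 0.0)]
event = (1.0, 0.0)

memberships = fuzzy_memberships(event, centers)
print(memberships)
```

For this event, which lies three times closer to the first centre than to the second, the memberships come out at about 0.9 and 0.1.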
The language parallel Pascal and other aspects of the massively parallel processor

NASA Technical Reports Server (NTRS)

Reeves, A. P.; Bruner, J. D.

1982-01-01

A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.
Massively Parallel Processing for Fast and Accurate Stamping Simulations

NASA Astrophysics Data System (ADS)

Gress, Jeffrey J.; Xu, Siguang; Joshi, Ramesh; Wang, Chuan-tao; Paul, Sabu

2005-08-01

The competitive automotive market drives automotive manufacturers to speed up the vehicle development cycles and reduce the lead-time. Fast tooling development is one of the key areas to support fast and short vehicle development programs (VDP). In the past ten years, the stamping simulation has become the most effective validation tool in predicting and resolving all potential formability and quality problems before the dies are physically made. The stamping simulation and formability analysis has become a critical business segment in the GM math-based die engineering process. As the simulation becomes one of the major production tools in the engineering factory, the simulation speed and accuracy are two of the most important measures for stamping simulation technology. The speed and time-in-system of forming analysis become even more critical to support the fast VDP and tooling readiness. Since 1997, General Motors Die Center has been working jointly with our software vendor to develop and implement a parallel version of simulation software for mass production analysis applications. By 2001, this technology was matured in the form of distributed memory processing (DMP) of draw die simulations in a networked distributed memory computing environment. In 2004, this technology was refined to massively parallel processing (MPP) and extended to line die forming analysis (draw, trim, flange, and associated spring-back) running on a dedicated computing environment. The evolution of this technology and the insight gained through the implementation of DMP/MPP technology as well as performance benchmarks are discussed in this publication.
Parallel image compression

NASA Technical Reports Server (NTRS)

Reif, John H.

1987-01-01

A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless text compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.
Backtracking and Re-execution in the Automatic Debugging of Parallelized Programs

NASA Technical Reports Server (NTRS)

Matthews, Gregory; Hood, Robert; Johnson, Stephen; Leggett, Peter; Biegel, Bryan (Technical Monitor)

2002-01-01

In this work we describe a new approach using relative debugging to find differences in computation between a serial program and a parallel version of that program. We use a combination of re-execution and backtracking in order to find the first difference in computation that may ultimately lead to an incorrect value that the user has indicated. In our prototype implementation we use static analysis information from a parallelization tool in order to perform the backtracking as well as the mapping required between serial and parallel computations.
Accelerating electron tomography reconstruction algorithm ICON with GPU.

PubMed

Chen, Yu; Wang, Zihao; Zhang, Jingrong; Li, Lun; Wan, Xiaohua; Sun, Fei; Zhang, Fa

2017-01-01

Electron tomography (ET) plays an important role in studying in situ cell ultrastructure in three-dimensional space. Due to limited tilt angles, ET reconstruction always suffers from the "missing wedge" problem. With a validation procedure, iterative compressed-sensing optimized NUFFT reconstruction (ICON) demonstrates its power in the restoration of validated missing information for low SNR biological ET dataset. However, the huge computational demand has become a major problem for the application of ICON. In this work, we analyzed the framework of ICON and classified the operations of major steps of ICON reconstruction into three types. Accordingly, we designed parallel strategies and implemented them on graphics processing units (GPU) to generate a parallel program ICON-GPU. With high accuracy, ICON-GPU has a great acceleration compared to its CPU version, up to 83.7×, greatly relieving ICON's dependence on computing resource.
An improved parallel fuzzy connected image segmentation method based on CUDA.

PubMed

Wang, Liansheng; Li, Dong; Huang, Shaohui

2016-05-12

Fuzzy connectedness method (FC) is an effective method for extracting fuzzy objects from medical images. However, when FC is applied to large medical image datasets, its running time becomes very long. Therefore, a parallel CUDA version of FC (CUDA-kFOE) was proposed by Ying et al. to accelerate the original FC. Unfortunately, CUDA-kFOE does not consider the edges between GPU blocks, which causes miscalculation of edge points. In this paper, an improved algorithm is proposed by adding a correction step on the edge points. The improved algorithm can greatly enhance the calculation accuracy. In the improved method, an iterative manner is applied. In the first iteration, the affinity computation strategy is changed and a look up table is employed for memory reduction. In the second iteration, the voxels that are in error because of asynchronism are updated again. Three different CT sequences of hepatic vascular with different sizes were used in the experiments with three different seeds. NVIDIA Tesla C2075 is used to evaluate our improved method over these three data sets. Experimental results show that the improved algorithm can achieve a faster segmentation compared to the CPU version and higher accuracy than CUDA-kFOE. The calculation results were consistent with the CPU version, which demonstrates that it corrects the edge point calculation error of the original CUDA-kFOE. The proposed method has a comparable time cost and has fewer errors compared to the original CUDA-kFOE as demonstrated in the experimental results. In the future, we will focus on automatic acquisition method and automatic processing.
Application of Intel Many Integrated Core (MIC) architecture to the Yonsei University planetary boundary layer scheme in Weather Research and Forecasting model

NASA Astrophysics Data System (ADS)

Huang, Melin; Huang, Bormin; Huang, Allen H.

2014-10-01

The Weather Research and Forecasting (WRF) model provides operational services worldwide in many areas and is linked to our daily activity, in particular during severe weather events. The scheme of Yonsei University (YSU) is one of the planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column, determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provides atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation process of the YSU scheme, we employ Intel Many Integrated Core (MIC) Architecture as it is a multiprocessor computer structure with merits of efficient parallelization and vectorization essentials. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
Parallel Performance of a Combustion Chemistry Simulation

DOE PAGES

Skinner, Gregg; Eigenmann, Rudolf

1995-01-01

We used a description of a combustion simulation's mathematical and computational methods to develop a version for parallel execution. The result was a reasonable performance improvement on small numbers of processors. We applied several important programming techniques, which we describe, in optimizing the application. This work has implications for programming languages, compiler design, and software engineering.
Hi-Corrector: a fast, scalable and memory-efficient package for normalizing large-scale Hi-C data.

PubMed

Li, Wenyuan; Gong, Ke; Li, Qingjiao; Alber, Frank; Zhou, Xianghong Jasmine

2015-03-15

Genome-wide proximity ligation assays, e.g. Hi-C and its variant TCC, have recently become important tools to study spatial genome organization. Removing biases from chromatin contact matrices generated by such techniques is a critical preprocessing step of subsequent analyses. The continuing decline of sequencing costs has led to an ever-improving resolution of the Hi-C data, resulting in very large matrices of chromatin contacts. Such large-size matrices, however, pose a great challenge to the memory usage and speed of their normalization. Therefore, there is an urgent need for fast and memory-efficient methods for normalization of Hi-C data. We developed Hi-Corrector, an easy-to-use, open source implementation of the Hi-C data normalization algorithm. Its salient features are (i) scalability: the software is capable of normalizing Hi-C data of any size in reasonable times; (ii) memory efficiency: the sequential version can run on any single computer with very limited memory, no matter how little; (iii) fast speed: the parallel version can run very fast on multiple computing nodes with limited local memory. The sequential version is implemented in ANSI C and can be easily compiled on any system; the parallel version is implemented in ANSI C with the MPI library (a standardized and portable parallel environment designed for solving large-scale scientific problems). The package is freely available at http://zhoulab.usc.edu/Hi-Corrector/. © The Author 2014. Published by Oxford University Press.
Efficient Preconditioning for the p-Version Finite Element Method in Two Dimensions

DTIC Science & Technology

1989-10-01

paper, we study fast parallel preconditioners for systems of equations arising from the p-version finite element method. The p-version finite element...computations and the solution of a relatively small global auxiliary problem. We study two different methods. In the first (Section 3), the global...20], will be studied in the next section. Problem (3.12) is obviously much more easily solved than the original problem and the procedure is highly
Reliability of a science admission test (HAM-Nat) at Hamburg medical school.

PubMed

Hissbach, Johanna; Klusmann, Dietrich; Hampe, Wolfgang

2011-01-01

The University Hospital in Hamburg (UKE) started to develop a test of knowledge in natural sciences for admission to medical school in 2005 (Hamburger Auswahlverfahren für Medizinische Studiengänge, Naturwissenschaftsteil, HAM-Nat). This study is a step towards establishing the HAM-Nat. We are investigating parallel forms reliability, the effect of a crash course in chemistry on test results, and correlations of HAM-Nat test results with a test of scientific reasoning (similar to a subtest of the "Test for Medical Studies", TMS). 316 first-year students participated in the study in 2007. They completed different versions of the HAM-Nat test which consisted of items that had already been used (HN2006) and new items (HN2007). Four weeks later half of the participants were tested on the HN2007 version of the HAM-Nat again, while the other half completed the test of scientific reasoning. Within this four-week interval students were offered a five-day chemistry course. Parallel forms reliability for four different test versions ranged from r(tt)=.53 to r(tt)=.67. The retest reliabilities of the HN2007 halves were r(tt)=.54 and r(tt)=.61. Correlations of the two HAM-Nat versions with the test of scientific reasoning were r=.34 and r=.21. The crash course in chemistry had no effect on HAM-Nat scores. The results suggest that further versions of the test of natural sciences will not easily conform to the standards of internal consistency, parallel-forms reliability and retest reliability. Much care has to be taken in order to assemble items which could be used interchangeably for the construction of new test versions. The test of scientific reasoning and the HAM-Nat are tapping different constructs. Participation in a chemistry course did not improve students' achievement, probably because the content of the course was not coordinated with the test and many students lacked motivation to do well in the second test.
Reliability of a science admission test (HAM-Nat) at Hamburg medical school

PubMed Central

Hissbach, Johanna; Klusmann, Dietrich; Hampe, Wolfgang

2011-01-01

Objective: The University Hospital in Hamburg (UKE) started to develop a test of knowledge in natural sciences for admission to medical school in 2005 (Hamburger Auswahlverfahren für Medizinische Studiengänge, Naturwissenschaftsteil, HAM-Nat). This study is a step towards establishing the HAM-Nat. We are investigating parallel forms reliability, the effect of a crash course in chemistry on test results, and correlations of HAM-Nat test results with a test of scientific reasoning (similar to a subtest of the "Test for Medical Studies", TMS). Methods: 316 first-year students participated in the study in 2007. They completed different versions of the HAM-Nat test which consisted of items that had already been used (HN2006) and new items (HN2007). Four weeks later half of the participants were tested on the HN2007 version of the HAM-Nat again, while the other half completed the test of scientific reasoning. Within this four-week interval students were offered a five-day chemistry course. Results: Parallel forms reliability for four different test versions ranged from rtt=.53 to rtt=.67. The retest reliabilities of the HN2007 halves were rtt=.54 and rtt=.61. Correlations of the two HAM-Nat versions with the test of scientific reasoning were r=.34 and r=.21. The crash course in chemistry had no effect on HAM-Nat scores. Conclusions: The results suggest that further versions of the test of natural sciences will not easily conform to the standards of internal consistency, parallel-forms reliability and retest reliability. Much care has to be taken in order to assemble items which could be used interchangeably for the construction of new test versions. The test of scientific reasoning and the HAM-Nat are tapping different constructs. Participation in a chemistry course did not improve students’ achievement, probably because the content of the course was not coordinated with the test and many students lacked motivation to do well in the second test. PMID:21866246
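Parallel-forms reliability of the kind reported in this record is conventionally estimated as the Pearson correlation between the scores that the same people obtain on two test forms. The following is a minimal sketch of that calculation; the function name `pearson_r` and the score lists are hypothetical and invented for illustration, and they are not data from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores of the same six students on two test forms (invented).
form_a = [12, 15, 9, 20, 17, 11]
form_b = [14, 13, 10, 18, 19, 9]
print(f"parallel-forms reliability r_tt = {pearson_r(form_a, form_b):.2f}")
```

The sketch does not guard against a form on which every participant has the same score; in that case the denominator is zero and the correlation is undefined.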
Dynamic grid refinement for partial differential equations on parallel computers

NASA Technical Reports Server (NTRS)

Mccormick, S.; Quinlan, D.

1989-01-01

The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems.
Parallel short forms for the assessment of activities of daily living in cardiovascular rehabilitation patients (PADL-cardio): development and validation.

PubMed

Schmucker, Andreas; Abberger, Birgit; Boecker, Maren; Baumeister, Harald

2017-11-26

To develop and validate parallel short forms for the assessment of activities of daily living in cardiac rehabilitation patients (PADL-cardio I & II). PADL-cardio I & II were developed based on a sample of 106 patients [mean age = 57.6; standard deviation (SD) = 11.1; 72.6% males] using Rasch analysis and validated with a sample of 81 patients (mean age = 59.1; SD = 11.1; 88.9% males). All patients answered PADL-cardio and the Short Form 12 Health Survey. Both versions of PADL-cardio are composed of 10 items. The fit to the Rasch model was documented by a non-significant item-trait interaction score (PADL-cardio I: χ² = 31.08, df = 30, p = 0.41; PADL-cardio II: χ² = 45.6, df = 40, p = 0.25). The two versions were free of differential item functioning. Person-separation reliability was 0.72/0.78 and unidimensionality was given. The two versions correlated with r = 0.98 and the correlation between PADL-cardio and the underlying item bank was 0.99 for both versions. Concurrent validity is indicated through correlations with the Short Form 12 Health Survey (r = -0.37 to -0.40). PADL-cardio provides a short and psychometrically sound option for the assessment of activities of daily living in cardiovascular rehabilitation patients. The two versions of PADL-cardio are equivalent. Hence, they can be used to reduce practice and retest effects in repeated measurement, facilitating the longitudinal assessment of activities of daily living. Implications for Rehabilitation: New parallel test forms for the assessment of activities of daily living in cardiac rehabilitation (PADL-cardio I & PADL-cardio II) are available. PADL-cardio I & II consist of 10 items and are therefore especially timesaving. Concurrent validity is given through correlations with the Short Form Health Survey 12. Therapeutic success could be determined more precisely by the parallel forms reducing practice and retest effects.
Comparison of Alternate and Original Items on the Montreal Cognitive Assessment.

PubMed

Lebedeva, Elena; Huang, Mei; Koski, Lisa

2016-03-01

The Montreal Cognitive Assessment (MoCA) is a screening tool for mild cognitive impairment (MCI) in elderly individuals. We hypothesized that measurement error when using the new alternate MoCA versions to monitor change over time could be related to the use of items that are not of comparable difficulty to their corresponding originals of similar content. The objective of this study was to compare the difficulty of the alternate MoCA items to the original ones. Five selected items from alternate versions of the MoCA were included with items from the original MoCA administered adaptively to geriatric outpatients (N = 78). Rasch analysis was used to estimate the difficulty level of the items. None of the five items from the alternate versions matched the difficulty level of their corresponding original items. This study demonstrates the potential benefits of a Rasch analysis-based approach for selecting items during the process of development of parallel forms. The results suggest that better match of the items from different MoCA forms by their difficulty would result in higher sensitivity to changes in cognitive function over time.
Unstructured Adaptive (UA) NAS Parallel Benchmark. Version 1.0

NASA Technical Reports Server (NTRS)

Feng, Huiyu; VanderWijngaart, Rob; Biswas, Rupak; Mavriplis, Catherine

2004-01-01

We present a complete specification of a new benchmark for measuring the performance of modern computer systems when solving scientific problems featuring irregular, dynamic memory accesses. It complements the existing NAS Parallel Benchmark suite. The benchmark involves the solution of a stylized heat transfer problem in a cubic domain, discretized on an adaptively refined, unstructured mesh.

Evolving binary classifiers through parallel computation of multiple fitness cases.

PubMed

Cagnoni, Stefano; Bergenti, Federico; Mordonini, Monica; Adorni, Giovanni

2005-06-01

This paper describes two versions of a novel approach to developing binary classifiers, based on two evolutionary computation paradigms: cellular programming and genetic programming. Such an approach achieves high computation efficiency both during evolution and at runtime. Evolution speed is optimized by allowing multiple solutions to be computed in parallel. Runtime performance is optimized explicitly using parallel computation in the case of cellular programming or implicitly taking advantage of the intrinsic parallelism of bitwise operators on standard sequential architectures in the case of genetic programming. The approach was tested on a digit recognition problem and compared with a reference classifier.
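The "intrinsic parallelism of bitwise operators" mentioned in this abstract can be illustrated with a small sketch. It assumes a hypothetical candidate classifier written as a Boolean expression over three binary features; each feature column is packed into one integer with fitness case i stored in bit i, so a single bitwise operation evaluates every fitness case at once. The function names, the expression, and the data are illustrative assumptions, not the authors' implementation.

```python
def pack_bits(values):
    """Pack a list of 0/1 values into one integer, with case i stored in bit i."""
    word = 0
    for i, v in enumerate(values):
        if v:
            word |= 1 << i
    return word

def candidate(x0, x1, x2, mask):
    """Hypothetical evolved classifier (x0 AND NOT x1) OR x2, applied to all cases at once."""
    return ((x0 & ~x1) | x2) & mask

def fitness(x0, x1, x2, target, n_cases):
    """Count the fitness cases on which the candidate output equals the target bit."""
    mask = (1 << n_cases) - 1
    output = candidate(x0, x1, x2, mask)
    agree = ~(output ^ target) & mask  # a 1 bit wherever output and target match
    return bin(agree).count("1")

# Eight hypothetical fitness cases: three binary features and one label per case.
f0 = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 1, 1, 0, 0, 1, 1]
f2 = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [0, 1, 0, 1, 1, 1, 0, 1]

score = fitness(pack_bits(f0), pack_bits(f1), pack_bits(f2), pack_bits(labels), 8)
print(score, "of 8 fitness cases classified correctly")
```

On a 64-bit machine the same idea evaluates 64 fitness cases per machine instruction; Python integers are arbitrary precision, so the sketch is not limited to one machine word.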
Overview of the NCC

NASA Technical Reports Server (NTRS)

Liu, Nan-Suey

2001-01-01

A multi-disciplinary design/analysis tool for combustion systems is critical for optimizing the low-emission, high-performance combustor design process. Based on discussions between the then NASA Lewis Research Center and the jet engine companies, an industry-government team was formed in early 1995 to develop the National Combustion Code (NCC), which is an integrated system of computer codes for the design and analysis of combustion systems. NCC has advanced features that address the need to meet designers' requirements such as "assured accuracy", "fast turnaround", and "acceptable cost". The NCC development team is comprised of Allison Engine Company (Allison), CFD Research Corporation (CFDRC), GE Aircraft Engines (GEAE), NASA Glenn Research Center (LeRC), and Pratt & Whitney (P&W). The "unstructured mesh" capability and "parallel computing" are fundamental features of NCC from its inception. The NCC system is composed of a set of "elements" which includes grid generator, main flow solver, turbulence module, turbulence and chemistry interaction module, chemistry module, spray module, radiation heat transfer module, data visualization module, and a post-processor for evaluating engine performance parameters. Each element may have contributions from several team members. Such a multi-source multi-element system needs to be integrated in a way that facilitates inter-module data communication, flexibility in module selection, and ease of integration. The development of the NCC beta version was essentially completed in June 1998. Technical details of the NCC elements are given in the Reference List. Elements such as the baseline flow solver, turbulence module, and the chemistry module, have been extensively validated; and their parallel performance on large-scale parallel systems has been evaluated and optimized. However, the scalar PDF module and the Spray module, as well as their coupling with the baseline flow solver, were developed in a small-scale distributed computing environment. As a result, the validation of the NCC beta version as a whole was quite limited. Current effort has been focused on the validation of the integrated code and the evaluation/optimization of its overall performance on large-scale parallel systems.
Tool for Rapid Analysis of Monte Carlo Simulations

NASA Technical Reports Server (NTRS)

Restrepo, Carolina; McCall, Kurt E.; Hurtado, John E.

2011-01-01

Designing a spacecraft, or any other complex engineering system, requires extensive simulation and analysis work. Oftentimes, the large amounts of simulation data generated are very difficult and time consuming to analyze, with the added risk of overlooking potentially critical problems in the design. The authors have developed a generic data analysis tool that can quickly sort through large data sets and point an analyst to the areas in the data set that cause specific types of failures. The Tool for Rapid Analysis of Monte Carlo simulations (TRAM) has been used in recent design and analysis work for the Orion vehicle, greatly decreasing the time it takes to evaluate performance requirements. A previous version of this tool was developed to automatically identify driving design variables in Monte Carlo data sets. This paper describes a new, parallel version of TRAM implemented on a graphical processing unit, and presents analysis results for NASA's Orion Monte Carlo data to demonstrate its capabilities.
GPU accelerated implementation of NCI calculations using promolecular density.

PubMed

Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric

2017-05-30

The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The code performances of three versions are examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which reduces drastically the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.
Streaming data analytics via message passing with application to graph algorithms

DOE PAGES

Plimpton, Steven J.; Shead, Tim

2014-05-06

The need to process streaming data, which arrives continuously at high-volume in real-time, arises in a variety of contexts including data produced by experiments, collections of environmental or network sensors, and running simulations. Streaming data can also be formulated as queries or transactions which operate on a large dynamic data store, e.g. a distributed database. We describe a lightweight, portable framework named PHISH which enables a set of independent processes to compute on a stream of data in a distributed-memory parallel manner. Datums are routed between processes in patterns defined by the application. PHISH can run on top of either message-passing via MPI or sockets via ZMQ. The former means streaming computations can be run on any parallel machine which supports MPI; the latter allows them to run on a heterogeneous, geographically dispersed network of machines. We illustrate how PHISH can support streaming MapReduce operations, and describe streaming versions of three algorithms for large, sparse graph analytics: triangle enumeration, subgraph isomorphism matching, and connected component finding. Lastly, we also provide benchmark timings for MPI versus socket performance of several kernel operations useful in streaming algorithms.
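Of the three streaming graph algorithms listed in this abstract, connected component finding is the simplest to sketch. The example below is a single-process illustration that uses a union-find structure over an edge stream; it is not the PHISH API and not the distributed algorithm the authors describe, and every name in it (`StreamingComponents`, `edge_stream`) is a hypothetical chosen for illustration.

```python
class StreamingComponents:
    """Maintain connected components while edges arrive one at a time."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        # Register unseen vertices, then walk to the root with path halving.
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def add_edge(self, u, v):
        # Merge the components of u and v; the edge itself is not stored.
        root_u, root_v = self.find(u), self.find(v)
        if root_u != root_v:
            self.parent[root_u] = root_v

    def components(self):
        # Group every vertex seen so far by its current root.
        groups = {}
        for v in list(self.parent):
            groups.setdefault(self.find(v), set()).add(v)
        return list(groups.values())

def edge_stream():
    """Hypothetical stand-in for edges arriving from a sensor or a socket."""
    yield from [(1, 2), (3, 4), (2, 5), (6, 6), (4, 7)]

cc = StreamingComponents()
for u, v in edge_stream():
    cc.add_edge(u, v)
print(sorted(sorted(c) for c in cc.components()))
```

Each edge is processed once and then discarded, which is the property that makes the approach usable on a stream: memory grows with the number of vertices rather than the number of edges.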
64 x 64 thresholding photodetector array for optical pattern recognition

NASA Astrophysics Data System (ADS)

Langenbacher, Harry; Chao, Tien-Hsin; Shaw, Timothy; Yu, Jeffrey W.

1993-10-01

A high performance 32 X 32 peak detector array is introduced. This detector consists of a 32 X 32 array of thresholding photo-transistor cells, manufactured with a standard MOSIS digital 2-micron CMOS process. A built-in thresholding function that is able to perform 1024 thresholding operations in parallel strongly distinguishes this chip from available CCD detectors. This high speed detector offers responses from one to 10 milliseconds that is much higher than the commercially available CCD detectors operating at a TV frame rate. The parallel multiple peaks thresholding detection capability makes it particularly suitable for optical correlator and optoelectronically implemented neural networks. The principle of operation, circuit design and the performance characteristics are described. Experimental demonstration of correlation peak detection is also provided. Recently, we have also designed and built an advanced version of a 64 X 64 thresholding photodetector array chip. Experimental investigation of using this chip for pattern recognition is ongoing.
Development and performance of a new version of the OASIS coupler, OASIS3-MCT_3.0

NASA Astrophysics Data System (ADS)

Craig, Anthony; Valcke, Sophie; Coquart, Laure

2017-09-01

OASIS is coupling software developed primarily for use in the climate community. It provides the ability to couple different models with low implementation and performance overhead. OASIS3-MCT is the latest version of OASIS. It includes several improvements compared to OASIS3, including elimination of a separate hub coupler process, parallelization of the coupling communication and run-time grid interpolation, and the ability to easily reuse mapping weight files. OASIS3-MCT_3.0 is the latest release and includes the ability to couple between components running sequentially on the same set of tasks as well as to couple within a single component between different grids or decompositions such as physics, dynamics, and I/O. OASIS3-MCT has been tested with different configurations on up to 32 000 processes, with components running on high-resolution grids with up to 1.5 million grid cells, and with over 10 000 2-D coupling fields. Several new features will be available in OASIS3-MCT_4.0, and some of those are also described.
Current and planned numerical development for improving computing performance for long duration and/or low pressure transients

DOE Office of Scientific and Technical Information (OSTI.GOV)

Faydide, B.

1997-07-01

This paper presents the current and planned numerical development for improving computing performance in case of Cathare applications needing real time, like simulator applications. Cathare is a thermalhydraulic code developed by CEA (DRN), IPSN, EDF and FRAMATOME for PWR safety analysis. First, the general characteristics of the code are presented, dealing with physical models, numerical topics, and validation strategy. Then, the current and planned applications of Cathare in the field of simulators are discussed. Some of these applications were made in the past, using a simplified and fast-running version of Cathare (Cathare-Simu); the status of the numerical improvements obtained with Cathare-Simu is presented. The planned developments concern mainly the Simulator Cathare Release (SCAR) project which deals with the use of the most recent version of Cathare inside simulators. In this frame, the numerical developments are related with the speed up of the calculation process, using parallel processing and improvement of code reliability on a large set of NPP transients.
Fast prediction of RNA-RNA interaction using heuristic algorithm.

PubMed

Montaseri, Soheila

2015-01-01

Interaction between two RNA molecules plays a crucial role in many medical and biological processes such as gene expression regulation. In this process, an RNA molecule prohibits the translation of another RNA molecule by establishing stable interactions with it. Some algorithms have been formed to predict the structure of the RNA-RNA interaction. High computational time is a common challenge in most of the presented algorithms. In this context, a heuristic method is introduced to accurately predict the interaction between two RNAs based on minimum free energy (MFE). This algorithm uses a few dot matrices for finding the secondary structure of each RNA and binding sites between two RNAs. Furthermore, a parallel version of this method is presented. We describe the algorithm's concurrency and parallelism for a multicore chip. The proposed algorithm has been performed on some datasets including CopA-CopT, R1inv-R2inv, Tar-Tar*, DIS-DIS, and IncRNA54-RepZ in Escherichia coli bacteria. The method has high validity and efficiency, and it is run in low computational time in comparison to other approaches.
CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU.

PubMed

Jiang, Hanyu; Ganesan, Narayan

2016-02-27

HMMER software suite is widely used for analysis of homologous protein and nucleotide sequences with high sensitivity. The latest version of hmmsearch in HMMER 3.x, utilizes heuristic-pipeline which consists of MSV/SSV (Multiple/Single ungapped Segment Viterbi) stage, P7Viterbi stage and the Forward scoring stage to accelerate homology detection. Since the latest version is highly optimized for performance on modern multi-core CPUs with SSE capabilities, only a few acceleration attempts report speedup. However, the most compute intensive tasks within the pipeline (viz., MSV/SSV and P7Viterbi stages) still stand to benefit from the computational capabilities of massively parallel processors. A Multi-Tiered Parallel Framework (CUDAMPF) implemented on CUDA-enabled GPUs presented here, offers a finer-grained parallelism for MSV/SSV and Viterbi algorithms. We couple SIMT (Single Instruction Multiple Threads) mechanism with SIMD (Single Instructions Multiple Data) video instructions with warp-synchronism to achieve high-throughput processing and eliminate thread idling. We also propose a hardware-aware optimal allocation scheme of scarce resources like on-chip memory and caches in order to boost performance and scalability of CUDAMPF. In addition, runtime compilation via NVRTC available with CUDA 7.0 is incorporated into the presented framework that not only helps unroll innermost loop to yield upto 2 to 3-fold speedup than static compilation but also enables dynamic loading and switching of kernels depending on the query model size, in order to achieve optimal performance. CUDAMPF is designed as a hardware-aware parallel framework for accelerating computational hotspots within the hmmsearch pipeline as well as other sequence alignment applications. It achieves significant speedup by exploiting hierarchical parallelism on single GPU and takes full advantage of limited resources based on their own performance features. 
In addition to exceeding the performance of other acceleration attempts, comprehensive evaluations against high-end CPUs (Intel i5, i7 and Xeon) show that CUDAMPF yields up to 440 GCUPS for SSV, 277 GCUPS for MSV and 14.3 GCUPS for P7Viterbi, all with 100% accuracy, which translates to a maximum speedup of 37.5, 23.1 and 11.6-fold for MSV, SSV and P7Viterbi respectively. The source code is available at https://github.com/Super-Hippo/CUDAMPF.
Adjusting process count on demand for petascale global optimization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sosonkina, Masha; Watson, Layne T.; Radcliffe, Nicholas R.

2012-11-23

There are many challenges that need to be met before efficient and reliable computation at the petascale is possible. Many scientific and engineering codes running at the petascale are likely to be memory intensive, which makes thrashing a serious problem for many petascale applications. One way to overcome this challenge is to use a dynamic number of processes, so that the total amount of memory available for the computation can be increased on demand. This paper describes modifications made to the massively parallel global optimization code pVTdirect in order to allow for a dynamic number of processes. In particular, the modified version of the code monitors memory use and spawns new processes if the amount of available memory is determined to be insufficient. The primary design challenges are discussed, and performance results are presented and analyzed.
StrAuto: automation and parallelization of STRUCTURE analysis.

PubMed

Chhatre, Vikram E; Emerson, Kevin J

2017-03-24

Population structure inference using the software STRUCTURE has become an integral part of population genetic studies covering a broad spectrum of taxa including humans. The ever-expanding size of genetic data sets poses computational challenges for this analysis. Although at least one tool currently implements parallel computing to reduce computational overload of this analysis, it does not fully automate the use of replicate STRUCTURE analysis runs required for downstream inference of optimal K. There is pressing need for a tool that can deploy population structure analysis on high performance computing clusters. We present an updated version of the popular Python program StrAuto, to streamline population structure analysis using parallel computing. StrAuto implements a pipeline that combines STRUCTURE analysis with the Evanno Δ K analysis and visualization of results using STRUCTURE HARVESTER. Using benchmarking tests, we demonstrate that StrAuto significantly reduces the computational time needed to perform iterative STRUCTURE analysis by distributing runs over two or more processors. StrAuto is the first tool to integrate STRUCTURE analysis with post-processing using a pipeline approach in addition to implementing parallel computation - a set up ideal for deployment on computing clusters. StrAuto is distributed under the GNU GPL (General Public License) and available to download from http://strauto.popgen.org .
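The distribution of replicate runs over processors described above can be sketched as follows. This is a minimal illustration only, not StrAuto's actual code: `run_replicate` is a hypothetical stand-in for launching one STRUCTURE run, and the K range, replicate count and worker count are assumed values.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_replicate(job):
    """Hypothetical stand-in for one STRUCTURE run at a given K and replicate."""
    k, rep = job
    # A real pipeline would launch the external program here, for example
    # with subprocess.run(...); this sketch only records the job.
    return {"K": k, "replicate": rep, "status": "done"}


def run_all(k_values, n_replicates, n_workers=2):
    """Distribute every (K, replicate) job over a pool of workers."""
    jobs = list(product(k_values, range(1, n_replicates + 1)))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves job order, so results line up with jobs.
        return list(pool.map(run_replicate, jobs))


results = run_all(k_values=range(1, 4), n_replicates=5, n_workers=2)
print(len(results), "runs completed")
```

Threads are enough in this sketch because each job would only wait on an external process; the replicate results for every K can then be passed on to the downstream inference of optimal K.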
Using Serial and Discrete Digit Naming to Unravel Word Reading Processes

PubMed Central

Altani, Angeliki; Protopapas, Athanassios; Georgiou, George K.

2018-01-01

During reading acquisition, word recognition is assumed to undergo a developmental shift from slow serial/sublexical processing of letter strings to fast parallel processing of whole word forms. This shift has been proposed to be detected by examining the size of the relationship between serial- and discrete-trial versions of word reading and rapid naming tasks. Specifically, a strong association between serial naming of symbols and single word reading suggests that words are processed serially, whereas a strong association between discrete naming of symbols and single word reading suggests that words are processed in parallel as wholes. In this study, 429 Grade 1, 3, and 5 English-speaking Canadian children were tested on serial and discrete digit naming and word reading. Across grades, single word reading was more strongly associated with discrete naming than with serial naming of digits, indicating that short high-frequency words are processed as whole units early in the development of reading ability in English. In contrast, serial naming was not a unique predictor of single word reading across grades, suggesting that within-word sequential processing was not required for the successful recognition for this set of words. Factor mixture analysis revealed that our participants could be clustered into two classes, namely beginning and more advanced readers. Serial naming uniquely predicted single word reading only among the first class of readers, indicating that novice readers rely on a serial strategy to decode words. Yet, a considerable proportion of Grade 1 students were assigned to the second class, evidently being able to process short high-frequency words as unitized symbols. We consider these findings together with those from previous studies to challenge the hypothesis of a binary distinction between serial/sublexical and parallel/lexical processing in word reading. 
We argue instead that sequential processing in word reading operates on a continuum, depending on the level of reading proficiency, the degree of orthographic transparency, and word-specific characteristics. PMID:29706918
Next-generation acceleration and code optimization for light transport in turbid media using GPUs

PubMed Central

Alerstam, Erik; Lo, William Chun Yip; Han, Tianyi David; Rose, Jonathan; Andersson-Engels, Stefan; Lilge, Lothar

2010-01-01

A highly optimized Monte Carlo (MC) code package for simulating light transport is developed on the latest graphics processing unit (GPU) built for general-purpose computing from NVIDIA - the Fermi GPU. In biomedical optics, the MC method is the gold standard approach for simulating light transport in biological tissue, both due to its accuracy and its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for PDT, is limited by their long computation time. Despite its parallel nature, optimizing MC code on the GPU has been shown to be a challenge, particularly when the sharing of simulation result matrices among many parallel threads demands the frequent use of atomic instructions to access the slow GPU global memory. This paper proposes an optimization scheme that utilizes the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was successfully accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as an open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners (http://code.google.com/p/gpumcml). PMID:21258498
PFLOTRAN-E4D: A parallel open source PFLOTRAN module for simulating time-lapse electrical resistivity data

DOE Office of Scientific and Technical Information (OSTI.GOV)

Johnson, Timothy C.; Hammond, Glenn E.; Chen, Xingyuan

Time-lapse electrical resistivity tomography (ERT) is finding increased application for remotely monitoring processes occurring in the near subsurface in three dimensions (i.e. 4D monitoring). However, there are few codes capable of simulating the evolution of subsurface resistivity and corresponding tomographic measurements arising from a particular process, particularly in parallel and with an open source license. Herein we describe and demonstrate an electrical resistivity tomography module for the PFLOTRAN subsurface simulation code, named PFLOTRAN-E4D. The PFLOTRAN-E4D module operates in parallel using a dedicated set of compute cores in a master-slave configuration. At each time step, the master process receives subsurface states from PFLOTRAN, converts those states to bulk electrical conductivity, and instructs the slave processes to simulate a tomographic data set. The resulting multi-physics simulation capability enables accurate feasibility studies for ERT imaging, the identification of the ERT signatures that are unique to a given process, and facilitates the joint inversion of ERT data with hydrogeological data for subsurface characterization. PFLOTRAN-E4D is demonstrated herein using a field study of stage-driven groundwater/river water interaction ERT monitoring along the Columbia River, Washington, USA. Results demonstrate the complex nature of changes in subsurface electrical conductivity, in both the saturated and unsaturated zones, arising from water table changes and from river water intrusion into the aquifer. The results also demonstrate the sensitivity of surface based ERT measurements to those changes over time. PFLOTRAN-E4D is available with the PFLOTRAN development version with an open-source license at https://bitbucket.org/pflotran/pflotran-dev .
PFLOTRAN-E4D: A parallel open source PFLOTRAN module for simulating time-lapse electrical resistivity data

NASA Astrophysics Data System (ADS)

Johnson, Timothy C.; Hammond, Glenn E.; Chen, Xingyuan

2017-02-01

Time-lapse electrical resistivity tomography (ERT) is finding increased application for remotely monitoring processes occurring in the near subsurface in three dimensions (i.e. 4D monitoring). However, there are few codes capable of simulating the evolution of subsurface resistivity and corresponding tomographic measurements arising from a particular process, particularly in parallel and with an open source license. Herein we describe and demonstrate an electrical resistivity tomography module for the PFLOTRAN subsurface flow and reactive transport simulation code, named PFLOTRAN-E4D. The PFLOTRAN-E4D module operates in parallel using a dedicated set of compute cores in a master-slave configuration. At each time step, the master process receives subsurface states from PFLOTRAN, converts those states to bulk electrical conductivity, and instructs the slave processes to simulate a tomographic data set. The resulting multi-physics simulation capability enables accurate feasibility studies for ERT imaging, the identification of the ERT signatures that are unique to a given process, and facilitates the joint inversion of ERT data with hydrogeological data for subsurface characterization. PFLOTRAN-E4D is demonstrated herein using a field study of stage-driven groundwater/river water interaction ERT monitoring along the Columbia River, Washington, USA. Results demonstrate the complex nature of subsurface electrical conductivity changes, in both the saturated and unsaturated zones, arising from river stage fluctuations and associated river water intrusion into the aquifer. The results also demonstrate the sensitivity of surface based ERT measurements to those changes over time. PFLOTRAN-E4D is available with the PFLOTRAN development version with an open-source license at https://bitbucket.org/pflotran/pflotran-dev.
High-performance computational fluid dynamics: a custom-code approach

NASA Astrophysics Data System (ADS)

Fannon, James; Loiseau, Jean-Christophe; Valluri, Prashant; Bethune, Iain; Náraigh, Lennon Ó.

2016-07-01

We introduce a modified and simplified version of the pre-existing fully parallelized three-dimensional Navier-Stokes flow solver known as TPLS. We demonstrate how the simplified version can be used as a pedagogical tool for the study of computational fluid dynamics (CFDs) and parallel computing. TPLS is at its heart a two-phase flow solver, and uses calls to a range of external libraries to accelerate its performance. However, in the present context we narrow the focus of the study to basic hydrodynamics and parallel computing techniques, and the code is therefore simplified and modified to simulate pressure-driven single-phase flow in a channel, using only relatively simple Fortran 90 code with MPI parallelization, but no calls to any other external libraries. The modified code is analysed in order to both validate its accuracy and investigate its scalability up to 1000 CPU cores. Simulations are performed for several benchmark cases in pressure-driven channel flow, including a turbulent simulation, wherein the turbulence is incorporated via the large-eddy simulation technique. The work may be of use to advanced undergraduate and graduate students as an introductory study in CFDs, while also providing insight for those interested in more general aspects of high-performance computing.
Deterministic and stochastic methods of calculation of polarization characteristics of radiation in natural environment

NASA Astrophysics Data System (ADS)

Strelkov, S. A.; Sushkevich, T. A.; Maksakova, S. V.

2017-11-01

We discuss world-class Russian achievements in the theory of radiation transfer that takes polarization in natural media into account, and the scientific potential currently developing in Russia, which provides an adequate methodological basis for theoretical and computational research on radiation processes and radiation fields in natural media using supercomputers and massive parallelism. A new version of the matrix transfer operator is proposed for solving problems of polarized radiation transfer in heterogeneous media by the method of influence functions, in which deterministic and stochastic methods can be combined.
Parallel performance of TORT on the CRAY J90: Model and measurement

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barnett, A.; Azmy, Y.Y.

1997-10-01

A limitation on the parallel performance of TORT on the CRAY J90 is the amount of extra work introduced by the multitasking algorithm itself. The extra work beyond that of the serial version of the code, called overhead, arises from the synchronization of the parallel tasks and the accumulation of results by the master task. The goal of recent updates to TORT was to reduce the time consumed by these activities. To help understand which components of the multitasking algorithm contribute significantly to the overhead, a parallel performance model was constructed and compared to measurements of actual timings of the code.
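The role of overhead in such a model can be illustrated with a generic sketch. The functional form below is an assumption made for illustration (ideal division of the serial work plus synchronization and accumulation costs that grow linearly with the task count); it is not the model constructed for TORT, and all timings are invented.

```python
def parallel_time(t_serial, n_tasks, t_sync, t_accum):
    """Assumed model: ideal parallel work plus per-task overhead terms."""
    overhead = n_tasks * (t_sync + t_accum)
    return t_serial / n_tasks + overhead


def speedup(t_serial, n_tasks, t_sync, t_accum):
    """Serial time divided by modelled parallel time."""
    return t_serial / parallel_time(t_serial, n_tasks, t_sync, t_accum)


# Hypothetical timings in seconds, chosen only to show the trend.
for p in (1, 2, 4, 8, 16):
    print(p, round(speedup(100.0, p, t_sync=0.1, t_accum=0.2), 2))
```

Under these assumptions the speedup falls increasingly short of the task count as tasks are added, which is the kind of behaviour a comparison between a model and measured timings is meant to explain.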

Accelerating solutions of one-dimensional unsteady PDEs with GPU-based swept time-space decomposition

NASA Astrophysics Data System (ADS)

Magee, Daniel J.; Niemeyer, Kyle E.

2018-03-01

The expedient design of precision components in aerospace and other high-tech industries requires simulations of physical phenomena often described by partial differential equations (PDEs) without exact solutions. Modern design problems require simulations with a level of resolution difficult to achieve in reasonable amounts of time, even in effectively parallelized solvers. Though the scale of the problem relative to available computing power is the greatest impediment to accelerating these applications, significant performance gains can be achieved through careful attention to the details of memory communication and access. The swept time-space decomposition rule reduces communication between sub-domains by exhausting the domain of influence before communicating boundary values. Here we present a GPU implementation of the swept rule, which modifies the algorithm for improved performance on this processing architecture by prioritizing use of private (shared) memory, avoiding interblock communication, and overwriting unnecessary values. It shows significant improvement in the execution time of finite-difference solvers for one-dimensional unsteady PDEs, producing speedups of 2-9× compared with simple GPU versions and 7-300× compared with parallel CPU versions over a range of problem sizes. However, for a more sophisticated one-dimensional system of equations discretized with a second-order finite-volume scheme, the swept rule performs 1.2-1.9× worse than a standard implementation for all problem sizes.
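The idea of exhausting the domain of influence before communicating can be shown with a small CPU-side sketch. It covers only the first communication-free phase of a swept decomposition for an explicit three-point heat-equation stencil, and it is not the authors' GPU implementation; the grid size, the sub-domain bounds and the coefficient `r` (standing for alpha*dt/dx^2) are assumed values.

```python
def heat_step(u, r):
    """One explicit step of u_t = alpha * u_xx with a 3-point stencil.

    Only points with both neighbours available are updated, so the
    result is two points shorter than the input.
    """
    return [u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
            for i in range(1, len(u) - 1)]


def advance_block(block, r, n_steps):
    """Advance a sub-domain n_steps in time with no boundary exchange."""
    for _ in range(n_steps):
        block = heat_step(block, r)
    return block


def advance_periodic(u, r, n_steps):
    """Reference: whole periodic domain, boundaries refreshed every step."""
    for _ in range(n_steps):
        padded = [u[-1]] + u + [u[0]]
        u = heat_step(padded, r)
    return u


u0 = [float((i * 7) % 11) for i in range(32)]  # arbitrary initial data
r, k = 0.25, 4
lo, hi = 8, 24                                 # sub-domain owns u0[lo:hi]

local = advance_block(u0[lo:hi], r, k)         # valid for points lo+k .. hi-k-1
reference = advance_periodic(u0, r, k)[lo + k:hi - k]
print(max(abs(a - b) for a, b in zip(local, reference)))
```

Each step shrinks the valid region by one point per side, so a block of 16 points yields 8 correct values after 4 steps without any communication; only after that would boundary values need to be exchanged.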
Blood Pressure Control

NASA Technical Reports Server (NTRS)

1992-01-01

Engineering Development Lab., Inc.'s E-2000 Neck Baro Reflex System was developed for cardiovascular studies of astronauts. It is regularly used on Space Shuttle Missions, and a parallel version has been developed as a research tool to facilitate studies of blood pressure reflex controls in patients with congestive heart failure, diabetes, etc. An advanced version, the PPC-1000, was developed in 1991, and the technology has been refined substantially. The PPC provides an accurate means of generating pressure for a broad array of laboratory applications. An improved version, the E2010 Barosystem, is anticipated.
Random number generators for large-scale parallel Monte Carlo simulations on FPGA

NASA Astrophysics Data System (ADS)

Lin, Y.; Wang, F.; Liu, B.

2018-05-01

Through parallelization, field programmable gate arrays (FPGAs) can achieve unprecedented speeds in large-scale parallel Monte Carlo (LPMC) simulations. FPGAs present both new constraints and new opportunities for the implementation of random number generators (RNGs), which are key elements of any Monte Carlo (MC) simulation system. Using empirical and application-based tests, this study evaluates all four RNGs used in previous FPGA-based MC studies, together with newly proposed FPGA implementations of two well-known high-quality RNGs that are suitable for LPMC studies on FPGA. One of the newly proposed FPGA implementations, a parallel version of the additive lagged Fibonacci generator (Parallel ALFG), is found to be the best among the evaluated RNGs in fulfilling the needs of LPMC simulations on FPGA.
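For readers unfamiliar with the additive lagged Fibonacci generator, the recurrence x[n] = (x[n-j] + x[n-k]) mod 2^m can be sketched in software as below. This is a sequential illustration, not the FPGA design evaluated in the study; the lags (55, 24), the 32-bit word size and the seeding through Python's `random` module are assumptions chosen for the example.

```python
from collections import deque
import random


class ALFG:
    """Additive lagged Fibonacci generator: x[n] = x[n-55] + x[n-24] mod 2^32."""

    def __init__(self, seed, long_lag=55, short_lag=24, bits=32):
        rng = random.Random(seed)          # used only to fill the initial state
        self.mask = (1 << bits) - 1
        state = [rng.getrandbits(bits) for _ in range(long_lag)]
        state[0] |= 1                      # ensure at least one odd seed word
        self.state = deque(state, maxlen=long_lag)
        self.short_lag = short_lag

    def next(self):
        # state[0] holds x[n-long_lag]; state[-short_lag] holds x[n-short_lag].
        value = (self.state[0] + self.state[-self.short_lag]) & self.mask
        self.state.append(value)           # maxlen drops the oldest word
        return value


gen = ALFG(seed=7)
print([gen.next() for _ in range(3)])
```

Independent streams for parallel simulations are often obtained by giving each stream its own initial state; how the FPGA implementation parallelizes the generator is not described in the abstract.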
New NAS Parallel Benchmarks Results

NASA Technical Reports Server (NTRS)

Yarrow, Maurice; Saphir, William; VanderWijngaart, Rob; Woo, Alex; Kutler, Paul (Technical Monitor)

1997-01-01

NPB2 (NAS (NASA Advanced Supercomputing) Parallel Benchmarks 2) is an implementation, based on Fortran and the MPI (message passing interface) message passing standard, of the original NAS Parallel Benchmark specifications. NPB2 programs are run with little or no tuning, in contrast to NPB vendor implementations, which are highly optimized for specific architectures. NPB2 results complement, rather than replace, NPB results. Because they have not been optimized by vendors, NPB2 implementations approximate the performance a typical user can expect for a portable parallel program on distributed memory parallel computers. Together these results provide an insightful comparison of the real-world performance of high-performance computers. New NPB2 features: New implementation (CG), new workstation class problem sizes, new serial sample versions, more performance statistics.
FLY MPI-2: a parallel tree code for LSS

NASA Astrophysics Data System (ADS)

Becciani, U.; Comparato, M.; Antonuccio-Delogu, V.

2006-04-01

New version program summary
Program title: FLY 3.1
Catalogue identifier: ADSC_v2_0
Licensing provisions: yes
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0
Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland
No. of lines in distributed program, including test data, etc.: 158 172
No. of bytes in distributed program, including test data, etc.: 4 719 953
Distribution format: tar.gz
Programming language: Fortran 90, C
Computer: Beowulf cluster, PC, MPP systems
Operating system: Linux, Aix
RAM: 100M words
Catalogue identifier of previous version: ADSC_v1_0
Journal reference of previous version: Comput. Phys. Comm. 155 (2003) 159
Does the new version supersede the previous version?: yes
Nature of problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force.
Solution method: FLY is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986).
Reasons for the new version: The new version of FLY is implemented by using the MPI-2 standard: the distributed version 3.1 was developed by using the MPICH2 library on a PC Linux cluster. Today the FLY performance allows us to consider the FLY code among the most powerful parallel codes for tree N-body simulations. Another important new feature regards the availability of an interface with hydrodynamical Paramesh based codes. Simulations must follow a box large enough to accurately represent the power spectrum of fluctuations on very large scales so that we may hope to compare them meaningfully with real data. The number of particles then sets the mass resolution of the simulation, which we would like to make as fine as possible. The idea to build an interface between two codes, that have different and complementary cosmological tasks, allows us to execute complex cosmological simulations with FLY, specialized for DM evolution, and a code specialized for hydrodynamical components that uses a Paramesh block structure.
Summary of revisions: The parallel communication schema was totally changed. The new version adopts the MPICH2 library. Now FLY can be executed on all Unix systems having an MPI-2 standard library. The main data structure is declared in a module procedure of FLY (fly_h.F90 routine). FLY creates the MPI Window object for one-sided communication for all the shared arrays, with a call like the following: CALL MPI_WIN_CREATE(POS, SIZE, REAL8, MPI_INFO_NULL, MPI_COMM_WORLD, WIN_POS, IERR). The following main window objects are created: win_pos, win_vel, win_acc: particles positions velocities and accelerations; win_pos_cell, win_mass_cell, win_quad, win_subp, win_grouping: cells positions, masses, quadrupole momenta, tree structure and grouping cells. Other windows are created for dynamic load balance and global counters.
Restrictions: The program uses the leapfrog integrator schema, but could be changed by the user.
Unusual features: FLY uses the MPI-2 standard: the MPICH2 library on Linux systems was adopted. To run this version of FLY the working directory must be shared among all the processors that execute FLY.
Additional comments: Full documentation for the program is included in the distribution in the form of a README file, a User Guide and a Reference manuscript.
Running time: IBM Linux Cluster 1350, 512 nodes with 2 processors for each node and 2 GB RAM for each processor, at Cineca, was adopted to make performance tests. Processor type: Intel Xeon Pentium IV 3.0 GHz and 512 KB cache (128 nodes have Nocona processors). Internal Network: Myricom LAN Card "C" Version and "D" Version. Operating System: Linux SuSE SLES 8. The code was compiled using the mpif90 compiler version 8.1 and with basic optimization options in order to have performances that could be useful compared with other generic clusters Processors
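The program summary states that FLY uses a leapfrog integrator. The scheme can be sketched in its kick-drift-kick form as below; this is a generic one-dimensional illustration and not FLY's Fortran code. The harmonic test force, the time step and the step count are assumptions made for the example.

```python
import math


def leapfrog(pos, vel, accel, dt, n_steps):
    """Kick-drift-kick leapfrog integration of dx/dt = v, dv/dt = accel(x)."""
    a = accel(pos)
    for _ in range(n_steps):
        vel += 0.5 * dt * a      # half kick with the old acceleration
        pos += dt * vel          # full drift
        a = accel(pos)           # acceleration at the new position
        vel += 0.5 * dt * a      # half kick with the new acceleration
    return pos, vel


# Test problem (assumed): unit harmonic oscillator, exact solution x = cos(t).
x, v = leapfrog(1.0, 0.0, lambda q: -q, dt=0.01, n_steps=1000)
energy = 0.5 * (v * v + x * x)
print(abs(x - math.cos(10.0)), abs(energy - 0.5))
```

In an N-body code the `accel` callback would be replaced by the tree-based force calculation, which is evaluated once per step in this form of the scheme.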
A new conformal absorbing boundary condition for finite element meshes and parallelization of FEMATS

NASA Technical Reports Server (NTRS)

Chatterjee, A.; Volakis, J. L.; Nguyen, J.; Nurnberger, M.; Ross, D.

1993-01-01

Some of the progress toward the development and parallelization of an improved version of the finite element code FEMATS is described. This is a finite element code for computing the scattering by arbitrarily shaped three dimensional surfaces composite scatterers. The following tasks were worked on during the report period: (1) new absorbing boundary conditions (ABC's) for truncating the finite element mesh; (2) mixed mesh termination schemes; (3) hierarchical elements and multigridding; (4) parallelization; and (5) various modeling enhancements (antenna feeds, anisotropy, and higher order GIBC).
Implementation of a 3D mixing layer code on parallel computers

NASA Technical Reports Server (NTRS)

Roe, K.; Thakur, R.; Dang, T.; Bogucz, E.

1995-01-01

This paper summarizes our progress and experience in the development of a Computational-Fluid-Dynamics code on parallel computers to simulate three-dimensional spatially-developing mixing layers. In this initial study, the three-dimensional time-dependent Euler equations are solved using a finite-volume explicit time-marching algorithm. The code was first programmed in Fortran 77 for sequential computers. The code was then converted for use on parallel computers using the conventional message-passing technique, while we have not been able to compile the code with the present version of HPF compilers.
HPC Institutional Computing Project: W15_lesreactiveflow KIVA-hpFE Development: A Robust and Accurate Engine Modeling Software

DOE Office of Scientific and Technical Information (OSTI.GOV)

Carrington, David Bradley; Waters, Jiajia

KIVA-hpFE is high-performance computer software for solving the physics of multi-species and multiphase turbulent reactive flow in complex geometries having immersed moving parts. The code is written in Fortran 90/95 and can be used on any computer platform with any popular compiler. The code is in two versions, a serial version and a parallel version utilizing MPICH2 type Message Passing Interface (MPI or Intel MPI) for solving distributed domains. The parallel version is at least 30x faster than the serial version and much faster than our previous generation of parallel engine modeling software, by many factors. The 5th generation algorithm construction is a Galerkin type Finite Element Method (FEM) solving conservative momentum, species, and energy transport equations along with a two-equation k-ω Reynolds Averaged Navier-Stokes (RANS) turbulence model and a Vreman type dynamic Large Eddy Simulation (LES) method. The LES method is capable of modeling transitional flow from laminar to fully turbulent; therefore, this LES method does not require special hybrid or blending to walls. The FEM projection method also uses a Petrov-Galerkin (P-G) stabilization along with pressure stabilization. We employ hierarchical basis sets, constructed on the fly with enrichment in areas associated with relatively larger error as determined by error estimation methods. In addition, when not using the hp-adaptive module, the code employs Lagrangian basis or shape functions. The shape functions are constructed for hexahedral, prismatic and tetrahedral elements. The software is designed to solve many types of reactive flow problems, from burners to internal combustion engines and turbines. In addition, the formulation allows for direct integration of solid bodies (conjugate heat transfer), as in heat transfer through housings, parts, cylinders. 
It can also easily be extended to stress modeling of solids, used in fluid-structure interaction problems, solidification, porous media modeling and magnetohydrodynamics.
The mathematical statement for the solving of the problem of N-version software system design

NASA Astrophysics Data System (ADS)

Kovalev, I. V.; Kovalev, D. I.; Zelenkov, P. V.; Voroshilova, A. A.

2015-10-01

N-version programming, as a methodology for designing fault-tolerant software systems, makes it possible to solve the mentioned tasks successfully. The N-version programming approach turns out to be effective because the system is constructed out of several versions of some software module that are executed in parallel. Those versions are written to meet the same specification but by different programmers. Developing an optimal structure for an N-version software system is a very complex optimization problem, which makes deterministic optimization methods inappropriate for solving it. In this view, exploiting heuristic strategies looks more rational. In the field of pseudo-Boolean optimization theory, the so-called method of varied probabilities (MVP) has been developed to solve problems with a large dimensionality.
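The principle of running several independently written versions against the same specification can be sketched as follows. Majority voting is used here as the decision step; that choice is an assumption for illustration, as are the three toy versions, and the sketch calls the versions one after another rather than in parallel.

```python
from collections import Counter


def n_version_execute(versions, x):
    """Run every version on the same input and return the majority output."""
    outputs = [version(x) for version in versions]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no majority among version outputs")
    return value


# Three hypothetical versions of the same specification: return x squared.
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x * 2          # deliberately faulty version


print(n_version_execute([version_a, version_b, version_c], 4))
```

The faulty version is outvoted whenever the other two agree, which is why the number and composition of versions becomes an optimization problem in its own right.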
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF

NASA Astrophysics Data System (ADS)

Mielikainen, J.; Huang, B.; Huang, A.

2011-12-01

The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size distributions, fall speeds and densities. The Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language that allows GPUs to be programmed directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of the CPU. CUDA takes a bottom-up point of view of parallelism in which a thread is the atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for a 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. 
Each grid point has multiple levels, which correspond to various vertical heights in the atmosphere. The size of the CONUS 12 km domain is 433 x 308 horizontal grid points with 35 vertical levels. First, the entire SBU-YLIN Fortran code was rewritten in C in preparation for a GPU-accelerated version. After that, the C code was verified against the Fortran code for identical outputs. Default compiler options from WRF were used for the gfortran and gcc compilers. The processing time is 12274 ms for the original Fortran code and 12893 ms for the C version. The processing times for the GPU implementation of the SBU-YLIN microphysics scheme with I/O are 57.7 ms and 37.2 ms for 1 and 2 GPUs, respectively. The corresponding speedups are 213x and 330x compared to the Fortran implementation. Without I/O the speedup is 896x on 1 GPU. Ignoring I/O time, the speedup scales linearly with the number of GPUs. Thus, 2 GPUs have a speedup of 1788x without I/O. Microphysics computation is just a small part of the whole WRF model. Once WRF has been completely implemented on the GPU, the inputs for SBU-YLIN will not have to be transferred from the CPU. Instead they will be the results of previous WRF modules. Therefore, the role of I/O is greatly diminished once all of WRF has been converted to run on GPUs. In the near future, we expect to have WRF running completely on GPUs for superior performance.
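The speedup figures in the abstract above can be checked with a few lines of arithmetic. The sketch below uses only the timings and speedups reported there; the kernel-only and I/O times it derives are inferences from those numbers, not figures reported by the authors.

```python
# Timings reported in the abstract (milliseconds).
FORTRAN_MS = 12274.0
GPU_WITH_IO_MS = {1: 57.7, 2: 37.2}
# Speedups without I/O, relative to the Fortran code, as reported.
SPEEDUP_NO_IO = {1: 896.0, 2: 1788.0}


def speedup(baseline_ms, accelerated_ms):
    """Speedup is the ratio of baseline time to accelerated time."""
    return baseline_ms / accelerated_ms


for gpus in (1, 2):
    with_io = speedup(FORTRAN_MS, GPU_WITH_IO_MS[gpus])
    # Kernel-only time implied by the reported no-I/O speedup (an inference).
    kernel_ms = FORTRAN_MS / SPEEDUP_NO_IO[gpus]
    # Remainder attributed to host-device transfers (also an inference).
    io_ms = GPU_WITH_IO_MS[gpus] - kernel_ms
    print(f"{gpus} GPU(s): {with_io:.0f}x with I/O, "
          f"kernel ~{kernel_ms:.1f} ms, transfers ~{io_ms:.1f} ms")
```

Under these assumptions most of the GPU wall time goes to data transfer rather than computation, which is consistent with the abstract's point that I/O matters much less once neighbouring WRF modules also run on the GPU.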
XDATA

DTIC Science & Technology

2017-05-01

Parallelizing PINT The main focus of our research into the parallelization of the PINT algorithm has been to find appropriately scalable matrix math algorithms...leading eigenvector of the adjacency matrix of the pairwise affinity graph. We reviewed the matrix math implementation currently being used in PINT and...the new versions support a feature called matrix.distributed, which is some level of support for distributed matrix math ; however our code is not
Adding Resistances and Capacitances in Introductory Electricity

NASA Astrophysics Data System (ADS)

Efthimiou, C. J.; Llewellyn, R. A.

2005-09-01

All introductory physics textbooks, with or without calculus, cover the addition of both resistances and capacitances in series and in parallel as discrete summations. However, none includes problems that involve continuous versions of resistors in parallel or capacitors in series. This paper introduces a method for solving the continuous problems that is logical, straightforward, and within the mathematical preparation of students at the introductory level.
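As an illustration of what a continuous version of resistors in parallel can look like, the sketch below treats a hollow cylinder carrying current along its axis as a stack of thin coaxial shells whose conductances add. The geometry, the resistivity profile rho(r) = RHO0 * r / A and all numerical values are hypothetical choices for this example and are not taken from the paper.

```python
import math


def parallel_shell_conductance(rho, length, a, b, n=10000):
    """Midpoint-rule sum of the conductances of n thin coaxial shells.

    Each shell of radius r and thickness dr has cross-section 2*pi*r*dr,
    so its conductance is dG = 2*pi*r*dr / (rho(r) * length).
    Shells are in parallel, so their conductances add.
    """
    dr = (b - a) / n
    total = 0.0
    for i in range(n):
        r = a + (i + 0.5) * dr
        total += 2.0 * math.pi * r * dr / (rho(r) * length)
    return total


# Hypothetical values: resistivity scale (ohm m), length and radii (m).
RHO0, LENGTH, A, B = 1.7e-8, 1.0, 0.01, 0.02


def rho(r):
    """Assumed resistivity profile, growing linearly with radius."""
    return RHO0 * r / A


G_numeric = parallel_shell_conductance(rho, LENGTH, A, B)
# Closed form for this profile: G = 2*pi*A*(B - A) / (RHO0 * LENGTH).
G_exact = 2.0 * math.pi * A * (B - A) / (RHO0 * LENGTH)
R_total = 1.0 / G_numeric
print(G_numeric, G_exact, R_total)
```

The discrete rule 1/R = sum of 1/R_i becomes an integral over conductance elements, which is the kind of step from summation to integration the abstract refers to.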
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

NASA Astrophysics Data System (ADS)

Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin

2016-08-01

This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used to fill the occupancy of each GPU with many replicas, providing a performance boost that is most noticeable at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that the spin-level implementation is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance is favorable under a weak scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32, 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
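The mid-point insertion strategy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the fixed threshold, and the round-robin balancing are hypothetical and are not the authors' implementation.

```python
def insert_midpoints(temps, exchange_rates, threshold):
    """Return a refined temperature list.

    temps: sorted temperatures, length n.
    exchange_rates: measured swap acceptance rate for each adjacent
        pair of temperatures, length n - 1.
    A midpoint is inserted in every gap whose rate is below `threshold`
    (an assumed criterion for a "compromised" exchange rate).
    """
    if len(exchange_rates) != len(temps) - 1:
        raise ValueError("need one exchange rate per adjacent pair")
    refined = [temps[0]]
    for low, high, rate in zip(temps, temps[1:], exchange_rates):
        if rate < threshold:
            refined.append(0.5 * (low + high))
        refined.append(high)
    return refined


def balance_round_robin(replicas, n_gpus):
    """Deal replicas out to GPUs in turn (assumed balancing policy),
    so the extra work does not pile up where the insertions happened."""
    return [replicas[g::n_gpus] for g in range(n_gpus)]


temps = insert_midpoints([1.0, 2.0, 4.0], [0.5, 0.05], threshold=0.2)
print(temps)
print(balance_round_robin(temps, n_gpus=2))
```

In an adaptive run this refinement would be repeated after re-measuring the exchange rates, until no gap falls below the chosen threshold.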
Fully Parallel MHD Stability Analysis Tool

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2014-10-01

Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both the fluid and kinetic plasma models already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is achieved by repeating the steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.
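For readers unfamiliar with inverse iteration, the serial sketch below shows the basic idea on a small dense matrix: repeatedly solve (A - shift*I) w = v and normalize, which converges to the eigenvector whose eigenvalue is closest to the shift. It is a textbook illustration for a standard eigenvalue problem, not the MARS algorithm, and the helper names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def inverse_iteration(a, shift, iterations=50):
    """Return (eigenvalue, unit eigenvector) nearest to `shift`.

    The shift must not equal an eigenvalue exactly, otherwise the
    shifted matrix is singular.
    """
    n = len(a)
    shifted = [[a[i][j] - (shift if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    v = [1.0] + [0.0] * (n - 1)  # simple starting vector
    for _ in range(iterations):
        w = solve(shifted, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v (v has unit norm).
    av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(x * y for x, y in zip(v, av)), v


eigenvalue, vector = inverse_iteration([[2.0, 1.0], [1.0, 2.0]], shift=0.9)
print(eigenvalue)
```

The linear solve inside the loop dominates the cost, which is why a parallel version of such a scheme would typically hand that step to parallel linear algebra libraries.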
GPU-accelerated adjoint algorithmic differentiation

NASA Astrophysics Data System (ADS)

Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe

2016-03-01

Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
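The role of the tape in adjoint algorithmic differentiation can be illustrated with a minimal scalar sketch. The class and function names below are hypothetical and the code is a naive, serial illustration; it is not the authors' software and does not use vectorized or GPU operations.

```python
class Tape:
    """Stores, per recorded value, the partials with respect to its inputs."""

    def __init__(self):
        self.entries = []  # entries[i] = list of (input_index, partial)

    def new_var(self, value, parents=()):
        self.entries.append(list(parents))
        return Var(self, len(self.entries) - 1, value)


class Var:
    """A value whose operations are recorded on a tape."""

    def __init__(self, tape, index, value):
        self.tape, self.index, self.value = tape, index, value

    def __add__(self, other):
        return self.tape.new_var(
            self.value + other.value,
            [(self.index, 1.0), (other.index, 1.0)])

    def __mul__(self, other):
        return self.tape.new_var(
            self.value * other.value,
            [(self.index, other.value), (other.index, self.value)])


def gradient(tape, output):
    """Reverse sweep over the tape, accumulating adjoints."""
    adjoints = [0.0] * len(tape.entries)
    adjoints[output.index] = 1.0
    for i in range(len(tape.entries) - 1, -1, -1):
        for parent, partial in tape.entries[i]:
            adjoints[parent] += adjoints[i] * partial
    return adjoints


tape = Tape()
x = tape.new_var(3.0)
y = tape.new_var(4.0)
f = x * y + x * x  # f = x*y + x^2
adjoints = gradient(tape, f)
print(adjoints[x.index], adjoints[y.index])  # df/dx = y + 2x, df/dy = x
```

Every scalar operation adds one tape entry, so the tape grows with the operation count. Recording a whole vector or matrix operation as a single entry, as the abstract describes, is one way to keep that growth in check.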
GPU-Accelerated Adjoint Algorithmic Differentiation.

PubMed

Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe

2016-03-01

Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
Apollo: An Automatic Procedure to Forecast Transport and Deposition of Tephra

NASA Astrophysics Data System (ADS)

Folch, A.; Costa, A.; Macedonio, G.

2007-05-01

Volcanic ash fallout represents a serious threat to communities around active volcanoes. Reliable short-term predictions constitute valuable support for mitigating the effects of fallout on the surrounding area during an episode of crisis. We present a platform-independent automatic procedure for daily forecasting of volcanic ash dispersal. The procedure builds on a series of programs and interfaces that allow an automatic data/results flow. Firstly, the procedure downloads mesoscale meteorological forecasts for the region and period of interest, filters and converts data from its native format (typically GRIB format files), and sets up the CALMET diagnostic meteorological model to obtain hourly wind field and micro-meteorological variables on a finer mesh. Secondly, a 1-D version of the buoyant plume equations assesses the distribution of mass along the eruptive column depending on the obtained wind field and on the conditions at the vent (granulometry, mass flow rate, etc.). All these data are used as input for the ash dispersion model(s). Any model able to handle the physical complexity and coupling processes with adequate solving times can be plugged into the system by means of an interface. Currently, the procedure contains the models HAZMAP, TEPHRA and FALL3D, the latter in both serial and parallel versions. Parallelization of FALL3D is done at two levels, one for particle classes and one for the spatial domain. The last step is to post-process the model outcomes to end up with homogeneous maps written to portable-format files. Maps plot relevant quantities such as predicted ground load, expected deposit thickness or visual and flight safety concentration thresholds. Several applications are shown as examples.
GPU-Accelerated Adjoint Algorithmic Differentiation

PubMed Central

Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe

2015-01-01

Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the “tape”. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography. 
PMID:26941443
Comparison of Alternate and Original Items on the Montreal Cognitive Assessment

PubMed Central

Lebedeva, Elena; Huang, Mei; Koski, Lisa

2016-01-01

Background The Montreal Cognitive Assessment (MoCA) is a screening tool for mild cognitive impairment (MCI) in elderly individuals. We hypothesized that measurement error when using the new alternate MoCA versions to monitor change over time could be related to the use of items that are not of comparable difficulty to their corresponding originals of similar content. The objective of this study was to compare the difficulty of the alternate MoCA items to the original ones. Methods Five selected items from alternate versions of the MoCA were included with items from the original MoCA administered adaptively to geriatric outpatients (N = 78). Rasch analysis was used to estimate the difficulty level of the items. Results None of the five items from the alternate versions matched the difficulty level of their corresponding original items. Conclusions This study demonstrates the potential benefits of a Rasch analysis-based approach for selecting items during the process of development of parallel forms. The results suggest that better match of the items from different MoCA forms by their difficulty would result in higher sensitivity to changes in cognitive function over time. PMID:27076861
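The notion of item difficulty used above can be made concrete with the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability and item difficulty, both in logits. The sketch below is illustrative only: the difficulty values are hypothetical, and the study may have used a different Rasch variant for items scored on more than two levels.

```python
import math


def rasch_probability(ability, difficulty):
    """P(correct) under the dichotomous Rasch model (arguments in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical difficulties for an original item and its alternate form.
ORIGINAL_DIFFICULTY = -0.5
ALTERNATE_DIFFICULTY = 0.5

for ability in (-1.0, 0.0, 1.0):
    p_original = rasch_probability(ability, ORIGINAL_DIFFICULTY)
    p_alternate = rasch_probability(ability, ALTERNATE_DIFFICULTY)
    print(f"ability {ability:+.1f}: original {p_original:.3f}, "
          f"alternate {p_alternate:.3f}")
```

If two forms differ in difficulty like this, the same person scores differently on them without any change in ability, which is the measurement error the study is concerned with.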
Static analysis of the hull plate using the finite element method

NASA Astrophysics Data System (ADS)

Ion, A.

2015-11-01

This paper aims at presenting the static analysis for two levels of a container ship's construction as follows: the first level is at the girder / hull plate and the second level is conducted at the entire strength hull of the vessel. This article will describe the work for the static analysis of a hull plate. We shall use the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 CPU processors at 3.33 GHz, 32 GB memory installed. In terms of software, the shared memory parallel version of ANSYS refers to running ANSYS across multiple cores on a SMP system. The distributed memory parallel version of ANSYS (Distributed ANSYS) refers to running ANSYS across multiple processors on SMP systems or DMP systems.

Astrophysical data mining with GPU. A case study: Genetic classification of globular clusters

NASA Astrophysics Data System (ADS)

Cavuoti, S.; Garofalo, M.; Brescia, M.; Paolillo, M.; Pescape', A.; Longo, G.; Ventre, G.

2014-01-01

We present a multi-purpose genetic algorithm, designed and implemented with GPGPU/CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource, http://dame.dsf.unina.it/beta_info.html), a public data mining service specialized in massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of 200× in the training phase with respect to the CPU-based version.
Pteros 2.0: Evolution of the fast parallel molecular analysis library for C++ and python.

PubMed

Yesylevskyy, Semen O

2015-07-15

Pteros is a high-performance open-source library for molecular modeling and analysis of molecular dynamics trajectories. Starting from version 2.0, Pteros is available for the C++ and Python programming languages with very similar interfaces. This makes it suitable for writing complex reusable programs in C++ and simple interactive scripts in Python alike. The new version improves the facilities for asynchronous trajectory reading and parallel execution of analysis tasks by introducing analysis plugins, which can be written in either C++ or Python in a completely uniform way. The high level of abstraction provided by analysis plugins greatly simplifies prototyping and implementation of complex analysis algorithms. Pteros is available for free under the Artistic License from http://sourceforge.net/projects/pteros/. © 2015 Wiley Periodicals, Inc.
Geospatial Applications on Different Parallel and Distributed Systems in enviroGRIDS Project

NASA Astrophysics Data System (ADS)

Rodila, D.; Bacu, V.; Gorgan, D.

2012-04-01

The execution of Earth Science applications and services on parallel and distributed systems has become a necessity, especially due to the large amounts of Geospatial data these applications require and the large geographical areas they cover. The parallelization of these applications solves important performance issues and can range from task parallelism to data parallelism. Parallel and distributed architectures such as Grid, Cloud, Multicore, etc. seem to offer the necessary functionalities to solve important problems in the Earth Science domain: storing, distribution, management, processing and security of Geospatial data, execution of complex processing through task and data parallelism, etc. A main goal of the FP7-funded project enviroGRIDS (Black Sea Catchment Observation and Assessment System supporting Sustainable Development) [1] is the development of a Spatial Data Infrastructure targeting this catchment region, but also the development of standardized and specialized tools for storing, analyzing, processing and visualizing the Geospatial data concerning this area. To achieve these objectives, enviroGRIDS deals with the execution of different Earth Science applications, such as hydrological models, Geospatial Web services standardized by the Open Geospatial Consortium (OGC) and others, on parallel and distributed architectures to maximize the obtained performance. This presentation analyzes the integration and execution of Geospatial applications on different parallel and distributed architectures and the possibility of choosing among these architectures based on application characteristics and user requirements through a specialized component. Versions of the proposed platform have been used in the enviroGRIDS project on different use cases such as the execution of Geospatial Web services on both Web and Grid infrastructures [2] and the execution of SWAT hydrological models on both Grid and Multicore architectures [3].
The current focus is to integrate the Cloud infrastructure into the proposed platform; Cloud is still a paradigm with critical problems to be solved despite the great efforts and investments. Cloud computing comes as a new way of delivering resources while using a large set of old as well as new technologies and tools for providing the necessary functionalities. The main challenges in Cloud computing, most of them also identified in the Open Cloud Manifesto 2009, address resource management and monitoring, data and application interoperability and portability, security, scalability, software licensing, etc. We propose a platform able to execute different Geospatial applications on different parallel and distributed architectures such as Grid, Cloud, Multicore, etc. with the possibility of choosing among these architectures based on application characteristics and complexity, user requirements, necessary performance, cost support, etc. The redirection of execution to a selected architecture is realized through a specialized component and is intended to offer a flexible way of achieving the best performance given the existing restrictions.
High-throughput sequence alignment using Graphics Processing Units

PubMed Central

Schatz, Michael C; Trapnell, Cole; Delcher, Arthur L; Varshney, Amitabh

2007-01-01

Background The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU. PMID:18070356
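The idea of matching many independent queries against one indexed reference can be sketched serially. Note the substitutions: the example below uses a suffix array with binary search rather than a suffix tree, finds only whole-query exact occurrences rather than MUMmer's maximal matches, and runs on the CPU. All names are hypothetical.

```python
def build_suffix_array(reference):
    """Sorted start positions of all suffixes.

    Simple O(n^2 log n) construction, adequate for a small demo.
    """
    return sorted(range(len(reference)), key=lambda i: reference[i:])


def find_exact_matches(reference, suffix_array, query):
    """Return sorted positions where `query` occurs in `reference`."""
    k = len(query)

    def prefix(rank):
        start = suffix_array[rank]
        return reference[start:start + k]

    # Lower bound: first suffix whose k-prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose k-prefix is > query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


reference = "GATTACAGATTACA"
index = build_suffix_array(reference)
queries = ["ATTA", "ACA", "CCC"]
# Each query is independent of the others, which is the property
# that makes this workload suitable for data-parallel hardware.
results = {q: find_exact_matches(reference, index, q) for q in queries}
print(results)
```

Because the index is read-only and each query touches it independently, the per-query loop could be distributed over threads without synchronization.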
Performance of OVERFLOW-D Applications based on Hybrid and MPI Paradigms on IBM Power4 System

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Biegel, Bryan (Technical Monitor)

2002-01-01

This report briefly discusses our preliminary performance experiments with parallel versions of OVERFLOW-D applications. These applications are based on MPI and hybrid paradigms on the IBM Power4 system here at the NAS Division. This work is part of an effort to determine the suitability of the system and its parallel libraries (MPI/OpenMP) for specific scientific computing objectives.
Xyce Parallel Electronic Simulator : reference guide, version 2.0.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hoekstra, Robert John; Waters, Lon J.; Rankin, Eric Lamont

This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users' Guide. The focus of this document is to list, as exhaustively as possible, device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users' Guide.
Solving Navier-Stokes equations on a massively parallel processor; The 1 GFLOP performance

DOE Office of Scientific and Technical Information (OSTI.GOV)

Saati, A.; Biringen, S.; Farhat, C.

This paper reports on experience in solving large-scale fluid dynamics problems on the Connection Machine model CM-2. The authors have implemented a parallel version of the MacCormack scheme for the solution of the Navier-Stokes equations. By using triad floating point operations and reducing the number of interprocessor communications, they have achieved a sustained performance rate of 1.42 GFLOPS.
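The MacCormack scheme mentioned above is a two-step predictor-corrector method. The sketch below applies it to the 1D linear advection equation u_t + c u_x = 0 with periodic boundaries, a much simpler problem than the Navier-Stokes equations solved in the paper. Here `nu` denotes the Courant number c*dt/dx, and the function name is hypothetical.

```python
def maccormack_step(u, nu):
    """Advance u by one time step of the MacCormack scheme.

    Predictor uses a forward difference, corrector a backward
    difference on the predicted values; indices wrap periodically.
    """
    n = len(u)
    predicted = [u[i] - nu * (u[(i + 1) % n] - u[i]) for i in range(n)]
    return [
        0.5 * (u[i] + predicted[i] - nu * (predicted[i] - predicted[i - 1]))
        for i in range(n)
    ]


# With nu = 1 the scheme shifts the profile by exactly one cell.
pulse = [0.0, 1.0, 0.0, 0.0]
print(maccormack_step(pulse, 1.0))

# With nu = 0.5 the profile is smeared, but the total is conserved.
smooth = maccormack_step([0.0, 1.0, 2.0, 1.0, 0.0, 0.0], 0.5)
print(sum(smooth))
```

Each grid point needs only its immediate neighbours in both steps, so the update maps naturally onto one processor per grid point with nearest-neighbour communication.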
Xyce™ Parallel Electronic Simulator Reference Guide Version 6.8

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting

This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users' Guide. The focus of this document is to list, as exhaustively as possible, device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users' Guide.
Hybrid Parallel Contour Trees, Version 1.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sewell, Christopher; Fasel, Patricia; Carr, Hamish

A common operation in scientific visualization is to compute and render a contour of a data set. Given a function of the form f : R^d -> R, a level set is defined as an inverse image f^-1(h) for an isovalue h, and a contour is a single connected component of a level set. The Reeb graph can then be defined to be the result of contracting each contour to a single point, and is well defined for Euclidean spaces or for general manifolds. For simple domains, the graph is guaranteed to be a tree, and is called the contour tree. Analysis can then be performed on the contour tree in order to identify isovalues of particular interest, based on various metrics, and render the corresponding contours, without having to know such isovalues a priori. This code is intended to be the first data-parallel algorithm for computing contour trees. Our implementation will use the portable data-parallel primitives provided by Nvidia’s Thrust library, allowing us to compile our same code for both GPUs and multi-core CPUs. Native OpenMP and purely serial versions of the code will likely also be included. It will also be extended to provide a hybrid data-parallel / distributed algorithm, allowing scaling beyond a single GPU or CPU.
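To make the idea of tracking connected components across isovalues concrete, the sketch below performs a serial union-find sweep over a 1D sampled field, from high values to low, recording where superlevel-set components are born (local maxima) and where two of them merge. This is the classic serial building block for one half of a contour tree (the join tree), not the data-parallel algorithm the record describes. Names are hypothetical and sample values are assumed distinct.

```python
def join_events(values):
    """Return (maxima, merges) as lists of sample indices.

    Samples are swept in decreasing order of value. A sample with no
    already-swept neighbour starts a new component; a sample with two
    swept neighbours in different components merges them.
    """
    n = len(values)
    parent = [None] * n  # None means "not swept yet"

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    maxima, merges = [], []
    for i in sorted(range(n), key=lambda k: -values[k]):
        parent[i] = i
        roots = {find(j) for j in (i - 1, i + 1)
                 if 0 <= j < n and parent[j] is not None}
        if not roots:
            maxima.append(i)
        elif len(roots) == 2:
            merges.append(i)
        for root in roots:
            parent[root] = i  # the current sample becomes the new root
    return maxima, merges


maxima, merges = join_events([1.0, 3.0, 2.0, 5.0, 0.0])
print(maxima, merges)
```

The sorted sweep is inherently sequential, which hints at why a data-parallel formulation of contour tree computation requires a different approach.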
275 C Downhole Microcomputer System

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chris Hutchens; Hooi Miin Soo

2008-08-31

An HC11 controller IC, along with serial SRAM and ROM support ICs, was developed as a chip set to support data acquisition and control in extreme temperature/harsh environment conditions greater than 275 C. The 68HC11 microprocessor is widely used in well logging tools for control, data acquisition, and signal processing applications and was the logical choice for a downhole controller. This extreme temperature version of the 68HC11 enables new high temperature designs and additionally allows 68HC11-based well logging tools and MWD tools to be upgraded for high temperature operation in deep gas reservoirs. The microcomputer chip consists of the microprocessor ALU, a small boot ROM, 4 kbyte data RAM, counter/timer unit, serial peripheral interface (SPI), asynchronous serial interface (SCI), and the A, B, C, and D parallel ports. The chip is code compatible with the single chip mode commercial 68HC11 except for the absence of the analog to digital converter system. To avoid mask programmed internal ROM, a boot program is used to load the microcomputer program from an external mask SPI ROM. A SPI RAM IC completes the chip set and allows data RAM to be added in 4 kbyte increments. The HC11 controller IC chip set is implemented in the Peregrine Semiconductor 0.5 micron Silicon-on-Sapphire (SOS) process using a custom high temperature cell library developed at Oklahoma State University. Yield data is presented for all three ICs: the HC11, SPI-RAM and ROM. The lessons learned in this project were extended to the successful development of two high temperature versions of the LEON3 and a companion 8 Kbyte SRAM, a 200 C version for the Navy and a 275 C version for the gas industry.
Implementation of a 3D version of ponderomotive guiding center solver in particle-in-cell code OSIRIS

NASA Astrophysics Data System (ADS)

Helm, Anton; Vieira, Jorge; Silva, Luis; Fonseca, Ricardo

2016-10-01

Laser-driven accelerators have gained increased attention over the past decades. Typical modeling techniques for laser wakefield acceleration (LWFA) are based on particle-in-cell (PIC) simulations. PIC simulations, however, are very computationally expensive due to the disparity of the relevant scales, ranging from the laser wavelength, in the micrometer range, to the acceleration length, currently beyond the ten centimeter range. To minimize the gap between these disparate scales, the ponderomotive guiding center (PGC) algorithm is a promising approach. By describing the evolution of the laser pulse envelope separately, only the scales larger than the plasma wavelength are required to be resolved in the PGC algorithm, leading to speedups of several orders of magnitude. Previous work was limited to two dimensions. Here we present the implementation of the 3D version of a PGC solver into the massively parallel, fully relativistic PIC code OSIRIS. We extended the solver to include periodic boundary conditions and parallelization in all spatial dimensions. We present benchmarks for distributed and shared memory parallelization. We also discuss the stability of the PGC solver.
Enhancing Scalability and Efficiency of the TOUGH2_MP for LinuxClusters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Keni; Wu, Yu-Shu

2006-04-17

TOUGH2_MP, the parallel version of the TOUGH2 code, has been enhanced by implementing more efficient communication schemes. This enhancement is achieved by reducing the number of small-size messages and the volume of large messages. The message exchange speed is further improved by using non-blocking communications for both linear and nonlinear iterations. In addition, we have modified the AZTEC parallel linear-equation solver to use non-blocking communication. Through the improvement of code structuring and bug fixing, the new version of the code is now more stable, while demonstrating similar or even better nonlinear iteration convergence speed than the original TOUGH2 code. As a result, the new version of TOUGH2_MP is improved significantly in its efficiency. In this paper, the scalability and efficiency of the parallel code are demonstrated by solving two large-scale problems. The testing results indicate that speedup of the code may depend on both problem size and complexity. In general, the code has excellent scalability in memory requirement as well as computing time.
GPU-accelerated Tersoff potentials for massively parallel Molecular Dynamics simulations

NASA Astrophysics Data System (ADS)

Nguyen, Trung Dac

2017-03-01

The Tersoff potential is one of the empirical many-body potentials that have been widely used in simulation studies at atomic scales. Unlike pair-wise potentials, the Tersoff potential involves three-body terms, which require many more arithmetic operations and introduce more data dependencies. In this contribution, we have implemented a GPU-accelerated version of several variants of the Tersoff potential for LAMMPS, an open-source massively parallel Molecular Dynamics code. Compared to the existing MPI implementation in LAMMPS, the GPU implementation exhibits better scalability and offers a speedup of 2.2X when run on 1000 compute nodes on the Titan supercomputer. On a single node, the speedup ranges from 2.0 to 8.0 times, depending on the number of atoms per GPU and hardware configurations. The most notable features of our GPU-accelerated version include its design for MPI/accelerator heterogeneous parallelism, its compatibility with other functionalities in LAMMPS, its ability to give deterministic results and to support both NVIDIA CUDA- and OpenCL-enabled accelerators. Our implementation is now part of the GPU package in LAMMPS and accessible for public use.
NEQAIRv14.0 Release Notes: Nonequilibrium and Equilibrium Radiative Transport Spectra Program

NASA Technical Reports Server (NTRS)

Brandis, Aaron Michael; Cruden, Brett A.

2014-01-01

NEQAIR v14.0 is the first parallelized version of NEQAIR. Starting from the last version of the code that went through the internal software release process at NASA Ames (NEQAIR 2008), there have been significant updates to the physics in the code and the computational efficiency. NEQAIR v14.0 supersedes NEQAIR v13.2, v13.1 and the suite of NEQAIR2009 versions. These updates have predominantly been performed by Brett Cruden and Aaron Brandis from ERC Inc at NASA Ames Research Center in 2013 and 2014. A new naming convention is being adopted with this current release. The current and future versions of the code will be named NEQAIR vY.X. The Y will refer to a major release increment. Minor revisions and update releases will involve incrementing X. This is to keep NEQAIR more in line with common software release practices. NEQAIR v14.0 is a standalone software tool for line-by-line spectral computation of radiative intensities and/or radiative heat flux, with one-dimensional transport of radiation. In order to accomplish this, NEQAIR v14.0, as in previous versions, requires the specification of distances (in cm), temperatures (in K) and number densities (in parts/cc) of constituent species along lines of sight. Therefore, it is assumed that flow quantities have been extracted from flow fields computed using other tools, such as CFD codes like DPLR or LAURA, and that lines of sight have been constructed and written out in the format required by NEQAIR v14.0. There are two principal modes for running NEQAIR v14.0. In the first mode NEQAIR v14.0 is used as a tool for creating synthetic spectra of any desired resolution (including convolution with a specified instrument/slit function). The first mode is typically exercised in simulating/interpreting spectroscopic measurements of different sources (e.g. shock tube data, plasma torches, etc.). In the second mode, NEQAIR v14.0 is used as a radiative heat flux prediction tool for flight projects. 
Correspondingly, NEQAIR has also been used to simulate the radiance measured on previous flight missions. This report summarizes the database updates, corrections that have been made to the code, changes to input files, parallelization, the current usage recommendations, including test cases, and an indication of the performance enhancements achieved.
Deaf Students' Reading and Writing in College: Fluency, Coherence, and Comprehension.

PubMed

Albertini, John A; Marschark, Marc; Kincheloe, Pamela J

2016-07-01

Research in discourse reveals numerous cognitive connections between reading and writing. Rather than one being the inverse of the other, there are parallels and interactions between them. To understand the variables and possible connections in the reading and writing of adult deaf students, we manipulated writing conditions and reading texts. First, to test the hypothesis that a fluent writing process leads to richer content and a higher degree of coherence in a written summary, we interrupted the writing process with verbal and nonverbal intervening tasks. The negligible effect of the interference indicated that the stimuli texts were not equivalent in terms of coherence and revealed a relationship between coherence of the stimuli texts, amount of content recalled, and coherence of the written summaries. To test for a possible effect of coherence on reading comprehension, we manipulated the coherence of the texts. We found that students understood the more coherent versions of the passages better than the less coherent versions and were able to accurately distinguish between them. However, they were not able to judge comprehensibility. Implications for further research and classroom application are discussed. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
A neural-network-based approach to the double traveling salesman problem.

PubMed

Plebe, Alessio; Anile, Angelo Marcello

2002-02-01

The double traveling salesman problem is a variation of the basic traveling salesman problem where targets can be reached by two salespersons operating in parallel. The real problem addressed by this work concerns the optimization of the harvest sequence for the two independent arms of a fruit-harvesting robot. This application poses further constraints, like a collision-avoidance function. The proposed solution is based on a self-organizing map structure, initialized with as many artificial neurons as the number of targets to be reached. One of the key components of the process is the combination of competitive relaxation with a mechanism for deleting and creating artificial neurons. Moreover, in the competitive relaxation process, information about the trajectory connecting the neurons is combined with the distance of neurons from the target. This strategy prevents tangles in the trajectory and collisions between the two tours. Results of tests indicate that the proposed approach is efficient and reliable for harvest sequence planning. Moreover, the enhancements added to the pure self-organizing map concept are of wider importance, as proved by a traveling salesman problem version of the program, simplified from the double version for comparison.
A Two-dimensional Version of the Niblett-Bostick Transformation for Magnetotelluric Interpretations

NASA Astrophysics Data System (ADS)

Esparza, F.

2005-05-01

An imaging technique for two-dimensional magnetotelluric interpretations is developed following the well known Niblett-Bostick transformation for one-dimensional profiles. The algorithm uses a Hopfield artificial neural network to process series and parallel magnetotelluric impedances along with their analytical influence functions. The adaptive, weighted average approximation preserves part of the nonlinearity of the original problem. No initial model in the usual sense is required for the recovery of a functional model. Rather, the built-in relationship between model and data considers automatically, all at the same time, many half spaces whose electrical conductivities vary according to the data. The use of series and parallel impedances, a self-contained pair of invariants of the impedance tensor, avoids the need to decide on best angles of rotation for TE and TM separations. Field data from a given profile can thus be fed directly into the algorithm without much processing. The solutions offered by the Hopfield neural network correspond to spatial averages computed through rectangular windows that can be chosen at will. Applications of the algorithm to simple synthetic models and to the COPROD2 data set illustrate the performance of the approximation.
Evaluating the improvements of the BOLAM meteorological model operational at ISPRA: A case study approach - preliminary results

NASA Astrophysics Data System (ADS)

Mariani, S.; Casaioli, M.; Lastoria, B.; Accadia, C.; Flavoni, S.

2009-04-01

The Institute for Environmental Protection and Research - ISPRA (formerly the Agency for Environmental Protection and Technical Services - APAT) has operated since 2000 an integrated meteo-marine forecasting chain, named the Hydro-Meteo-Marine Forecasting System (Sistema Idro-Meteo-Mare - SIMM), formed by a cascade of four numerical models, telescoping from the Mediterranean basin to the Venice Lagoon, and initialized by means of analyses and forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF). The operational integrated system consists of a meteorological model, the parallel version of the BOlogna Limited Area Model (BOLAM), coupled over the Mediterranean sea with a WAve Model (WAM), a high-resolution shallow-water model of the Adriatic and Ionian Sea, namely the Princeton Ocean Model (POM), and a finite-element version of the same model (VL-FEM) on the Venice Lagoon, aimed at forecasting acqua alta events. Recently, the physically based, fully distributed, rainfall-runoff TOPographic Kinematic APproximation and Integration (TOPKAPI) model has been integrated into the system, coupled to BOLAM, over two river basins, located in the central and northeastern part of Italy, respectively. However, at the present time, this latter part of the forecasting chain is not operational and it is used in a research configuration. BOLAM was originally implemented in 2000 onto the Quadrics parallel supercomputer (and for this reason referred to as QBOLAM, as well) and only at the end of 2006 was it ported (together with the other operational marine models of the forecasting chain) onto the Silicon Graphics Inc. (SGI) Altix 8-processor machine. In particular, due to the Quadrics implementation, the Kuo scheme was formerly implemented into QBOLAM for the cumulus convection parameterization. By contrast, when porting SIMM onto the Altix Linux cluster, it became possible to implement in QBOLAM the more advanced convection parameterization by Kain and Fritsch. A fully updated serial version of the BOLAM code has been recently acquired. Code improvements include a more precise advection scheme (Weighted Average Flux), explicit advection of five hydrometeors, and state-of-the-art parameterization schemes for radiation, convection, boundary layer turbulence and soil processes (also with a possible choice among different available schemes). The operational implementation of the new code into the SIMM model chain, which requires the development of a parallel version, will be achieved during 2009. In view of this goal, the comparative verification of the different model versions' skill represents a fundamental task. To this end, it has been decided to evaluate the performance improvement of the new BOLAM code (in the available serial version, hereinafter BOLAM 2007) with respect to the version with the Kain-Fritsch scheme (hereinafter KF version) and to the older one employing the Kuo scheme (hereinafter Kuo version). In the present work, verification of precipitation forecasts from the three BOLAM versions is carried out in a case study approach. The intense rainfall episode that occurred over Italy on 10-17 December 2008 is considered. This event caused severe damage in Rome and its surrounding areas. Objective and subjective verification methods have been employed in order to evaluate model performance against an observational dataset including rain gauge observations and satellite imagery. Subjective comparison of observed and forecast precipitation fields is suitable to give an overall description of the forecast quality. Spatial errors (e.g., shifting and pattern errors) and rainfall volume error can be assessed quantitatively by means of object-oriented methods. By comparing satellite images with model forecast fields, it is possible to investigate the differences between the evolution of the observed weather system and the predicted ones, and its sensitivity to the improvements in the model code. Finally, the error in forecasting the cyclone evolution can be tentatively related to the precipitation forecast error.
High Performance Geostatistical Modeling of Biospheric Resources

NASA Astrophysics Data System (ADS)

Pedelty, J. A.; Morisette, J. T.; Smith, J. A.; Schnase, J. L.; Crosier, C. S.; Stohlgren, T. J.

2004-12-01

We are using parallel geostatistical codes to study spatial relationships among biospheric resources in several study areas. For example, spatial statistical models based on large- and small-scale variability have been used to predict species richness of both native and exotic plants (hot spots of diversity) and patterns of exotic plant invasion. However, broader use of geostatistics in natural resource modeling, especially at regional and national scales, has been limited due to the large computing requirements of these applications. To address this problem, we implemented parallel versions of the kriging spatial interpolation algorithm. The first uses the Message Passing Interface (MPI) in a master/slave paradigm on an open source Linux Beowulf cluster, while the second is implemented with the new proprietary Xgrid distributed processing system on an Xserve G5 cluster from Apple Computer, Inc. These techniques are proving effective and provide the basis for a national decision support capability for invasive species management that is being jointly developed by NASA and the US Geological Survey.
Status of parallel Python-based implementation of UEDGE

NASA Astrophysics Data System (ADS)

Umansky, M. V.; Pankin, A. Y.; Rognlien, T. D.; Dimits, A. M.; Friedman, A.; Joseph, I.

2017-10-01

The tokamak edge transport code UEDGE has long used the code-development and run-time framework Basis. However, with the support for Basis expected to terminate in the coming years, and with the advent of the modern numerical language Python, it has become desirable to move UEDGE to Python, to ensure its long-term viability. Our new Python-based UEDGE implementation takes advantage of the portable build system developed for FACETS. The new implementation gives access to Python's graphical libraries and numerical packages for pre- and post-processing, and support of HDF5 simplifies exchanging data. The older serial version of UEDGE has used the Newton-Krylov solver NKSOL for time-stepping. The renovated implementation uses backward Euler discretization with nonlinear solvers from PETSc, which promises to significantly improve the UEDGE parallel performance. We will report on assessment of some of the extended UEDGE capabilities emerging in the new implementation, and will discuss the future directions. Work performed for U.S. DOE by LLNL under contract DE-AC52-07NA27344.

Parallel processing a real code: A case history

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mandell, D.A.; Trease, H.E.

1988-01-01

A three-dimensional, time-dependent Free-Lagrange hydrodynamics code has been multitasked and autotasked on a Cray X-MP/416. The multitasking was done by using the Los Alamos Multitasking Control Library, which is a superset of the Cray multitasking library. Autotasking is done by using constructs which are only comment cards if the source code is not run through a preprocessor. The 3-D algorithm has presented a number of problems that simpler algorithms, such as 1-D hydrodynamics, did not exhibit. Problems in converting the serial code, originally written for a Cray 1, to a multitasking code are discussed. Autotasking of a rewritten version of the code is also discussed. Timing results for subroutines and hot spots in the serial code are presented and suggestions for additional tools and debugging aids are given. Theoretical speedup results obtained from Amdahl's law and actual speedup results obtained on a dedicated machine are presented. Suggestions for designing large parallel codes are given. 8 refs., 13 figs.
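The abstract above compares measured speedups with the theoretical limit from Amdahl's law. As a minimal sketch of that limit (not code from the report), the function below evaluates speedup = 1 / ((1 - p) + p / n), where p is the fraction of serial run time that can be parallelized and n is the number of processors. The function name and the sample fractions are illustrative assumptions; n = 4 is chosen on the understanding that the Cray X-MP/416 is a four-processor machine.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Theoretical speedup from Amdahl's law.

    parallel_fraction: share of the serial run time that can be parallelized (0..1).
    processors: number of processors the parallel part is spread over (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Illustrative fractions only; these are not values taken from the report.
    for fraction in (0.50, 0.90, 0.99):
        print(f"p = {fraction:.2f}: speedup on 4 processors = "
              f"{amdahl_speedup(fraction, 4):.2f}")
```

The serial fraction bounds the achievable speedup regardless of processor count, which is why timing the hot spots of the serial code, as the abstract describes, is a natural first step.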
Acceleration of Particles Near Earth's Bow Shock

NASA Astrophysics Data System (ADS)

Sandroos, A.

2012-12-01

Collisionless shock waves, for example, near planetary bodies or driven by coronal mass ejections, are a key source of energetic particles in the heliosphere. When the solar wind hits Earth's bow shock, some of the incident particles get reflected back towards the Sun and are accelerated in the process. Reflected ions are responsible for the creation of a turbulent foreshock in quasi-parallel regions of Earth's bow shock. We present first results of foreshock macroscopic structure and of particle distributions upstream of Earth's bow shock, obtained with a new 2.5-dimensional self-consistent diffusive shock acceleration model. In the model, particles' pitch angle scattering rates are calculated from Alfvén wave power spectra using quasilinear theory. Wave power spectra in turn are modified by particles' energy changes due to the scatterings. The new model has been implemented on the massively parallel simulation platform Corsair. We have used an earlier version of the model to study ion acceleration in a shock-shock interaction event (Hietala, Sandroos, and Vainio, 2012).
EvoGraph: On-The-Fly Efficient Mining of Evolving Graphs on GPU

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sengupta, Dipanjan; Song, Shuaiwen

With the prevalence of the World Wide Web and social networks, there has been a growing interest in high performance analytics for constantly-evolving dynamic graphs. Modern GPUs provide a massive amount of parallelism for efficient graph processing, but challenges remain due to their lack of support for the near real-time streaming nature of dynamic graphs. Specifically, due to the current high volume and velocity of graph data combined with the complexity of user queries, traditional processing methods that first store the updates and then repeatedly run static graph analytics on a sequence of versions or snapshots are deemed undesirable and computationally infeasible on GPU. We present EvoGraph, a highly efficient and scalable GPU-based dynamic graph analytics framework.
Visualization Software for VisIT Java Client

DOE Office of Scientific and Technical Information (OSTI.GOV)

Billings, Jay Jay; Smith, Robert W

The VisIT Java Client (JVC) library is a lightweight thin client that is designed and written purely in the native language of Java (the Python & JavaScript versions of the library use the same concept) and communicates with any new unmodified standalone version of VisIT, a high performance computing parallel visualization toolkit, over traditional or web sockets and dynamically determines capabilities of the running VisIT instance whether local or remote.
The seasonal-cycle climate model

NASA Technical Reports Server (NTRS)

Marx, L.; Randall, D. A.

1981-01-01

The seasonal cycle run which will become the control run for the comparison with runs utilizing codes and parameterizations developed by outside investigators is discussed. The climate model currently exists in two parallel versions: one running on the Amdahl and the other running on the CYBER 203. These two versions are as nearly identical as machine capability and the requirement for high speed performance will allow. Developmental changes are made on the Amdahl/CMS version for ease of testing and rapidity of turnaround. The changes are subsequently incorporated into the CYBER 203 version using vectorization techniques where speed improvement can be realized. The 400 day seasonal cycle run serves as a control run for both medium and long range climate forecasts and sensitivity studies.
Xyce parallel electronic simulator reference guide, Version 6.0.1.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

2014-01-01

This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users Guide [1]. The focus of this document is to list, as exhaustively as possible, device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users Guide [1].
Xyce parallel electronic simulator reference guide, version 6.0.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

2013-08-01

This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users Guide [1]. The focus of this document is to list, as exhaustively as possible, device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users Guide [1].
Capabilities of Fully Parallelized MHD Stability Code MARS

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2016-10-01

Results of full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. A parallel version of MARS, named PMARS, has recently been developed at FAR-TECH. Parallelized MARS is an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, implemented in MARS. Parallelization of the code included parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse vector iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the MARS algorithm using parallel libraries and procedures. Parallelized MARS is capable of calculating eigenmodes with significantly increased spatial resolution: up to 5,000 adapted radial grid points with up to 500 poloidal harmonics. Such resolution is sufficient for simulation of kink, tearing and peeling-ballooning instabilities with physically relevant parameters. Work is supported by the U.S. DOE SBIR program.
Fully Parallel MHD Stability Analysis Tool

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2015-11-01

Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Results of MARS parallelization and of the development of a new fixed-boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.
Implementation of a Message Passing Interface into a Cloud-Resolving Model for Massively Parallel Computing

NASA Technical Reports Server (NTRS)

Juang, Hann-Ming Henry; Tao, Wei-Kuo; Zeng, Xi-Ping; Shie, Chung-Lin; Simpson, Joanne; Lang, Steve

2004-01-01

The capability for massively parallel programming (MPP) using a message passing interface (MPI) has been implemented into a three-dimensional version of the Goddard Cumulus Ensemble (GCE) model. The design for the MPP with MPI uses the concept of maintaining similar code structure between the whole domain as well as the portions after decomposition. Hence the model follows the same integration for single and multiple tasks (CPUs). Also, it provides for minimal changes to the original code, so it is easily modified and/or managed by the model developers and users who have little knowledge of MPP. The entire model domain could be sliced into one- or two-dimensional decomposition with a halo regime, which is overlaid on partial domains. The halo regime requires that no data be fetched across tasks during the computational stage, but it must be updated before the next computational stage through data exchange via MPI. For reproducible purposes, transposing data among tasks is required for spectral transform (Fast Fourier Transform, FFT), which is used in the anelastic version of the model for solving the pressure equation. The performance of the MPI-implemented codes (i.e., the compressible and anelastic versions) was tested on three different computing platforms. The major results are: 1) both versions have speedups of about 99% up to 256 tasks but not for 512 tasks; 2) the anelastic version has better speedup and efficiency because it requires more computations than that of the compressible version; 3) equal or approximately-equal numbers of slices between the x- and y- directions provide the fastest integration due to fewer data exchanges; and 4) one-dimensional slices in the x-direction result in the slowest integration due to the need for more memory relocation for computation.
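The halo scheme described above can be illustrated with a small serial sketch. It is not the GCE code and does not use MPI: tasks are simulated as Python lists, the field is one-dimensional, the halo is one cell wide, the boundaries are periodic, and the "model" is a three-point average. All of these, along with every function name, are assumptions made for illustration. The point it shows is the one the abstract makes: each task computes only from local data, and halos are refreshed by exchange before the next computational stage, so the decomposed run reproduces the single-task result.

```python
def decompose(field, ntasks):
    """Split a 1D field into ntasks equal chunks, each padded with one halo cell per side."""
    n = len(field)
    if ntasks < 1 or n % ntasks != 0:
        raise ValueError("field length must be a multiple of ntasks")
    size = n // ntasks
    return [[0.0] + list(field[t * size:(t + 1) * size]) + [0.0]
            for t in range(ntasks)]


def exchange_halos(chunks):
    """Fill every halo cell from the neighbouring task's edge interior cell (periodic)."""
    ntasks = len(chunks)
    # Snapshot the edge values first, mimicking "send" before "receive".
    first_interior = [c[1] for c in chunks]
    last_interior = [c[-2] for c in chunks]
    for t, c in enumerate(chunks):
        c[0] = last_interior[(t - 1) % ntasks]    # halo on the left side
        c[-1] = first_interior[(t + 1) % ntasks]  # halo on the right side


def smooth_step(chunks):
    """Computational stage: three-point average using only task-local data."""
    for c in chunks:
        c[1:-1] = [(c[i - 1] + c[i] + c[i + 1]) / 3.0
                   for i in range(1, len(c) - 1)]


def gather(chunks):
    """Reassemble the global field from the interior cells of every chunk."""
    return [value for c in chunks for value in c[1:-1]]


def serial_smooth(field):
    """Reference single-task version of the same periodic three-point average."""
    n = len(field)
    return [(field[i - 1] + field[i] + field[(i + 1) % n]) / 3.0
            for i in range(n)]


if __name__ == "__main__":
    field = [float(i * i) for i in range(12)]
    chunks = decompose(field, 4)
    exchange_halos(chunks)   # update halos before the computational stage
    smooth_step(chunks)      # no data is fetched across tasks here
    print(gather(chunks) == serial_smooth(field))
```

In an MPI code the snapshot-and-copy in `exchange_halos` would be replaced by point-to-point messages between neighbouring ranks.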
MPI, HPF or OpenMP: A Study with the NAS Benchmarks

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Frumkin, Michael; Hribar, Michelle; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)

1999-01-01

Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but the task can be simplified by high level languages and would even better be automated by parallelizing tools and compilers. The definition of HPF (High Performance Fortran, based on data parallel model) and OpenMP (based on shared memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to languages like FORTRAN and simplify many tedious tasks encountered in writing message passing programs. In our study we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. Comparison of their performance with the MPI implementation and pros and cons of different approaches will be discussed along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study, the potential of applying some of the techniques to realistic aerospace applications will be presented.
MPI, HPF or OpenMP: A Study with the NAS Benchmarks

NASA Technical Reports Server (NTRS)

Jin, H.; Frumkin, M.; Hribar, M.; Waheed, A.; Yan, J.; Saini, Subhash (Technical Monitor)

1999-01-01

Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but this task can be simplified by high level languages and would even better be automated by parallelizing tools and compilers. The definition of HPF (High Performance Fortran, based on data parallel model) and OpenMP (based on shared memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to languages like FORTRAN and simplify many tedious tasks encountered in writing message passing programs. In our study, we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. Comparison of their performance with the MPI implementation and pros and cons of different approaches will be discussed along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study, the potential of applying some of the techniques to realistic aerospace applications will be presented.
Parallelization of PANDA discrete ordinates code using spatial decomposition

DOE Office of Scientific and Technical Information (OSTI.GOV)

Humbert, P.

2006-07-01

We present the parallel method, based on spatial domain decomposition, implemented in the 2D and 3D versions of the discrete ordinates code PANDA. The spatial mesh is orthogonal and the spatial domain decomposition is Cartesian. For 3D problems a 3D Cartesian domain topology is created and the parallel method is based on a domain diagonal plane ordered sweep algorithm. The parallel efficiency of the method is improved by pipelining directions and octants. The implementation of the algorithm is straightforward using MPI blocking point-to-point communications. The efficiency of the method is illustrated by an application to the 3D-Ext C5G7 benchmark of the OECD/NEA. (authors)
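A diagonal-plane ordered sweep over a Cartesian domain topology can be sketched as a generic wavefront ordering. The sketch below is not taken from PANDA: the function name is hypothetical, and it assumes a single octant sweeping in the +x, +y, +z direction. Domains whose indices satisfy i + j + k = d form diagonal plane d; each depends only on upstream neighbours in plane d - 1, so all domains in one plane can be processed concurrently.

```python
def diagonal_planes(nx, ny, nz):
    """Group the domains of an nx x ny x nz Cartesian topology by diagonal plane.

    Plane d holds every domain (i, j, k) with i + j + k == d. For a sweep in the
    +x, +y, +z direction, domains in the same plane are mutually independent.
    """
    if min(nx, ny, nz) < 1:
        raise ValueError("each dimension needs at least one domain")
    planes = [[] for _ in range(nx + ny + nz - 2)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                planes[i + j + k].append((i, j, k))
    return planes


if __name__ == "__main__":
    for step, plane in enumerate(diagonal_planes(2, 2, 2)):
        print(f"sweep step {step}: {plane}")
```

The first and last planes contain a single domain, so most processors are idle at the start and end of a sweep. That idle time is what pipelining directions and octants, as mentioned in the abstract, is meant to reduce.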
Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

NASA Technical Reports Server (NTRS)

Quealy, Angela; Cole, Gary L.; Blech, Richard A.

1993-01-01

The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.
Optimization technique for problems with an inequality constraint

NASA Technical Reports Server (NTRS)

Russell, K. J.

1972-01-01

The general technique uses a modified version of an existing technique termed the pattern search technique. A new procedure called the parallel move strategy permits the pattern search technique to be used with problems involving a constraint.
Xyce parallel electronic simulator reference guide, version 6.1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

2014-03-01

This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users Guide [1]. The focus of this document is to list, as exhaustively as possible, device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users Guide [1].
Visual Analysis of North Atlantic Hurricane Trends Using Parallel Coordinates and Statistical Techniques

DTIC Science & Technology

2008-07-07

analyzing multivariate data sets. The system was developed using the Java Development Kit (JDK) version 1.5; and it yields interactive performance on a... script and captures output from the MATLAB’s “regress” and “stepwisefit” utilities that perform simple and stepwise regression, respectively. The MATLAB...Statistical Association, vol. 85, no. 411, pp. 664–675, 1990. [9] H. Hauser, F. Ledermann, and H. Doleisch, “ Angular brushing of extended parallel coordinates
The Parallel Implementation of Algorithms for Finding the Reflection Symmetry of the Binary Images

NASA Astrophysics Data System (ADS)

Fedotova, S.; Seredin, O.; Kushnir, O.

2017-05-01

In this paper, we investigate an exact method of searching for the symmetry axis of a binary image, based on brute-force search among all potential symmetry axes. As a measure of symmetry, we use the set-theoretic Jaccard similarity applied to the two subsets of image pixels obtained by dividing the image along an axis. The brute-force search algorithm reliably finds the axis of approximate symmetry, which can be considered the ground truth, but it requires a long time to process each image. As the first step of our contribution, we develop a parallel version of the brute-force algorithm. It allows us to process large image databases and obtain the desired axis of approximate symmetry for each shape in a database. Experimental studies on the "Butterflies" and "Flavia" datasets have shown that the proposed algorithm takes several minutes per image to find a symmetry axis. However, real-world applications need computational efficiency that allows the symmetry axis search to be solved in real or quasi-real time. So, for fast shape symmetry calculation on a common multicore PC, we developed another parallel program, which is based on the procedure suggested earlier in (Fedotova, 2016). That method takes as its initial axis the axis obtained by a superfast comparison of two skeleton primitive sub-chains. This process takes about 0.5 sec on a common PC, which is considerably faster than any of the optimized brute-force methods, including those implemented on a supercomputer. In our experiments, in 70 percent of cases the axis found coincides exactly with the ground-truth one, and in the remaining cases it is very close to the ground truth.
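As a rough illustration of the brute-force idea, the sketch below scores candidate axes by the Jaccard similarity between the pixels on one side of an axis and the mirror image of the pixels on the other side. It is a simplification under stated assumptions: only vertical axes lying between pixel columns are tried (the paper searches all potential axes), the image is a small list of 0/1 rows, and the function names are illustrative rather than taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_vertical_axis(image):
    """Brute-force search over vertical axes x = c - 0.5 for c in 1..width-1.

    `image` is a list of rows of 0/1 values. Returns (best_c, best_score).
    Assumption: only vertical axes between pixel columns are considered.
    """
    width = len(image[0])
    pixels = {(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v}
    best_c, best_score = None, -1.0
    for c in range(1, width):
        left = {(r, x) for (r, x) in pixels if x < c}
        # Reflect right-side pixels across the axis x = c - 0.5: x -> 2c - 1 - x
        right_mirrored = {(r, 2 * c - 1 - x) for (r, x) in pixels if x >= c}
        score = jaccard(left, right_mirrored)
        if score > best_score:
            best_c, best_score = c, score
    return best_c, best_score


shape = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
axis, score = best_vertical_axis(shape)
```

Each candidate axis is scored independently of the others, so the loop over `c` could be split across cores without changing the result.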
On efficiency of fire simulation realization: parallelization with greater number of computational meshes

NASA Astrophysics Data System (ADS)

Valasek, Lukas; Glasa, Jan

2017-12-01

Current fire simulation systems are capable of exploiting the advantages of available high-performance computer (HPC) platforms and of modeling fires efficiently in parallel. In this paper, the efficiency of a corridor fire simulation on an HPC cluster is discussed. The parallel MPI version of Fire Dynamics Simulator is used to test the efficiency of selected strategies for allocating the computational resources of the cluster when a greater number of computational cores is used. Simulation results indicate that if the number of cores used is not a multiple of the total number of cluster node cores, there are allocation strategies which provide more efficient calculations.
ATPP: A Pipeline for Automatic Tractography-Based Brain Parcellation

PubMed Central

Li, Hai; Fan, Lingzhong; Zhuo, Junjie; Wang, Jiaojian; Zhang, Yu; Yang, Zhengyi; Jiang, Tianzi

2017-01-01

There is a longstanding effort to parcellate the brain into areas based on micro-structural, macro-structural, or connectional features, forming various brain atlases. Among them, connectivity-based parcellation has gained much emphasis, especially with the considerable progress of multimodal magnetic resonance imaging in the past two decades. The recently published Brainnetome Atlas is one such atlas that follows the framework of connectivity-based parcellation. However, in the construction of the atlas, the deluge of high-resolution multimodal MRI data and time-consuming computation pose challenges, and there is still a shortage of publicly available tools dedicated to parcellation. In this paper, we present an integrated open source pipeline (https://www.nitrc.org/projects/atpp), named Automatic Tractography-based Parcellation Pipeline (ATPP), to realize the framework of parcellation with automatic processing and massively parallel computing. ATPP is developed to have a powerful and flexible command line version, taking multiple regions of interest as input, as well as a user-friendly graphical user interface version for parcellating a single region of interest. We demonstrate the two versions by parcellating two brain regions, left precentral gyrus and middle frontal gyrus, on two independent datasets. In addition, ATPP has been successfully utilized and fully validated in a variety of brain regions and the human Brainnetome Atlas, showing the capacity to greatly facilitate brain parcellation. PMID:28611620

The Product and Quotient Rules Revisited

ERIC Educational Resources Information Center

Eggleton, Roger; Kustov, Vladimir

2011-01-01

Mathematical elegance is illustrated by strikingly parallel versions of the product and quotient rules of basic calculus, with some applications. Corresponding rules for second derivatives are given: the product rule is familiar, but the quotient rule is less so.
An accurate, fast, and scalable solver for high-frequency wave propagation

NASA Astrophysics Data System (ADS)

Zepeda-Núñez, L.; Taus, M.; Hewett, R.; Demanet, L.

2017-12-01

In many science and engineering applications, solving time-harmonic high-frequency wave propagation problems quickly and accurately is of paramount importance. For example, in geophysics, particularly in oil exploration, such problems can be the forward problem in an iterative process for solving the inverse problem of subsurface inversion. It is important to solve these wave propagation problems accurately in order to efficiently obtain meaningful solutions of the inverse problems: low order forward modeling can hinder convergence. Additionally, due to the volume of data and the iterative nature of most optimization algorithms, the forward problem must be solved many times. Therefore, a fast solver is necessary to make solving the inverse problem feasible. For time-harmonic high-frequency wave propagation, obtaining both speed and accuracy is historically challenging. Recently, there have been many advances in the development of fast solvers for such problems, including methods which have linear complexity with respect to the number of degrees of freedom. While most methods scale optimally only in the context of low-order discretizations and smooth wave speed distributions, the method of polarized traces has been shown to retain optimal scaling for high-order discretizations, such as hybridizable discontinuous Galerkin methods and for highly heterogeneous (and even discontinuous) wave speeds. The resulting fast and accurate solver is consequently highly attractive for geophysical applications. To date, this method relies on a layered domain decomposition together with a preconditioner applied in a sweeping fashion, which has limited straightforward parallelization. In this work, we introduce a new version of the method of polarized traces which reveals more parallel structure than previous versions while preserving all of its other advantages. We achieve this by further decomposing each layer and applying the preconditioner to these new components separately and in parallel. We demonstrate that this produces an even more effective and parallelizable preconditioner for a single right-hand side. As before, additional speed can be gained by pipelining several right-hand sides.
Detection of faults and software reliability analysis

NASA Technical Reports Server (NTRS)

Knight, John C.

1987-01-01

Multi-version or N-version programming is proposed as a method of providing fault tolerance in software. The approach requires the separate, independent preparation of multiple versions of a piece of software for some application. These versions are executed in parallel in the application environment; each receives identical inputs and each produces its version of the required outputs. The outputs are collected by a voter and, in principle, they should all be the same. In practice there may be some disagreement. If this occurs, the results of the majority are taken to be the correct output, and that is the output used by the system. A total of 27 programs were produced. Each of these programs was then subjected to one million randomly-generated test cases. The experiment yielded a number of programs containing faults that are useful for general studies of software reliability as well as studies of N-version programming. Fault tolerance through data diversity and analytic models of comparison testing are discussed.
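The voting step described above can be sketched in a few lines. This is a minimal illustration, not the experiment's actual voter: the three `version_*` functions are hypothetical stand-ins for independently written programs, they are called one after another here instead of in parallel, and outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the output produced by a strict majority of versions, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def run_n_versions(versions, x):
    """Feed the same input to every version and vote on the results."""
    return majority_vote([version(x) for version in versions])


# Three hypothetical versions of an absolute-value routine; one has a fault.
def version_a(x):
    return x if x >= 0 else -x


def version_b(x):
    return max(x, -x)


def version_c(x):
    return x  # faulty for negative inputs


result = run_n_versions([version_a, version_b, version_c], -5)
```

Here the faulty version is outvoted and `result` is 5. When no strict majority exists, the sketch returns `None` and leaves the recovery policy to the caller.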
LB3D: A parallel implementation of the Lattice-Boltzmann method for simulation of interacting amphiphilic fluids

NASA Astrophysics Data System (ADS)

Schmieschek, S.; Shamardin, L.; Frijters, S.; Krüger, T.; Schiller, U. D.; Harting, J.; Coveney, P. V.

2017-08-01

We introduce the lattice-Boltzmann code LB3D, version 7.1. Building on a parallel program and supporting tools which have enabled research utilising high performance computing resources for nearly two decades, LB3D version 7 provides a subset of the research code functionality as an open source project. Here, we describe the theoretical basis of the algorithm as well as computational aspects of the implementation. The software package is validated against simulations of meso-phases resulting from self-assembly in ternary fluid mixtures comprising immiscible and amphiphilic components such as water-oil-surfactant systems. The impact of the surfactant species on the dynamics of spinodal decomposition is tested, and quantitative measurement of the permeability of a body centred cubic (BCC) model porous medium for a simple binary mixture is described. Single-core performance and scaling behaviour of the code are reported for simulations on current supercomputer architectures.
Evict on write, a management strategy for a prefetch unit and/or first level cache in a multiprocessor system with speculative execution

DOEpatents

Gara, Alan; Ohmacht, Martin

2014-09-16

In a multiprocessor system with at least two levels of cache, a speculative thread may run on a core processor in parallel with other threads. When the thread seeks to do a write to main memory, this access is to be written through the first level cache to the second level cache. After the write-through, the corresponding line is deleted from the first level cache and/or prefetch unit, so that any further accesses to the same location in main memory have to be retrieved from the second level cache. The second level cache keeps track of multiple versions of data, where more than one speculative thread is running in parallel, while the first level cache does not have any of the versions during speculation. A switch allows choosing between modes of operation of a speculation blind first level cache.
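The write path described above can be mimicked with a toy model. This is only a sketch under strong assumptions: plain dictionaries stand in for the two cache levels, speculation and the multiple data versions kept by the second level cache are ignored entirely, and filling the first level on a read miss is a choice of this toy model, not something the abstract states. The class and method names are hypothetical.

```python
class TwoLevelCache:
    """Toy model: dicts stand in for the first (L1) and second (L2) level caches."""

    def __init__(self):
        self.l1 = {}
        self.l2 = {}

    def read(self, addr):
        if addr in self.l1:          # L1 hit
            return self.l1[addr]
        value = self.l2.get(addr)    # L1 miss: retrieve from L2
        if value is not None:
            self.l1[addr] = value    # assumption: fill L1 on a read miss
        return value

    def write(self, addr, value):
        self.l2[addr] = value        # write through to L2
        self.l1.pop(addr, None)      # evict on write: drop the L1 copy


cache = TwoLevelCache()
cache.l2[0x10] = 1
first = cache.read(0x10)             # fills L1
cache.write(0x10, 2)                 # L1 copy is evicted
in_l1_after_write = 0x10 in cache.l1
second = cache.read(0x10)            # served from L2
```

After the write, the line is absent from `l1`, so the next access to the same address has to go to `l2`, which is the behaviour the abstract describes.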
Parallel Unsteady Turbopump Simulations for Liquid Rocket Engines

NASA Technical Reports Server (NTRS)

Kiris, Cetin C.; Kwak, Dochan; Chan, William

2000-01-01

This paper reports the progress being made towards complete turbo-pump simulation capability for liquid rocket engines. The Space Shuttle Main Engine (SSME) turbo-pump impeller is used as a test case for the performance evaluation of the MPI and hybrid MPI/OpenMP versions of the INS3D code. Then, a computational model of a turbo-pump has been developed for the shuttle upgrade program. Relative motion of the grid system for rotor-stator interaction was obtained by employing overset grid techniques. Time-accuracy of the scheme has been evaluated by using simple test cases. Unsteady computations for the SSME turbo-pump, which contains 136 zones with 35 million grid points, are currently underway on Origin 2000 systems at NASA Ames Research Center. Results from time-accurate simulations with moving boundary capability, and the performance of the parallel versions of the code will be presented in the final paper.
Task-parallel message passing interface implementation of Autodock4 for docking of very large databases of compounds using high-performance super-computers.

PubMed

Collignon, Barbara; Schulz, Roland; Smith, Jeremy C; Baudry, Jerome

2011-04-30

A message passing interface (MPI)-based implementation (Autodock4.lga.MPI) of the grid-based docking program Autodock4 has been developed to allow simultaneous and independent docking of multiple compounds on up to thousands of central processing units (CPUs) using the Lamarckian genetic algorithm. The MPI version reads a single binary file containing precalculated grids that represent the protein-ligand interactions, i.e., van der Waals, electrostatic, and desolvation potentials, and needs only two input parameter files for the entire docking run. In comparison, the serial version of Autodock4 reads ASCII grid files and requires one parameter file per compound. The modifications performed result in significantly reduced input/output activity compared with the serial version. Autodock4.lga.MPI scales up to 8192 CPUs with a maximal overhead of 16.3%, of which two thirds is due to input/output operations and one third originates from MPI operations. The optimal docking strategy, which minimizes docking CPU time without lowering the quality of the database enrichments, comprises the docking of ligands preordered from the most to the least flexible and the assignment of the number of energy evaluations as a function of the number of rotatable bonds. In 24 h, on 8192 high-performance computing CPUs, the present MPI version would allow docking to a rigid protein of about 300K small flexible compounds or 11 million rigid compounds.
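One way to read the preordering strategy is as longest-job-first scheduling: the most flexible (and so most expensive) ligands are handed out first, which tends to even out the load across workers. The sketch below is an illustration under assumptions, not the Autodock4.lga.MPI scheduler: the greedy assignment to the least-loaded worker, the linear cost model, the function name, and the ligand names are all hypothetical.

```python
import heapq


def schedule_ligands(rotatable_bonds, n_workers, evals_per_bond=1000, base_evals=1000):
    """Assign ligands to workers, most flexible first.

    rotatable_bonds: dict mapping ligand name -> number of rotatable bonds.
    The linear cost model (base_evals + evals_per_bond * bonds) is a
    hypothetical stand-in for "energy evaluations as a function of the
    number of rotatable bonds"; the actual function is not given here.
    Returns (assignment, loads): per-worker ligand lists and total evaluations.
    """
    cost = {name: base_evals + evals_per_bond * bonds
            for name, bonds in rotatable_bonds.items()}
    # Preorder from the most to the least flexible ligand.
    order = sorted(rotatable_bonds, key=lambda name: rotatable_bonds[name], reverse=True)
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    assignment = [[] for _ in range(n_workers)]
    for name in order:
        load, w = heapq.heappop(heap)          # least-loaded worker so far
        assignment[w].append(name)
        heapq.heappush(heap, (load + cost[name], w))
    loads = [0] * n_workers
    for load, w in heap:
        loads[w] = load
    return assignment, loads


ligands = {"lig_a": 9, "lig_b": 0, "lig_c": 4, "lig_d": 7, "lig_e": 1}
assignment, loads = schedule_ligands(ligands, n_workers=2)
```

With these made-up inputs both workers end up with the same total number of evaluations, because the cheap rigid ligands are left to fill the gaps at the end.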
Fast hydrological model calibration based on the heterogeneous parallel computing accelerated shuffled complex evolution method

NASA Astrophysics Data System (ADS)

Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Hong, Yang; Zuo, Depeng; Ren, Minglei; Lei, Tianjie; Liang, Ke

2018-01-01

Hydrological model calibration has been a hot issue for decades. The shuffled complex evolution method developed at the University of Arizona (SCE-UA) has been proved to be an effective and robust optimization approach. However, its computational efficiency deteriorates significantly when the amount of hydrometeorological data increases. In recent years, the rise of heterogeneous parallel computing has brought hope for the acceleration of hydrological model calibration. This study proposed a parallel SCE-UA method and applied it to the calibration of a watershed rainfall-runoff model, the Xinanjiang model. The parallel method was implemented on heterogeneous computing systems using OpenMP and CUDA. Performance testing and sensitivity analysis were carried out to verify its correctness and efficiency. Comparison results indicated that heterogeneous parallel computing-accelerated SCE-UA converged much more quickly than the original serial version and possessed satisfactory accuracy and stability for the task of fast hydrological model calibration.
Large-Scale Low-Cost NGS Library Preparation Using a Robust Tn5 Purification and Tagmentation Protocol

PubMed Central

Hennig, Bianca P.; Velten, Lars; Racke, Ines; Tu, Chelsea Szu; Thoms, Matthias; Rybin, Vladimir; Besir, Hüseyin; Remans, Kim; Steinmetz, Lars M.

2017-01-01

Efficient preparation of high-quality sequencing libraries that well represent the biological sample is a key step for using next-generation sequencing in research. Tn5 enables fast, robust, and highly efficient processing of limited input material while scaling to the parallel processing of hundreds of samples. Here, we present a robust Tn5 transposase purification strategy based on an N-terminal His6-Sumo3 tag. We demonstrate that libraries prepared with our in-house Tn5 are of the same quality as those processed with a commercially available kit (Nextera XT), while they dramatically reduce the cost of large-scale experiments. We introduce improved purification strategies for two versions of the Tn5 enzyme. The first version carries the previously reported point mutations E54K and L372P, and stably produces libraries of constant fragment size distribution, even if the Tn5-to-input molecule ratio varies. The second Tn5 construct carries an additional point mutation (R27S) in the DNA-binding domain. This construct allows for adjustment of the fragment size distribution based on enzyme concentration during tagmentation, a feature that opens new opportunities for use of Tn5 in customized experimental designs. We demonstrate the versatility of our Tn5 enzymes in different experimental settings, including a novel single-cell polyadenylation site mapping protocol as well as ultralow input DNA sequencing. PMID:29118030
3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran

2017-03-01

In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented for a state-of-the-art multicore CPU-based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand for compute time, memory, storage and I/O, along with the need for their effective management. The most resource-intensive modules of the algorithm are traveltime calculation and migration summation, which exhibit an inherent trade-off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and the mechanism for feeding them to the migration process. The presented work is an extension of our previous work, in which a 3D Kirchhoff depth migration application for a multicore CPU-based parallel system was developed. Recently, we have worked on improving the parallel performance of this application by redesigning the parallelization approach. The new algorithm is capable of efficiently migrating both prestack and poststack 3D data. It is flexible enough to migrate a large number of traces within the available node memory and with minimal requirements for storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments, and the scalability results show striking improvement over its previous version. A 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and a 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.
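The reported efficiencies are consistent with the usual definition of parallel efficiency, speedup divided by the number of nodes, assuming the speedup is measured relative to a single node (the abstract does not state the baseline). A quick check of the arithmetic:

```python
def parallel_efficiency(speedup, n_nodes):
    """Parallel efficiency as a fraction: speedup divided by node count."""
    return speedup / n_nodes


prestack = parallel_efficiency(49.05, 64)   # 3D prestack data on 64 nodes
poststack = parallel_efficiency(32.00, 64)  # 3D poststack data on 64 nodes
```

Rounded to two decimals in percent, these give 76.64% and 50.00%, matching the figures quoted in the abstract.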
Parallelization of the Flow Field Dependent Variation Scheme for Solving the Triple Shock/Boundary Layer Interaction Problem

NASA Technical Reports Server (NTRS)

Schunk, Richard Gregory; Chung, T. J.

2001-01-01

A parallelized version of the Flowfield Dependent Variation (FDV) Method is developed to analyze a problem of current research interest, the flowfield resulting from a triple shock/boundary layer interaction. Such flowfields are often encountered in the inlets of high speed air-breathing vehicles including the NASA Hyper-X research vehicle. In order to resolve the complex shock structure and to provide adequate resolution for boundary layer computations of the convective heat transfer from surfaces inside the inlet, models containing over 500,000 nodes are needed. Efficient parallelization of the computation is essential to achieving results in a timely manner. Results from a parallelization scheme based upon multi-threading, as implemented on multiple processor supercomputers and workstations, are presented.
An Approach Using Parallel Architecture to Storage DICOM Images in Distributed File System

NASA Astrophysics Data System (ADS)

Soares, Tiago S.; Prado, Thiago C.; Dantas, M. A. R.; de Macedo, Douglas D. J.; Bauer, Michael A.

2012-02-01

Telemedicine is a very important area of the medical field that is expanding daily, driven by many researchers interested in improving medical applications. In Brazil, work started in 2005 in the State of Santa Catarina on a server called the CyclopsDCMServer, whose purpose is to use HDF for the manipulation of medical images (DICOM) on a distributed file system. Since then, much research has been initiated to seek better performance. Our approach for this server adds a parallel implementation of I/O operations, since HDF version 5 has a feature essential to our work: support for parallel I/O based on the MPI paradigm. Early experiments using four parallel nodes show good performance compared with the serial HDF implementation in the CyclopsDCMServer.
A simple computational algorithm of model-based choice preference.

PubMed

Toyama, Asako; Katahira, Kentaro; Ohira, Hideki

2017-08-01

A broadly used computational framework posits that two learning systems operate in parallel during the learning of choice preferences, namely, the model-free and model-based reinforcement-learning systems. In this study, we examined another possibility, through which model-free learning is the basic system and model-based information is its modulator. Accordingly, we proposed several modified versions of a temporal-difference learning model to explain the choice-learning process. Using the two-stage decision task developed by Daw, Gershman, Seymour, Dayan, and Dolan (2011), we compared their original computational model, which assumes a parallel learning process, and our proposed models, which assume a sequential learning process. Choice data from 23 participants showed a better fit with the proposed models. More specifically, the proposed eligibility adjustment model, which assumes that the environmental model can weight the degree of the eligibility trace, can explain choices better under both model-free and model-based controls and has a simpler computational algorithm than the original model. In addition, the forgetting learning model and its variation, which assume changes in the values of unchosen actions, substantially improved the fits to the data. Overall, we show that a hybrid computational model best fits the data. The parameters used in this model succeed in capturing individual tendencies with respect to both model use in learning and exploration behavior. This computational model provides novel insights into learning with interacting model-free and model-based components.
I/O Parallelization for the Goddard Earth Observing System Data Assimilation System (GEOS DAS)

NASA Technical Reports Server (NTRS)

Lucchesi, Rob; Sawyer, W.; Takacs, L. L.; Lyster, P.; Zero, J.

1998-01-01

The National Aeronautics and Space Administration (NASA) Data Assimilation Office (DAO) at the Goddard Space Flight Center (GSFC) has developed the GEOS DAS, a data assimilation system that provides production support for NASA missions and will support NASA's Earth Observing System (EOS) in the coming years. The GEOS DAS will be used to provide background fields of meteorological quantities to EOS satellite instrument teams for use in their data algorithms as well as providing assimilated data sets for climate studies on decadal time scales. The DAO has been involved in prototyping parallel implementations of the GEOS DAS for a number of years and is now embarking on an effort to convert the production version from shared-memory parallelism to distributed-memory parallelism using the portable Message-Passing Interface (MPI). The GEOS DAS consists of two main components, an atmospheric General Circulation Model (GCM) and a Physical-space Statistical Analysis System (PSAS). The GCM operates on data that are stored on a regular grid while PSAS works with observational data that are scattered irregularly throughout the atmosphere. As a result, the two components have different data decompositions. The GCM is decomposed horizontally as a checkerboard with all vertical levels of each box existing on the same processing element (PE). The dynamical core of the GCM can also operate on a rotated grid, which requires communication-intensive grid transformations during GCM integration. PSAS groups observations on PEs in a more irregular and dynamic fashion.
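The checkerboard decomposition of the GCM grid can be pictured with a short sketch that assigns each PE a rectangular box of horizontal grid indices. Everything specific here is an assumption made for illustration: the grid size, the 4 x 2 PE layout, the row-major rank ordering, and the function names are hypothetical and not taken from the GEOS DAS code.

```python
def block_range(n, parts, index):
    """Half-open range [start, stop) of block `index` when n items are split into `parts` blocks."""
    base, extra = divmod(n, parts)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def checkerboard(n_lon, n_lat, pe_x, pe_y):
    """Map each PE rank to the (lon_range, lat_range) box it owns.

    Every vertical level of a box lives on the same PE, so the vertical
    dimension is not split. Row-major rank ordering is an assumption.
    """
    layout = {}
    for j in range(pe_y):
        for i in range(pe_x):
            rank = j * pe_x + i
            layout[rank] = (block_range(n_lon, pe_x, i), block_range(n_lat, pe_y, j))
    return layout


# Hypothetical 144 x 91 horizontal grid spread over a 4 x 2 arrangement of PEs.
layout = checkerboard(n_lon=144, n_lat=91, pe_x=4, pe_y=2)
```

When the grid size does not divide evenly, the first few blocks along that direction receive one extra point, so box sizes differ by at most one row or column.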
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Biswas, Rupak

1999-01-01

The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
A parallel simulated annealing algorithm for standard cell placement on a hypercube computer

NASA Technical Reports Server (NTRS)

Jones, Mark Howard

1987-01-01

A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.
Quantum image pseudocolor coding based on the density-stratified method

NASA Astrophysics Data System (ADS)

Jiang, Nan; Wu, Wenya; Wang, Luo; Zhao, Na

2015-05-01

Pseudocolor processing is a branch of image enhancement. It converts grayscale images to color images to make them more attractive or to highlight some parts of them. This paper proposes a quantum image pseudocolor coding scheme based on the density-stratified method, which defines a colormap and changes the density value from gray to color in parallel according to the colormap. First, two data structures, the quantum image representation GQIR and the quantum colormap QCR, are reviewed or proposed. Then, the quantum density-stratified algorithm is presented. Based on them, the quantum realization in the form of circuits is given. The main advantages of the quantum version of pseudocolor processing over the classical approach are that it needs less memory and can speed up the computation. Two kinds of examples help to describe the scheme further. Finally, future work is analyzed.
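For orientation, the classical counterpart of the density-stratified method is a lookup from gray-level layers to colors. The sketch below shows only that classical idea and says nothing about the quantum circuits or the GQIR and QCR data structures; the thresholds, the three-color map, and the function name are hypothetical.

```python
import bisect


def density_stratify(gray_image, thresholds, colors):
    """Map each gray level to the color of the density layer it falls in.

    thresholds: sorted exclusive upper bounds separating the layers.
    colors: one (R, G, B) tuple per layer, so len(colors) == len(thresholds) + 1.
    """
    if len(colors) != len(thresholds) + 1:
        raise ValueError("need exactly one color per layer")
    return [[colors[bisect.bisect_right(thresholds, g)] for g in row]
            for row in gray_image]


# Hypothetical 3-layer colormap for 8-bit gray levels.
thresholds = [85, 170]                      # layers: 0-84, 85-169, 170-255
colors = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]
gray = [[0, 84, 85],
        [169, 170, 255]]
colored = density_stratify(gray, thresholds, colors)
```

Each pixel is mapped independently of every other pixel, which is why the conversion lends itself to being applied to all pixels in parallel.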
Aerobraking Maneuver (ABM) Report Generator

NASA Technical Reports Server (NTRS)

Fisher, Forrest; Gladden, Roy; Khanampornpan, Teerapat

2008-01-01

abmREPORT Version 3.1 is a Perl script that extracts vital summarization information from the Mars Reconnaissance Orbiter (MRO) aerobraking ABM build process. This information facilitates sequence reviews, and provides a high-level summarization of the sequence for mission management. The script extracts information from the ENV, SSF, FRF, SCMFmax, and OPTG files and burn magnitude configuration files and presents them in a single, easy-to-check report that provides the majority of the parameters necessary for cross check and verification during the sequence review process. This means that needed information, formerly spread across a number of different files and each in a different format, is all available in this one application. This program is built on the capabilities developed in dragReport and then the scripts evolved as the two tools continued to be developed in parallel.
Real-Space Density Functional Theory on Graphical Processing Units: Computational Approach and Comparison to Gaussian Basis Set Methods.

PubMed

Andrade, Xavier; Aspuru-Guzik, Alán

2013-10-08

We discuss the application of graphical processing units (GPUs) to accelerate real-space density functional theory (DFT) calculations. To make our implementation efficient, we have developed a scheme to expose the data parallelism available in the DFT approach; this is applied to the different procedures required for a real-space DFT calculation. We present results for current-generation GPUs from AMD and Nvidia, which show that our scheme, implemented in the free code Octopus, can reach a sustained performance of up to 90 GFlops for a single GPU, representing a significant speed-up when compared to the CPU version of the code. Moreover, for some systems, our implementation can outperform a GPU Gaussian basis set code, showing that the real-space approach is a competitive alternative for DFT simulations on GPUs.
Command/response protocols and concurrent software

NASA Technical Reports Server (NTRS)

Bynum, W. L.

1987-01-01

A version of the program to control the parallel jaw gripper is documented. The parallel jaw end-effector hardware and the Intel 8031 processor that is used to control the end-effector are briefly described. A general overview of the controller program is given, and a complete description of the program's structure and design is provided. There are three appendices: a memory map of the on-chip RAM, a cross-reference listing of the self-scheduling routines, and a summary of the top-level and monitor commands.

Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

DOE Office of Scientific and Technical Information (OSTI.GOV)

Levine, Benjamin G., E-mail: ben.levine@temple.ed; Stone, John E., E-mail: johns@ks.uiuc.ed; Kohlmeyer, Axel, E-mail: akohlmey@temple.ed

2011-05-01

The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.
Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units—Radial Distribution Function Histogramming

PubMed Central

Stone, John E.; Kohlmeyer, Axel

2011-01-01

The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU’s memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 seconds per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis. PMID:21547007
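The histogramming step described above can be illustrated with a minimal CPU-only sketch. It is not the VMD implementation: the function name `rdf_histogram`, the `tile` parameter, and the omission of periodic boundary conditions and of the final normalization to g(r) are all assumptions made for brevity. The tiled loop order only mimics the idea of reusing a small block of coordinates many times, as the GPU version does in fast on-chip memory.

```python
import math

def rdf_histogram(sel_a, sel_b, r_max, n_bins, tile=64):
    """Count pair distances between two selections of (x, y, z) points.

    Hypothetical helper: no periodic boundaries, no normalization to g(r).
    """
    hist = [0] * n_bins
    bin_width = r_max / n_bins
    # Walk both selections in fixed-size tiles so each small block of
    # coordinates is reused against a whole block of the other selection.
    for i0 in range(0, len(sel_a), tile):
        tile_a = sel_a[i0:i0 + tile]
        for j0 in range(0, len(sel_b), tile):
            tile_b = sel_b[j0:j0 + tile]
            for ax, ay, az in tile_a:
                for bx, by, bz in tile_b:
                    d = math.sqrt((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2)
                    if d < r_max:
                        # min() guards against a rounding edge case at r_max.
                        hist[min(int(d / bin_width), n_bins - 1)] += 1
    return hist

if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0)]
    b = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 5.0)]
    print(rdf_histogram(a, b, r_max=4.0, n_bins=4))
```

The tile size does not change the result, only the order in which pairs are visited; on a GPU the increments into `hist` are the place where atomic memory operations would be needed.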
EOS MLS Level 2 Data Processing Software Version 3

NASA Technical Reports Server (NTRS)

Livesey, Nathaniel J.; VanSnyder, Livesey W.; Read, William G.; Schwartz, Michael J.; Lambert, Alyn; Santee, Michelle L.; Nguyen, Honghanh T.; Froidevaux, Lucien; Wang, Shuhui; Manney, Gloria L.;

2011-01-01

This software accepts the EOS MLS calibrated microwave radiance measurement products and operational meteorological data, and produces a set of estimates of atmospheric temperature and composition. This version has been designed to be as flexible as possible. The software is controlled by a Level 2 Configuration File that governs all aspects of its operation: defining the contents of state and measurement vectors, defining the configurations of the various forward models available, reading appropriate a priori spectroscopic and calibration data, performing retrievals, post-processing results, computing diagnostics, and outputting results in appropriate files. In production mode, the software operates in a parallel form, with one instance of the program acting as a master, coordinating the work of multiple slave instances on a cluster of computers, each computing the results for individual chunks of data. In addition to performing conventional retrieval calculations and producing geophysical products, the software can be instructed by the Level 2 Configuration File to produce files of simulated radiances based on a state vector formed from a set of geophysical product files taken as input. Combining both the retrieval and simulation tasks in a single piece of software makes it far easier to ensure that identical forward model algorithms and parameters are used in both tasks. This also dramatically reduces the complexity of the code maintenance effort.
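The coordinator-and-workers arrangement described above can be sketched in a few lines. This is only an illustration under stated assumptions: a thread pool on one machine stands in for separate program instances on a cluster, and `split_into_chunks`, `process_chunk`, and `run_master` are hypothetical names with a placeholder computation rather than a retrieval.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, chunk_size):
    """Cut the input into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def process_chunk(chunk):
    """Placeholder for the per-chunk work done by one worker instance."""
    return sum(x * x for x in chunk)

def run_master(items, chunk_size, n_workers):
    """Coordinator: hand chunks to workers and collect results in chunk order."""
    chunks = split_into_chunks(items, chunk_size)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(process_chunk, chunks))

if __name__ == "__main__":
    print(run_master(list(range(10)), chunk_size=4, n_workers=3))
```

Because each chunk is independent, the results match a serial loop over the chunks regardless of how many workers are used.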

A new augmentation based algorithm for extracting maximal chordal subgraphs

DOE PAGES

Bhowmick, Sanjukta; Chen, Tzu-Yi; Halappanavar, Mahantesh

2014-10-18

A graph is chordal if every cycle of length greater than three contains an edge between non-adjacent vertices. Chordal graphs are of interest both theoretically, since they admit polynomial time solutions to a range of NP-hard graph problems, and practically, since they arise in many applications including sparse linear algebra, computer vision, and computational biology. A maximal chordal subgraph is a chordal subgraph that is not a proper subgraph of any other chordal subgraph. Existing algorithms for computing maximal chordal subgraphs depend on dynamically ordering the vertices, which is an inherently sequential process and therefore limits the algorithms’ parallelizability. In our paper we explore techniques to develop a scalable parallel algorithm for extracting a maximal chordal subgraph. We demonstrate that an earlier attempt at developing a parallel algorithm may induce a non-optimal vertex ordering and is therefore not guaranteed to terminate with a maximal chordal subgraph. We then give a new algorithm that first computes and then repeatedly augments a spanning chordal subgraph. After proving that the algorithm terminates with a maximal chordal subgraph, we then demonstrate that this algorithm is more amenable to parallelization and that the parallel version also terminates with a maximal chordal subgraph. That said, the complexity of the new algorithm is higher than that of the previous parallel algorithm, although the earlier algorithm computes a chordal subgraph which is not guaranteed to be maximal. Finally, we experimented with our augmentation-based algorithm on both synthetic and real-world graphs. We provide scalability results and also explore the effect of different choices for the initial spanning chordal subgraph on both the running time and on the number of edges in the maximal chordal subgraph.
A New Augmentation Based Algorithm for Extracting Maximal Chordal Subgraphs.

PubMed

Bhowmick, Sanjukta; Chen, Tzu-Yi; Halappanavar, Mahantesh

2015-02-01

A graph is chordal if every cycle of length greater than three contains an edge between non-adjacent vertices. Chordal graphs are of interest both theoretically, since they admit polynomial time solutions to a range of NP-hard graph problems, and practically, since they arise in many applications including sparse linear algebra, computer vision, and computational biology. A maximal chordal subgraph is a chordal subgraph that is not a proper subgraph of any other chordal subgraph. Existing algorithms for computing maximal chordal subgraphs depend on dynamically ordering the vertices, which is an inherently sequential process and therefore limits the algorithms' parallelizability. In this paper we explore techniques to develop a scalable parallel algorithm for extracting a maximal chordal subgraph. We demonstrate that an earlier attempt at developing a parallel algorithm may induce a non-optimal vertex ordering and is therefore not guaranteed to terminate with a maximal chordal subgraph. We then give a new algorithm that first computes and then repeatedly augments a spanning chordal subgraph. After proving that the algorithm terminates with a maximal chordal subgraph, we then demonstrate that this algorithm is more amenable to parallelization and that the parallel version also terminates with a maximal chordal subgraph. That said, the complexity of the new algorithm is higher than that of the previous parallel algorithm, although the earlier algorithm computes a chordal subgraph which is not guaranteed to be maximal. We experimented with our augmentation-based algorithm on both synthetic and real-world graphs. We provide scalability results and also explore the effect of different choices for the initial spanning chordal subgraph on both the running time and on the number of edges in the maximal chordal subgraph.
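To make the definitions above concrete, here is a deliberately naive sketch; it is not the authors' algorithm and makes no attempt at efficiency or parallelism. `is_chordal` relies on the standard fact that a graph is chordal exactly when simplicial vertices (vertices whose neighbours form a clique) can be removed one at a time until nothing is left. `maximal_chordal_subgraph` is a hypothetical helper that starts from the edgeless spanning subgraph, which is trivially chordal, and keeps adding edges as long as chordality is preserved.

```python
def is_chordal(adj):
    """adj maps each vertex to the set of its neighbours (undirected, no loops)."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    while remaining:
        simplicial = None
        for v, nbrs in remaining.items():
            # v is simplicial if every pair of its neighbours is adjacent.
            if all(u in remaining[w] for u in nbrs for w in nbrs if u != w):
                simplicial = v
                break
        if simplicial is None:
            return False  # no simplicial vertex left: some long cycle has no chord
        for w in remaining[simplicial]:
            remaining[w].discard(simplicial)
        del remaining[simplicial]
    return True

def maximal_chordal_subgraph(vertices, edges):
    """Greedy augmentation: add an edge whenever the result stays chordal."""
    adj = {v: set() for v in vertices}
    pending = list(edges)
    added = True
    while added:  # repeat passes until no pending edge can be added
        added = False
        still_pending = []
        for u, v in pending:
            adj[u].add(v)
            adj[v].add(u)
            if is_chordal(adj):
                added = True
            else:
                adj[u].discard(v)
                adj[v].discard(u)
                still_pending.append((u, v))
        pending = still_pending
    return adj

if __name__ == "__main__":
    square = [(0, 1), (1, 2), (2, 3), (3, 0)]
    sub = maximal_chordal_subgraph(range(4), square)
    print(sorted((u, v) for u in sub for v in sub[u] if u < v))
```

On a four-cycle the sketch keeps three of the four edges, since closing the cycle without a chord would break chordality.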
Real-time text extraction based on the page layout analysis system

NASA Astrophysics Data System (ADS)

Soua, M.; Benchekroun, A.; Kachouri, R.; Akil, M.

2017-05-01

Several approaches have been proposed to extract text from scanned documents. However, text extraction from heterogeneous documents remains a real challenge. Indeed, text extraction in this context is difficult because of variation in text size, style, and orientation, as well as the complexity of the document region background. Recently, we proposed the improved hybrid binarization based on K-means method (I-HBK) to suitably extract the text from heterogeneous documents. In this method, the Page Layout Analysis (PLA), part of the Tesseract OCR engine, is used to identify text and image regions. Afterwards, our hybrid binarization is applied separately to each kind of region. On the one hand, gamma correction is applied before processing image regions. On the other hand, binarization is performed directly on text regions. Then, a foreground and background color study is performed to correct inverted region colors. Finally, characters are located from the binarized regions based on the PLA algorithm. In this work, we extend the integration of the PLA algorithm within the I-HBK method. In addition, to speed up the step that separates text and image regions, we employ efficient GPU acceleration. Through the performed experiments, we demonstrate the high F-measure accuracy of the PLA algorithm, reaching 95% on the LRDE dataset. In addition, we compare the sequential and parallel PLA versions. The obtained results give a speedup of 3.7x when comparing the parallel PLA implementation on a GTX 660 GPU to the CPU version.
GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers

DOE PAGES

Abraham, Mark James; Murtola, Teemu; Schulz, Roland; ...

2015-07-15

GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types, preparation and analysis tools. Several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights through several new and enhanced parallelization algorithms. These work on every level: SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. Finally, the latest best-in-class compressed trajectory storage format is supported.
GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Abraham, Mark James; Murtola, Teemu; Schulz, Roland

GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types, preparation and analysis tools. Several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights through several new and enhanced parallelization algorithms. These work on every level: SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. Finally, the latest best-in-class compressed trajectory storage format is supported.
Data Parallel Line Relaxation (DPLR) Code User Manual: Acadia - Version 4.01.1

NASA Technical Reports Server (NTRS)

Wright, Michael J.; White, Todd; Mangini, Nancy

2009-01-01

The Data-Parallel Line Relaxation (DPLR) code is a computational fluid dynamics (CFD) solver that was developed at NASA Ames Research Center to help mission support teams generate high-value predictive solutions for hypersonic flow field problems. The DPLR Code Package is an MPI-based, parallel, full three-dimensional Navier-Stokes CFD solver with generalized models for finite-rate reaction kinetics, thermal and chemical non-equilibrium, accurate high-temperature transport coefficients, and ionized flow physics incorporated into the code. DPLR also includes a large selection of generalized realistic surface boundary conditions and links to enable loose coupling with external thermal protection system (TPS) material response and shock layer radiation codes.
An Adaptive Kalman Filter Using a Simple Residual Tuning Method

NASA Technical Reports Server (NTRS)

Harman, Richard R.

1999-01-01

One difficulty in using Kalman filters in real world situations is the selection of the correct process noise, measurement noise, and initial state estimate and covariance. These parameters are commonly referred to as tuning parameters. Multiple methods have been developed to estimate these parameters. Most of those methods such as maximum likelihood, subspace, and observer Kalman Identification require extensive offline processing and are not suitable for real time processing. One technique, which is suitable for real time processing, is the residual tuning method. Any mismodeling of the filter tuning parameters will result in a non-white sequence for the filter measurement residuals. The residual tuning technique uses this information to estimate corrections to those tuning parameters. The actual implementation results in a set of sequential equations that run in parallel with the Kalman filter. A. H. Jazwinski developed a specialized version of this technique for estimation of process noise. Equations for the estimation of the measurement noise have also been developed. These algorithms are used to estimate the process noise and measurement noise for the Wide Field Infrared Explorer star tracker and gyro.
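The premise that mistuned parameters produce non-white residuals can be checked numerically. The sketch below is not the residual tuning method of the report: it only simulates a scalar random-walk state, runs a scalar Kalman filter with a chosen process noise `q` and measurement noise `r`, and reports the lag-1 autocorrelation of the measurement residuals. All function names, the random-walk model, and the parameter values are assumptions chosen for illustration.

```python
import random

def simulate(n, q_true, r_true, seed=0):
    """Noisy measurements of a random-walk state (hypothetical test signal)."""
    rng = random.Random(seed)
    x, zs = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, q_true ** 0.5)            # process noise
        zs.append(x + rng.gauss(0.0, r_true ** 0.5))  # measurement noise
    return zs

def residuals(measurements, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter; returns the pre-update measurement residuals."""
    x, p, out = x0, p0, []
    for z in measurements:
        p = p + q            # predict covariance (state prediction is x itself)
        e = z - x            # measurement residual (innovation)
        k = p / (p + r)      # Kalman gain
        x = x + k * e        # state update
        p = (1.0 - k) * p    # covariance update
        out.append(e)
    return out

def lag1_autocorr(seq):
    """Sample lag-1 autocorrelation; near zero for a white sequence."""
    mean = sum(seq) / len(seq)
    num = sum((seq[i] - mean) * (seq[i + 1] - mean) for i in range(len(seq) - 1))
    den = sum((s - mean) ** 2 for s in seq)
    return num / den

if __name__ == "__main__":
    zs = simulate(2000, q_true=1.0, r_true=1.0)
    print("tuned    :", lag1_autocorr(residuals(zs, q=1.0, r=1.0)))
    print("mistuned :", lag1_autocorr(residuals(zs, q=1e-4, r=1.0)))
```

With the true noise values the residual autocorrelation should sit close to zero, while a process noise set far too low leaves the filter lagging the state and the residuals strongly correlated. A tuning scheme can use that kind of statistic as its error signal.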
Advantages of GPU technology in DFT calculations of intercalated graphene

NASA Astrophysics Data System (ADS)

Pešić, J.; Gajić, R.

2014-09-01

Over the past few years, the expansion of general-purpose graphics-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-wave basis and a pseudopotential approach. Since QE version 5.0, a plug-in component for the standard QE packages has allowed exploiting the capabilities of Nvidia GPU graphics cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties.
Our experiments have shown there are benefits from using GPUs, and we reached an acceleration of several times compared to standard CPU calculations.
F3D Image Processing and Analysis for Many - and Multi-core Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

F3D is written in OpenCL, so it achieves platform-portable parallelism on modern multi-core CPUs and many-core GPUs. The interface and mechanisms to access the F3D core are written in Java as a plugin for Fiji/ImageJ to deliver several key image-processing algorithms necessary to remove artifacts from micro-tomography data. The algorithms consist of data-parallel-aware filters that can efficiently utilize resources, work on out-of-core datasets, and scale efficiently across multiple accelerators. Optimizing for data-parallel filters, streaming out-of-core datasets, and efficient resource, memory, and data management over complex execution sequences of filters greatly expedites any scientific workflow with image processing requirements. F3D performs several different types of 3D image processing operations, such as non-linear filtering using bilateral filtering and/or median filtering and/or morphological operators (MM). F3D gray-level MM operators are one-pass constant time methods that can perform morphological transformations with a line-structuring element oriented in discrete directions. Additionally, MM operators can be applied to gray-scale images, and consist of two parts: (a) a reference shape or structuring element, which is translated over the image, and (b) a mechanism, or operation, that defines the comparisons to be performed between the image and the structuring element. This tool provides a critical component within many complex pipelines such as those for performing automated segmentation of image stacks. F3D is also called a "descendent" of Quant-CT, another software package we developed previously. These two modules are to be integrated in a future version. Further details were reported in: D.M. Ushizima, T. Perciano, H. Krishnan, B. Loring, H. Bale, D. Parkinson, and J. Sethian. Structure recognition from high-resolution images of ceramic composites. IEEE International Conference on Big Data, October 2014.
Code Parallelization with CAPO: A User Manual

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)

2001-01-01

A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software - in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmark to real-world application codes is presented. This will demonstrate the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experiences with this toolkit.
Development of a GPU Compatible Version of the Fast Radiation Code RRTMG

NASA Astrophysics Data System (ADS)

Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.

2012-12-01

The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. 
RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement through GPU technology. This large number of independent cases will allow us to take full advantage of the computational power of the latest GPUs, ensuring that all thread cores in the GPU remain active, a key criterion for obtaining significant speedup. The CUDA (Compute Unified Device Architecture) Fortran compiler developed by PGI and Nvidia will allow us to construct this parallel implementation on the GPU while remaining in the Fortran language. This implementation will scale very well across various CUDA-supported GPUs such as the recently released Fermi Nvidia cards. We will present the computational speed improvements of the GPU-compatible code relative to the standard CPU-based RRTMG with respect to a very large and diverse suite of atmospheric profiles. This suite will also be utilized to demonstrate the minimal impact of the code restructuring on the accuracy of radiation calculations. The GPU-compatible version of RRTMG will be directly applicable to future versions of GEOS-5, but it is also likely to provide significant associated benefits for other GCMs that employ RRTMG.
An efficient parallel algorithm for the calculation of canonical MP2 energies.

PubMed

Baker, Jon; Pulay, Peter

2002-09-01

We present the parallel version of a previous serial algorithm for the efficient calculation of canonical MP2 energies (Pulay, P.; Saebo, S.; Wolinski, K. Chem Phys Lett 2001, 344, 543). It is based on the Saebo-Almlöf direct-integral transformation, coupled with an efficient prescreening of the AO integrals. The parallel algorithm avoids synchronization delays by spawning a second set of slaves during the bin-sort prior to the second half-transformation. Results are presented for systems with up to 2000 basis functions. MP2 energies for molecules with 400-500 basis functions can be routinely calculated to microhartree accuracy on a small number of processors (6-8) in a matter of minutes with modern PC-based parallel computers. Copyright 2002 Wiley Periodicals, Inc. J Comput Chem 23: 1150-1156, 2002
VO-KOREL: A Fourier Disentangling Service of the Virtual Observatory

NASA Astrophysics Data System (ADS)

Škoda, Petr; Hadrava, Petr; Fuchs, Jan

2012-04-01

VO-KOREL is a web service exploiting the technology of the Virtual Observatory to provide astronomers with an intuitive graphical front-end and a distributed computing back-end running the most recent version of the Fourier disentangling code KOREL. The system integrates the ideas of the e-shop basket, preserving the privacy of every user through transfer encryption and access authentication, with features of a laboratory notebook, allowing easy housekeeping of both input parameters and final results; it also explores the newly emerging technology of cloud computing. While the web-based front-end allows the user to submit data and parameter files, edit parameters, manage a job list, resubmit or cancel running jobs, and above all view the text and graphical results of a disentangling process, the main part of the back-end is a simple job queue submission system executing multiple instances of the FORTRAN code KOREL in parallel. This may be easily extended for GRID-based deployment on massively parallel computing clusters. A short introduction to the underlying technologies is given, briefly mentioning advantages as well as bottlenecks of the design used.
Parallel halftoning technique using dot diffusion optimization

NASA Astrophysics Data System (ADS)

Molina-Garcia, Javier; Ponomaryov, Volodymyr I.; Reyes-Reyes, Rogelio; Cruz-Ramos, Clara

2017-05-01

In this paper, a novel approach for halftone images is proposed and implemented for images that are obtained by the Dot Diffusion (DD) method. The designed technique is based on an optimization of the so-called class matrix used in the DD algorithm and consists of generating new versions of the class matrix that have no barons or near-barons, in order to minimize inconsistencies during the distribution of the error. The proposed class matrices have different properties, and each is designed for one of two applications: applications where inverse halftoning is necessary, and applications where it is not required. The proposed method has been implemented on a GPU (NVIDIA GeForce GTX 750 Ti) and on multicore processors (AMD FX(tm)-6300 Six-Core Processor and Intel Core i5-4200U), using CUDA and OpenCV on a PC running Linux. Experimental results have shown that the novel framework generates good-quality halftone images and inverse halftone images. The simulation results using parallel architectures have demonstrated the efficiency of the novel technique when it is implemented in real-time processing.
The Surface Ocean CO2 Atlas: Stewarding Underway Carbon Data from Collection to Archival

NASA Astrophysics Data System (ADS)

O'Brien, K.; Smith, K. M.; Pfeil, B.; Landa, C.; Bakker, D. C. E.; Olsen, A.; Jones, S.; Shrestha, B.; Kozyr, A.; Manke, A. B.; Schweitzer, R.; Burger, E. F.

2016-02-01

The Surface Ocean CO2 Atlas (SOCAT, www.socat.info) is a quality controlled, global surface ocean carbon dioxide (CO2) data set gathered on research vessels, SOOP and buoys. To the degree feasible SOCAT is comprehensive; it draws together and applies uniform QC procedures to all such observations made across the international community. The first version of SOCAT (version 1.5) was publicly released in September 2011 (Bakker et al., 2011) with 6.3 million observations. This was followed by the release of SOCAT version 2, expanded to over 10 million observations, in June 2013 (Bakker et al., 2013). Most recently, in September 2015, SOCAT version 3 was released, containing over 14 million observations spanning almost 60 years. The process of assembling, QC'ing and publishing V1.5 and V2 of SOCAT required an unsustainable level of manual effort. To ease the burden on data managers and data providers, the SOCAT community agreed to embark on an automated data ingestion process that would create a streamlined workflow to improve data stewardship from ingestion to quality control and from publishing to archival. To that end, for version 3 and beyond, the SOCAT automation team created a framework that is based upon standards and conventions, yet at the same time allows scientists to work in the data formats they feel most comfortable with (i.e., CSV files). This automated workflow provides several advantages: 1) data ingestion into uniform and standards-based file formats; 2) ease of data integration into a standard quality control system; 3) data ingestion and quality control can be performed in parallel; 4) a uniform method of archiving carbon data and generating digital object identifiers (DOI). In this presentation, we will discuss and demonstrate the SOCAT data ingestion dashboard and the quality control system.
We will also discuss the standards, conventions, and tools that were leveraged to create a workflow that allows scientists to work in their own formats, yet provides a framework for creating high quality data products on an annual basis, while meeting or exceeding data requirements for access, documentation and archival.
Evaluation of SNS Beamline Shielding Configurations using MCNPX Accelerated by ADVANTG

DOE Office of Scientific and Technical Information (OSTI.GOV)

Risner, Joel M; Johnson, Seth R.; Remec, Igor

2015-01-01

Shielding analyses for the Spallation Neutron Source (SNS) at Oak Ridge National Laboratory pose significant computational challenges, including highly anisotropic high-energy sources, a combination of deep penetration shielding and an unshielded beamline, and a desire to obtain well-converged nearly global solutions for mapping of predicted radiation fields. The majority of these analyses have been performed using MCNPX with manually generated variance reduction parameters (source biasing and cell-based splitting and Russian roulette) that were largely based on the analyst's insight into the problem specifics. Development of the variance reduction parameters required extensive analyst time, and was often tailored to specific portions of the model phase space. We previously applied a developmental version of the ADVANTG code to an SNS beamline study to perform a hybrid deterministic/Monte Carlo analysis and showed that we could obtain nearly global Monte Carlo solutions with essentially uniform relative errors for mesh tallies that cover extensive portions of the model with typical voxel spacing of a few centimeters. The use of weight window maps and consistent biased sources produced using the FW-CADIS methodology in ADVANTG allowed us to obtain these solutions using substantially less computer time than the previous cell-based splitting approach. While those results were promising, the process of using the developmental version of ADVANTG was somewhat laborious, requiring user-developed Python scripts to drive much of the analysis sequence. In addition, limitations imposed by the size of weight-window files in MCNPX necessitated the use of relatively coarse spatial and energy discretization for the deterministic Denovo calculations that we used to generate the variance reduction parameters. We recently applied the production version of ADVANTG to this beamline analysis, which substantially streamlined the analysis process.
We also tested importance function collapsing (in space and energy) capabilities in ADVANTG. These changes, along with the support for parallel Denovo calculations using the current version of ADVANTG, give us the capability to improve the fidelity of the deterministic portion of the hybrid analysis sequence, obtain improved weight-window maps, and reduce both the analyst and computational time required for the analysis process.
TRANSMISSION NETWORK PLANNING METHOD FOR COMPARATIVE STUDIES (JOURNAL VERSION)

EPA Science Inventory

An automated transmission network planning method for comparative studies is presented. This method employs logical steps that may closely parallel those taken in practice by the planning engineers. Use is made of a sensitivity matrix to simulate the engineers' experience in sele...

Factorial validity and reliability of the Malaysian simplified Chinese version of Multidimensional Scale of Perceived Social Support (MSPSS-SCV) among a group of university students.

PubMed

Guan, Ng Chong; Seng, Loh Huai; Hway Ann, Anne Yee; Hui, Koh Ong

2015-03-01

This study was aimed at validating the simplified Chinese version of the Multidimensional Scale of Perceived Social Support (MSPSS-SCV) among a group of medical and dental students in University Malaya. Two hundred and two students who took part in this study were given the MSPSS-SCV, the Medical Outcome Study social support survey, the Malay version of the Beck Depression Inventory, the Malay version of the General Health Questionnaire, and the English version of the MSPSS. After 1 week, these students were again required to complete the MSPSS-SCV but with the item sequences shuffled. This scale displayed excellent internal consistency (Cronbach's α = .924), high test-retest reliability (.71), parallel form reliability (.92; Spearman's ρ, P < .01), and validity. In conclusion, the MSPSS-SCV demonstrated sound psychometric properties in measuring social support among a group of medical and dental students. It could therefore be used as a simple screening tool among young educated Malaysian adolescents. © 2013 APJPH.
APINetworks Java. A Java approach to the efficient treatment of large-scale complex networks

NASA Astrophysics Data System (ADS)

Muñoz-Caro, Camelia; Niño, Alfonso; Reyes, Sebastián; Castillo, Miriam

2016-10-01

We present a new version of the core structural package of our Application Programming Interface, APINetworks, for the treatment of complex networks in arbitrary computational environments. The new version is written in Java and presents several advantages over the previous C++ version: the portability of the Java code, the ease of object-oriented design implementations, and the simplicity of memory management. In addition, new data structures are introduced for storing the sets of nodes and edges. Also, by resorting to the different garbage collectors currently available in the JVM, the Java version is much more efficient than the C++ one with respect to memory management. In particular, the G1 collector is the most efficient one because of the parallel execution of G1 and the Java application. Using G1, APINetworks Java outperforms the C++ version and the well-known NetworkX and JGraphT packages in the building and BFS traversal of linear and complete networks. The better memory management of the present version allows for the modeling of much larger networks.
TES Validation Reports

Atmospheric Science Data Center

2014-06-30

... Reports: TES Data Versions: TES Validation Report Version 6.0 (PDF) R13 processing version; F07_10 file versions TES Validation Report Version 5.0 (PDF) R12 processing version; F06_08, F06_09 file ...
POM.gpu-v1.0: a GPU-based Princeton Ocean Model

NASA Astrophysics Data System (ADS)

Xu, S.; Huang, X.; Oey, L.-Y.; Xu, F.; Fu, H.; Zhang, Y.; Yang, G.

2015-09-01

Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) with GPUs. Specifically, we first convert the model from its original Fortran form to a new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each of the GPUs, the communications between the GPUs, and the I/O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, and it reduces the energy consumption by a factor of 6.8.
Application Characterization at Scale: Lessons learned from developing a distributed Open Community Runtime system for High Performance Computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Landwehr, Joshua B.; Suetterlein, Joshua D.; Marquez, Andres

2016-05-16

Since 2012, the U.S. Department of Energy’s X-Stack program has been developing solutions including runtime systems, programming models, languages, compilers, and tools for the Exascale system software to address crucial performance and power requirements. Fine grain programming models and runtime systems show a great potential to efficiently utilize the underlying hardware. Thus, they are essential to many X-Stack efforts. An abundance of small tasks can better utilize the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better, due to a decrease in state and control. Nevertheless, current applications have been written to exploit old paradigms (such as Communicating Sequential Processes and Bulk Synchronous Parallel processing). To fully utilize the advantages of these new systems, applications need to be adapted to these new paradigms. As part of the applications’ porting process, in-depth characterization studies, focused on both application characteristics and runtime features, need to take place to fully understand the application performance bottlenecks and how to resolve them. This paper presents a characterization study for a novel high performance runtime system, called the Open Community Runtime, using key HPC kernels as its vehicle. This study has the following contributions: one of the first high performance, fine grain, distributed memory runtime systems implementing the OCR standard (version 0.99a); and a characterization study of key HPC kernels in terms of runtime primitives running in both intra- and inter-node environments. Running on a general purpose cluster, we have found up to 1635x relative speed-up for a parallel tiled Cholesky kernel on 128 nodes with 16 cores each and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores.
SDA 7: A modular and parallel implementation of the simulation of diffusional association software

PubMed Central

Martinez, Michael; Romanowska, Julia; Kokh, Daria B.; Ozboyaci, Musa; Yu, Xiaofeng; Öztürk, Mehmet Ali; Richter, Stefan

2015-01-01

The simulation of diffusional association (SDA) Brownian dynamics software package has been widely used in the study of biomacromolecular association. Initially developed to calculate bimolecular protein–protein association rate constants, it has since been extended to study electron transfer rates, to predict the structures of biomacromolecular complexes, to investigate the adsorption of proteins to inorganic surfaces, and to simulate the dynamics of large systems containing many biomacromolecular solutes, allowing the study of concentration‐dependent effects. These extensions have led to a number of divergent versions of the software. In this article, we report the development of the latest version of the software (SDA 7). This release was developed to consolidate the existing codes into a single framework, while improving the parallelization of the code to better exploit modern multicore shared memory computer architectures. It is built using a modular object‐oriented programming scheme, to allow for easy maintenance and extension of the software, and includes new features, such as adding flexible solute representations. We discuss a number of application examples, which describe some of the methods available in the release, and provide benchmarking data to demonstrate the parallel performance. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc. PMID:26123630
Parallel software support for computational structural mechanics

NASA Technical Reports Server (NTRS)

Jordan, Harry F.

1987-01-01

The application of the parallel programming methodology known as the Force was conducted. Two application issues were addressed. The first involves the efficiency of the implementation and its completeness in terms of satisfying the needs of other researchers implementing parallel algorithms. Support for, and interaction with, other Computational Structural Mechanics (CSM) researchers using the Force was the main issue, but some independent investigation of the Barrier construct, which is extremely important to overall performance, was also undertaken. Another efficiency issue which was addressed was that of relaxing the strong synchronization condition imposed on the self-scheduled parallel DO loop. The Force was extended by the addition of logical conditions to the cases of a parallel case construct and by the inclusion of a self-scheduled version of this construct. The second issue involved applying the Force to the parallelization of finite element codes such as those found in the NICE/SPAR testbed system. One of the more difficult problems encountered is the determination of what information in COMMON blocks is actually used outside of a subroutine and when a subroutine uses a COMMON block merely as scratch storage for internal temporary results.
Tycho 2: A Proxy Application for Kinetic Transport Sweeps

DOE Office of Scientific and Technical Information (OSTI.GOV)

Garrett, Charles Kristopher; Warsa, James S.

2016-09-14

Tycho 2 is a proxy application that implements discrete ordinates (SN) kinetic transport sweeps on unstructured, 3D, tetrahedral meshes. It has been designed to be small and require minimal dependencies to make collaboration and experimentation as easy as possible. Tycho 2 has been released as open source software. The software is currently in a beta release with plans for a stable release (version 1.0) before the end of the year. The code is parallelized via MPI across spatial cells and OpenMP across angles. Currently, several parallelization algorithms are implemented.
MPACT Standard Input User's Manual, Version 2.2.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Collins, Benjamin S.; Downar, Thomas; Fitzgerald, Andrew

The MPACT (Michigan PArallel Characteristics based Transport) code is designed to perform high-fidelity light water reactor (LWR) analysis using whole-core pin-resolved neutron transport calculations on modern parallel-computing hardware. The code consists of several libraries which provide the functionality necessary to solve steady-state eigenvalue problems. Several transport capabilities are available within MPACT including both 2-D and 3-D Method of Characteristics (MOC). A three-dimensional whole core solution based on the 2D-1D solution method provides the capability for full core depletion calculations.
Xyce™ Parallel Electronic Simulator Reference Guide, Version 6.5

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting

2016-06-01

This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users’ Guide. The focus of this document is to list, as exhaustively as possible, device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users’ Guide. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.
A Massively Parallel Code for Polarization Calculations

NASA Astrophysics Data System (ADS)

Akiyama, Shizuka; Höflich, Peter

2001-03-01

We present an implementation of our Monte-Carlo radiation transport method for rapidly expanding, NLTE atmospheres for massively parallel computers which utilizes both the distributed and shared memory models. This allows us to take full advantage of the fast communication and low latency inherent to nodes with multiple CPUs, and to stretch the limits of scalability with the number of nodes compared to a version which is based on the shared memory model. Test calculations on a local 20-node Beowulf cluster with dual CPUs showed an improved scalability by about 40%.
Tools for Atmospheric Radiative Transfer: Streamer and FluxNet. Revised

NASA Technical Reports Server (NTRS)

Key, Jeffrey R.; Schweiger, Axel J.

1998-01-01

Two tools for the solution of radiative transfer problems are presented. Streamer is a highly flexible medium spectral resolution radiative transfer model based on the plane-parallel theory of radiative transfer. Capable of computing either fluxes or radiances, it is suitable for studying radiative processes at the surface or within the atmosphere and for the development of remote-sensing algorithms. FluxNet is a fast neural network-based implementation of Streamer for computing surface fluxes. It allows for a sophisticated treatment of radiative processes in the analysis of large data sets and potential integration into geophysical models where computational efficiency is an issue. Documentation and tools for the development of alternative versions of Fluxnet are available. Collectively, Streamer and FluxNet solve a wide variety of problems related to radiative transfer: Streamer provides the detail and sophistication needed to perform basic research on most aspects of complex radiative processes while the efficiency and simplicity of FluxNet make it ideal for operational use.
Fast Segmentation From Blurred Data in 3D Fluorescence Microscopy.

PubMed

Storath, Martin; Rickert, Dennis; Unser, Michael; Weinmann, Andreas

2017-10-01

We develop a fast algorithm for segmenting 3D images from linear measurements based on the Potts model (or piecewise constant Mumford-Shah model). To that end, we first derive suitable space discretizations of the 3D Potts model, which are capable of dealing with 3D images defined on non-cubic grids. Our discretization allows us to utilize a specific splitting approach, which results in decoupled subproblems of moderate size. The crucial point in the 3D setup is that the number of independent subproblems is so large that we can reasonably exploit the parallel processing capabilities of the graphics processing units (GPUs). Our GPU implementation is up to 18 times faster than the sequential CPU version. This allows even large volumes to be processed in acceptable runtimes. As a further contribution, we extend the algorithm in order to deal with non-negativity constraints. We demonstrate the efficiency of our method for combined image deconvolution and segmentation on simulated data and on real 3D wide field fluorescence microscopy data.
Investigating a method of producing "red and dead" galaxies

NASA Astrophysics Data System (ADS)

Skory, Stephen

2010-08-01

In optical wavelengths, galaxies are observed to be either red or blue. The overall color of a galaxy is due to the distribution of the ages of its stellar population. Galaxies with currently active star formation appear blue, while those with no recent star formation at all (greater than about a Gyr) have only old, red stars. This strong bimodality has led to the idea of star formation quenching, and various proposed physical mechanisms. In this dissertation, I attempt to reproduce with Enzo the results of Naab et al. (2007), in which red and dead galaxies are formed using gravitational quenching, rather than with one of the more typical methods of quenching. My initial attempts are unsuccessful, and I explore the reasons why I think they failed. Then using simpler methods better suited to Enzo + AMR, I am successful in producing a galaxy that appears to be similar in color and formation history to those in Naab et al. However, quenching is achieved using unphysically high star formation efficiencies, which is a different mechanism than Naab et al. suggest. Preliminary results of a much higher resolution, follow-on simulation of the above show some possible contradiction with the results of Naab et al. Cold gas is streaming into the galaxy to fuel starbursts, while at a similar epoch the galaxies in Naab et al. have largely already ceased forming stars in the galaxy. On the other hand, the results of the high resolution simulation are qualitatively similar to other works in the literature that show a somewhat different gravitational quenching mechanism than Naab et al. I also discuss my work using halo finders to analyze simulated cosmological data, and my work improving the Enzo/AMR analysis tool "yt". This includes two parallelizations of the halo finder HOP (Eisenstein and Hut, 1998) which allow analysis of very large cosmological datasets on parallel machines.
The first version is "yt-HOP," which works well for datasets between about 256^3 and 512^3 particles, but has memory bottlenecks as the datasets get larger. These bottlenecks inspired the second version, "Parallel HOP," which is a fully parallelized method and implementation of HOP that has worked on datasets with more than 2048^3 particles on hundreds of processing cores. Both methods are described in detail, as are the various effects of performance-related runtime options. Additionally, both halo finders are subjected to a full suite of performance benchmarks varying both dataset sizes and computational resources used. I conclude with descriptions of four new tools I added to yt. A Parallel Structure Function Generator allows analysis of two-point functions, such as correlation functions, using memory- and workload-parallelism. A Parallel Merger Tree Generator leverages the parallel halo finders in yt, such as Parallel HOP, to build the merger tree of halos in a cosmological simulation, and outputs the result to a SQLite database for simple and powerful data extraction. A Star Particle Analysis toolkit takes a group of star particles and can output the rate of formation as a function of time, and/or a synthetic Spectral Energy Distribution (S.E.D.) using the Bruzual and Charlot (2003) data tables. Finally, a Halo Mass Function toolkit takes as input a list of halo masses and can output the halo mass function for the halos, as well as an analytical fit for those halos using several previously published fits.
An Automatic Measure of Cross-Language Text Structures

ERIC Educational Resources Information Center

Kim, Kyung

2018-01-01

In order to further validate and extend the application of "GIKS" (Graphical Interface of Knowledge Structure) beyond English, this investigation applies the "GIKS" to capture, visually represent, and compare text structures inherent in two "contrasting" languages. The English and parallel Korean versions of 50…
Large-scale enrichment and discovery of gene-associated SNPs

USDA-ARS's Scientific Manuscript database

With the recent advent of massively parallel pyrosequencing by 454 Life Sciences it has become feasible to cost-effectively identify numerous single nucleotide polymorphisms (SNPs) within the recombinogenic regions of the maize (Zea mays L.) genome. We developed a modified version of hypomethylated...
4P: fast computing of population genetics statistics from large DNA polymorphism panels

PubMed Central

Benazzo, Andrea; Panziera, Alex; Bertorelle, Giorgio

2015-01-01

Massive DNA sequencing has significantly increased the amount of data available for population genetics and molecular ecology studies. However, the parallel computation of simple statistics within and between populations from large panels of polymorphic sites is not yet available, making the exploratory analyses of a set or subset of data a very laborious task. Here, we present 4P (parallel processing of polymorphism panels), a stand-alone software program for the rapid computation of genetic variation statistics (including the joint frequency spectrum) from millions of DNA variants in multiple individuals and multiple populations. It handles a standard input file format commonly used to store DNA variation from empirical or simulation experiments. The computational performance of 4P was evaluated using large SNP (single nucleotide polymorphism) datasets from human genomes or obtained by simulations. 4P was faster or much faster than other comparable programs, and the impact of parallel computing using multicore computers or servers was evident. 4P is a useful tool for biologists who need a simple and rapid computer program to run exploratory population genetics analyses in large panels of genomic data. It is also particularly suitable to analyze multiple data sets produced in simulation studies. Unix, Windows, and MacOs versions are provided, as well as the source code for easier pipeline implementations. PMID:25628874
An experiment in hurricane track prediction using parallel computing methods

NASA Technical Reports Server (NTRS)

Song, Chang G.; Jwo, Jung-Sing; Lakshmivarahan, S.; Dhall, S. K.; Lewis, John M.; Velden, Christopher S.

1994-01-01

The barotropic model is used to explore the advantages of parallel processing in deterministic forecasting. We apply this model to the track forecasting of hurricane Elena (1985). In this particular application, solutions to systems of elliptic equations are the essence of the computational mechanics. One set of equations is associated with the decomposition of the wind into irrotational and nondivergent components - this determines the initial nondivergent state. Another set is associated with recovery of the streamfunction from the forecasted vorticity. We demonstrate that direct parallel methods based on accelerated block cyclic reduction (BCR) significantly reduce the computational time required to solve the elliptic equations germane to this decomposition and forecast problem. A 72-h track prediction was made using incremental time steps of 16 min on a network of 3000 grid points nominally separated by 100 km. The prediction took 30 sec on the 8-processor Alliant FX/8 computer. This was a speed-up of 3.7 when compared to the one-processor version. The 72-h prediction of Elena's track was made as the storm moved toward Florida's west coast. Approximately 200 km west of Tampa Bay, Elena executed a dramatic recurvature that ultimately changed its course toward the northwest. Although the barotropic track forecast was unable to capture the hurricane's tight cycloidal looping maneuver, the subsequent northwesterly movement was accurately forecasted as was the location and timing of landfall near Mobile Bay.
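The reported speed-up of 3.7 on 8 processors can be put in perspective with two standard derived quantities. The sketch below is not taken from the paper: it assumes the usual definitions of parallel efficiency (speed-up divided by processor count) and of the Karp-Flatt experimentally determined serial fraction, and the function names are illustrative.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / processors


def karp_flatt_serial_fraction(speedup: float, processors: int) -> float:
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    if processors < 2:
        raise ValueError("the metric needs at least two processors")
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


# Figures quoted in the abstract: speed-up 3.7 on the 8-processor Alliant FX/8.
efficiency = parallel_efficiency(3.7, 8)
serial_fraction = karp_flatt_serial_fraction(3.7, 8)
print(f"efficiency = {efficiency:.2f}, serial fraction = {serial_fraction:.2f}")
```

Under these definitions the quoted figures correspond to roughly 46% efficiency and a serial fraction of about 0.17, which lumps together truly serial work and parallel overhead.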
WARP

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bergmann, Ryan M.; Rowland, Kelly L.

2017-04-12

WARP, which can stand for "Weaving All the Random Particles," is a three-dimensional (3D) continuous energy Monte Carlo neutron transport code developed at UC Berkeley to efficiently execute on NVIDIA graphics processing unit (GPU) platforms. WARP accelerates Monte Carlo simulations while preserving the benefits of using the Monte Carlo method, namely, that very few physical and geometrical simplifications are applied. WARP is able to calculate multiplication factors, neutron flux distributions (in both space and energy), and fission source distributions for time-independent neutron transport problems. It can run in either criticality or fixed source mode, but fixed source mode is currently not robust, optimized, or maintained in the newest version. WARP can transport neutrons in unrestricted arrangements of parallelepipeds, hexagonal prisms, cylinders, and spheres. The goal of developing WARP is to investigate algorithms that can grow into a full-featured, continuous energy, Monte Carlo neutron transport code that is accelerated by running on GPUs. The crux of the effort is to make Monte Carlo calculations faster while producing accurate results. Modern supercomputers are commonly being built with GPU coprocessor cards in their nodes to increase their computational efficiency and performance. GPUs execute efficiently on data-parallel problems, but most CPU codes, including those for Monte Carlo neutral particle transport, are predominantly task-parallel. WARP uses a data-parallel neutron transport algorithm to take advantage of the computing power GPUs offer.
Parallelizing quantum circuit synthesis

NASA Astrophysics Data System (ADS)

Di Matteo, Olivia; Mosca, Michele

2016-03-01

Quantum circuit synthesis is the process in which an arbitrary unitary operation is decomposed into a sequence of gates from a universal set, typically one which a quantum computer can implement both efficiently and fault-tolerantly. As physical implementations of quantum computers improve, the need is growing for tools that can effectively synthesize components of the circuits and algorithms they will run. Existing algorithms for exact, multi-qubit circuit synthesis scale exponentially in the number of qubits and circuit depth, leaving synthesis intractable for circuits on more than a handful of qubits. Even modest improvements in circuit synthesis procedures may lead to significant advances, pushing forward the boundaries of not only the size of solvable circuit synthesis problems, but also in what can be realized physically as a result of having more efficient circuits. We present a method for quantum circuit synthesis using deterministic walks. Also termed pseudorandom walks, these are walks in which once a starting point is chosen, its path is completely determined. We apply our method to construct a parallel framework for circuit synthesis, and implement one such version performing optimal T-count synthesis over the Clifford+T gate set. We use our software to present examples where parallelization offers a significant speedup on the runtime, as well as directly confirm that the 4-qubit 1-bit full adder has optimal T-count 7 and T-depth 3.

Partitioning in parallel processing of production systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Oflazer, K.

1987-01-01

This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpreter with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.
Parallel processing considerations for image recognition tasks

NASA Astrophysics Data System (ADS)

Simske, Steven J.

2011-01-01

Many image recognition tasks are well-suited to parallel processing. The most obvious example is that many imaging tasks require the analysis of multiple images. From this standpoint, then, parallel processing need be no more complicated than assigning individual images to individual processors. However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. Parallel processing by task allows the assignment of multiple workflows (as diverse as optical character recognition [OCR], document classification and barcode reading) to parallel pipelines. This can substantially decrease time to completion for the document tasks. For this approach, each parallel pipeline is generally performing a different task. Parallel processing by image region allows a larger imaging task to be sub-divided into a set of parallel pipelines, each performing the same task but on a different data set. This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. Finally, parallel processing by meta-algorithm allows different algorithms to be deployed on the same image simultaneously. This approach may result in improved accuracy.
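The "by image region" category described in this abstract can be illustrated with a small map-reduce sketch. Everything below is a hypothetical illustration rather than the paper's code: the image is a plain list of pixel rows, the per-region task is a simple dark-pixel count, and the helper names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

Image = list[list[int]]  # grayscale rows, pixel values 0-255


def split_into_strips(image: Image, n_strips: int) -> list[Image]:
    """Divide the image into roughly equal horizontal strips (the regions)."""
    rows = len(image)
    bounds = [rows * i // n_strips for i in range(n_strips + 1)]
    return [image[bounds[i]:bounds[i + 1]] for i in range(n_strips)]


def count_dark_pixels(strip: Image, threshold: int = 128) -> int:
    """Map step: the same task applied to one region."""
    return sum(1 for row in strip for pixel in row if pixel < threshold)


def dark_pixels_by_region(image: Image, n_strips: int = 4) -> int:
    """Run the map step on every strip concurrently, then reduce by summing."""
    strips = split_into_strips(image, n_strips)
    with ThreadPoolExecutor(max_workers=n_strips) as pool:
        partial_counts = list(pool.map(count_dark_pixels, strips))
    return reduce(lambda a, b: a + b, partial_counts, 0)


# Toy 8x8 image: left half dark (0), right half bright (255).
toy_image = [[0] * 4 + [255] * 4 for _ in range(8)]
total_dark = dark_pixels_by_region(toy_image)
print(total_dark)
```

A thread pool keeps the sketch short and portable; for CPU-bound work in CPython a process pool (or a GPU kernel) would be the more likely choice, since threads share one interpreter lock.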
Gust Acoustics Computation with a Space-Time CE/SE Parallel 3D Solver

NASA Technical Reports Server (NTRS)

Wang, X. Y.; Himansu, A.; Chang, S. C.; Jorgenson, P. C. E.; Reddy, D. R. (Technical Monitor)

2002-01-01

The benchmark Problem 2 in Category 3 of the Third Computational Aero-Acoustics (CAA) Workshop is solved using the space-time conservation element and solution element (CE/SE) method. This problem concerns the unsteady response of an isolated finite-span swept flat-plate airfoil bounded by two parallel walls to an incident gust. The acoustic field generated by the interaction of the gust with the flat-plate airfoil is computed by solving the 3D (three-dimensional) Euler equations in the time domain using a parallel version of a 3D CE/SE solver. The effect of the gust orientation on the far-field directivity is studied. Numerical solutions are presented and compared with analytical solutions, showing a reasonable agreement.
The Design of a High Performance Earth Imagery and Raster Data Management and Processing Platform

NASA Astrophysics Data System (ADS)

Xie, Qingyun

2016-06-01

This paper summarizes the general requirements and specific characteristics of both geospatial raster database management system and raster data processing platform from a domain-specific perspective as well as from a computing point of view. It also discusses the need for tight integration between the database system and the processing system. These requirements resulted in Oracle Spatial GeoRaster, a global scale and high performance earth imagery and raster data management and processing platform. The rationale, design, implementation, and benefits of Oracle Spatial GeoRaster are described. Basically, as a database management system, GeoRaster defines an integrated raster data model, supports image compression, data manipulation, general and spatial indices, content and context based queries and updates, versioning, concurrency, security, replication, standby, backup and recovery, multitenancy, and ETL. It provides high scalability using computer and storage clustering. As a raster data processing platform, GeoRaster provides basic operations, image processing, raster analytics, and data distribution featuring high performance computing (HPC). Specifically, HPC features include locality computing, concurrent processing, parallel processing, and in-memory computing. In addition, the APIs and the plug-in architecture are discussed.
(2+1)-dimensional spacetimes containing closed timelike curves

NASA Astrophysics Data System (ADS)

Headrick, Matthew P.; Gott, J. Richard, III

1994-12-01

We investigate the global geometries of (2+1)-dimensional spacetimes as characterized by the transformations undergone by tangent spaces upon parallel transport around closed curves. We critically discuss the use of the term "total energy-momentum" as a label for such parallel-transport transformations, pointing out several problems with it. We then investigate parallel-transport transformations in the known (2+1)-dimensional spacetimes containing closed timelike curves (CTC's), and introduce a few new such spacetimes. Using the more specific concept of the holonomy of a closed curve, applicable in simply connected spacetimes, we emphasize that Gott's two-particle CTC-containing spacetime does not have a tachyonic geometry. Finally, we prove the following modified version of Kabat's conjecture: if a CTC is deformable to spacelike or null infinity while remaining a CTC, then its parallel-transport transformation cannot be a rotation; therefore its holonomy, if defined, cannot be a rotation other than through a multiple of 2π.
Extending HPF for advanced data parallel applications

NASA Technical Reports Server (NTRS)

Chapman, Barbara; Mehrotra, Piyush; Zima, Hans

1994-01-01

The stated goal of High Performance Fortran (HPF) was to 'address the problems of writing data parallel programs where the distribution of data affects performance'. After examining the current version of the language we are led to the conclusion that HPF has not fully achieved this goal. While the basic distribution functions offered by the language - regular block, cyclic, and block cyclic distributions - can support regular numerical algorithms, advanced applications such as particle-in-cell codes or unstructured mesh solvers cannot be expressed adequately. We believe that this is a major weakness of HPF, significantly reducing its chances of becoming accepted in the numeric community. The paper discusses the data distribution and alignment issues in detail, points out some flaws in the basic language, and outlines possible future paths of development. Furthermore, we briefly deal with the issue of task parallelism and its integration with the data parallel paradigm of HPF.
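As a rough illustration of the three regular distributions named in the abstract above (block, cyclic, and block cyclic), the following Python sketch maps array indices to owning processors. The function names and the small example sizes are hypothetical and are not taken from HPF or from the paper; the sketch only shows why such regular mappings suit regular numerical algorithms, since ownership follows from the index alone.

```python
def block_owner(i, n, p):
    """Owner of index i when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling division: elements per processor
    return i // block


def cyclic_owner(i, p):
    """Owner of index i under a round-robin (cyclic) distribution."""
    return i % p


def block_cyclic_owner(i, k, p):
    """Owner of index i when blocks of k elements are dealt out round-robin."""
    return (i // k) % p


# Illustrative sizes only: 8 array elements on 2 processors.
n, p = 8, 2
block = [block_owner(i, n, p) for i in range(n)]
cyclic = [cyclic_owner(i, p) for i in range(n)]
block_cyclic = [block_cyclic_owner(i, 2, p) for i in range(n)]
```

In all three cases the owner is a closed-form function of the index. An unstructured mesh or a particle-in-cell code has no such formula, which is the kind of gap the abstract describes.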
Xyce parallel electronic simulator users guide, version 6.1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors), which includes support for most popular parallel and serial computers; a differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms and allows one to develop new types of analysis without requiring the implementation of analysis-specific device models; device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase, a message-passing parallel implementation, which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Xyce parallel electronic simulator users' guide, Version 6.0.1.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase, a message-passing parallel implementation, which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Xyce parallel electronic simulator users guide, version 6.0.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R; Mei, Ting; Russo, Thomas V.

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase, a message-passing parallel implementation, which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Impact of data layouts on the efficiency of GPU-accelerated IDW interpolation.

PubMed

Mei, Gang; Tian, Hong

2016-01-01

This paper focuses on evaluating the impact of different data layouts on the computational efficiency of the GPU-accelerated Inverse Distance Weighting (IDW) interpolation algorithm. First we redesign and improve our previous GPU implementation, which exploited CUDA dynamic parallelism (CDP). Then we implement three GPU versions, i.e., the naive version, the tiled version, and the improved CDP version, based upon five data layouts: the Structure of Arrays (SoA), the Array of Structures (AoS), the Array of aligned Structures (AoaS), the Structure of Arrays of aligned Structures (SoAoS), and the Hybrid layout. We also carry out several groups of experimental tests to evaluate the impact. Experimental results show that the layouts AoS and AoaS achieve better performance than the layout SoA for both the naive version and the tiled version, while the layout SoA is the best choice for the improved CDP version. We also observe that the two combined data layouts (the SoAoS and the Hybrid) bring no notable performance gains when compared to the other three basic layouts. We recommend that, in practical applications, the layout AoaS be used, since the tiled version is the fastest of the three versions. The source code of all implementations is publicly available.
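To make the layout terminology in the abstract above concrete, here is a minimal CPU-side sketch in Python with NumPy; it is not the authors' CUDA code. It assumes the standard IDW estimate, a weighted mean with weights 1/d^power, and shows the same data points stored as an Array of Structures (a NumPy structured array) and as a Structure of Arrays (separate contiguous arrays). The function name `idw`, the default power of 2, and the sample points are illustrative assumptions.

```python
import numpy as np


def idw(px, py, pz, qx, qy, power=2.0):
    """IDW estimate at each query point (qx, qy) from data points (px, py, pz)."""
    # Squared distances between every query point and every data point.
    d2 = (qx[:, None] - px[None, :]) ** 2 + (qy[:, None] - py[None, :]) ** 2
    d2 = np.maximum(d2, 1e-24)   # guard against division by zero at data points
    w = d2 ** (-power / 2.0)     # weights 1 / d**power
    return (w @ pz) / w.sum(axis=1)


# Array of Structures: each record holds x, y, z side by side in memory.
aos = np.array([(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)],
               dtype=[("x", "f8"), ("y", "f8"), ("z", "f8")])

# Structure of Arrays: one contiguous array per coordinate.
soa = {name: np.ascontiguousarray(aos[name]) for name in ("x", "y", "z")}

qx = np.array([1.0, 0.0])
qy = np.array([0.0, 0.0])

from_aos = idw(aos["x"], aos["y"], aos["z"], qx, qy)
from_soa = idw(soa["x"], soa["y"], soa["z"], qx, qy)
```

Both layouts give the same interpolated values; only the memory access pattern differs, and that pattern is what the paper measures on the GPU.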
Coherent state quantization of quaternions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Muraleetharan, B., E-mail: bbmuraleetharan@jfn.ac.lk; Thirulogasanthar, K., E-mail: santhar@gmail.com

Parallel to the quantization of the complex plane, using the canonical coherent states of a right quaternionic Hilbert space, the quaternion field of quaternionic quantum mechanics is quantized. Associated upper symbols, lower symbols, and related quantities are analyzed. Quaternionic versions of the harmonic oscillator and the Weyl-Heisenberg algebra are also obtained.
Modern Science and Conservative Islam: An Uneasy Relationship

ERIC Educational Resources Information Center

Edis, Taner

2009-01-01

Familiar Western debates about religion, science, and science education have parallels in the Islamic world. There are difficulties reconciling conservative, traditional versions of Islam with modern science, particularly theories such as evolution. As a result, many conservative Muslim thinkers are drawn toward creationism, hopes of Islamizing…
Argonne simulation framework for intelligent transportation systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ewing, T.; Doss, E.; Hanebutte, U.

1996-04-01

A simulation framework has been developed which defines a high-level architecture for a large-scale, comprehensive, scalable simulation of an Intelligent Transportation System (ITS). The simulator is designed to run on parallel computers and distributed (networked) computer systems; however, a version for a stand-alone workstation is also available. The ITS simulator includes an Expert Driver Model (EDM) of instrumented "smart" vehicles with in-vehicle navigation units. The EDM is capable of performing optimal route planning and communicating with Traffic Management Centers (TMC). A dynamic road map database is used for optimum route planning, where the data is updated periodically to reflect any changes in road or weather conditions. The TMC has probe vehicle tracking capabilities (display position and attributes of instrumented vehicles), and can provide 2-way interaction with traffic to provide advisories and link times. Both the in-vehicle navigation module and the TMC feature detailed graphical user interfaces that include human-factors studies to support safety and operational research. Realistic modeling of variations of the posted driving speed is based on human-factors studies that take into consideration weather, road conditions, driver's personality and behavior, and vehicle type. The simulator has been developed on a distributed system of networked UNIX computers, but is designed to run on ANL's IBM SP-X parallel computer system for large-scale problems. A novel feature of the developed simulator is that vehicles will be represented by autonomous computer processes, each with a behavior model which performs independent route selection and reacts to external traffic events much like real vehicles. Vehicle processes interact with each other and with ITS components by exchanging messages. With this approach, one will be able to take advantage of emerging massively parallel processor (MPP) systems.
The ENU-3 protein family members function in the Wnt pathway parallel to UNC-6/Netrin to promote motor neuron axon outgrowth in C. elegans.

PubMed

Florica, Roxana Oriana; Hipolito, Victoria; Bautista, Stephen; Anvari, Homa; Rapp, Chloe; El-Rass, Suzan; Asgharian, Alimohammad; Antonescu, Costin N; Killeen, Marie T

2017-10-01

The axons of the DA and DB classes of motor neurons fail to reach the dorsal cord in the absence of the guidance cue UNC-6/Netrin or its receptor UNC-5 in C. elegans. However, the axonal processes usually exit their cell bodies in the ventral cord in the absence of both molecules. Strains lacking functional versions of UNC-6 or UNC-5 have a low level of DA and DB motor neuron axon outgrowth defects. We found that mutations in the genes for all six of the ENU-3 proteins function to enhance the outgrowth defects of the DA and DB axons in strains lacking either UNC-6 or UNC-5. A mutation in the gene for the MIG-14/Wntless protein also enhances defects in a strain lacking either UNC-5 or UNC-6, suggesting that the ENU-3 and Wnt pathways function parallel to the Netrin pathway in directing motor neuron axon outgrowth. Our evidence suggests that the ENU-3 proteins are novel members of the Wnt pathway in nematodes. Five of the six members of the ENU-3 family are predicted to be single-pass trans-membrane proteins. The expression pattern of ENU-3.1 was consistent with plasma membrane localization. One family member, ENU-3.6, lacks the predicted signal peptide and the membrane-spanning domain. In HeLa cells ENU-3.6 had a cytoplasmic localization and caused actin dependent processes to appear. We conclude that the ENU-3 family proteins function in a pathway parallel to the UNC-6/Netrin pathway for motor neuron axon outgrowth, most likely in the Wnt pathway. Copyright © 2017 Elsevier Inc. All rights reserved.
ICON-MIC: Implementing a CPU/MIC Collaboration Parallel Framework for ICON on Tianhe-2 Supercomputer.

PubMed

Wang, Zihao; Chen, Yu; Zhang, Jingrong; Li, Lun; Wan, Xiaohua; Liu, Zhiyong; Sun, Fei; Zhang, Fa

2018-03-01

Electron tomography (ET) is an important technique for studying the three-dimensional structures of the biological ultrastructure. Recently, ET has reached sub-nanometer resolution for investigating the native and conformational dynamics of macromolecular complexes by combining with the sub-tomogram averaging approach. Due to the limited sampling angles, ET reconstruction typically suffers from the "missing wedge" problem. Using a validation procedure, iterative compressed-sensing optimized nonuniform fast Fourier transform (NUFFT) reconstruction (ICON) demonstrates its power in restoring validated missing information for a low-signal-to-noise ratio biological ET dataset. However, the huge computational demand has become a bottleneck for the application of ICON. In this work, we implemented a parallel acceleration technology ICON-many integrated core (MIC) on Xeon Phi cards to address the huge computational demand of ICON. During this step, we parallelize the element-wise matrix operations and use the efficient summation of a matrix to reduce the cost of matrix computation. We also developed parallel versions of NUFFT on MIC to achieve a high acceleration of ICON by using more efficient fast Fourier transform (FFT) calculation. We then proposed a hybrid task allocation strategy (two-level load balancing) to improve the overall performance of ICON-MIC by making full use of the idle resources on the Tianhe-2 supercomputer. Experimental results using two different datasets show that ICON-MIC has high accuracy in biological specimens under different noise levels and a significant acceleration, up to 13.3×, compared with the CPU version. Further, ICON-MIC has good scalability efficiency and overall performance on the Tianhe-2 supercomputer.
Time variation of effective climate sensitivity in GCMs

NASA Astrophysics Data System (ADS)

Williams, K. D.; Ingram, W. J.; Gregory, J. M.

2009-04-01

Effective climate sensitivity is often assumed to be constant (if uncertain), but some previous studies of General Circulation Model (GCM) simulations have found it varying as the simulation progresses. This complicates the fitting of simple models to such simulations, as well as having implications for the estimation of climate sensitivity from observations. This study examines the evolution of the feedbacks determining the climate sensitivity in GCMs submitted to the Coupled Model Intercomparison Project. Apparent centennial-timescale variations of effective climate sensitivity during stabilisation to a forcing can be considered an artefact of using conventional forcings which only allow for instantaneous effects and stratospheric adjustment. If the forcing is adjusted for processes occurring on timescales which are short compared to the climate stabilisation timescale then there is little centennial timescale evolution of effective climate sensitivity in any of the GCMs. We suggest that much of the apparent variation in effective climate sensitivity identified in previous studies is actually due to the comparatively fast forcing adjustment. Persistent differences are found in the strength of the feedbacks between the coupled atmosphere - ocean (AO) versions and their atmosphere - mixed-layer ocean (AML) counterparts, (the latter are often assumed to give the equilibrium climate sensitivity of the AOGCM). The AML model can typically only estimate the equilibrium climate sensitivity of the parallel AO version to within about 0.5K. The adjustment to the forcing to account for comparatively fast processes varies in magnitude and sign between GCMs, as well as differing between AO and AML versions of the same model. There is evidence from one AOGCM that the forcing adjustment may take a couple of decades, with implications for observationally based estimates of equilibrium climate sensitivity. 
We suggest that at least some of the spread in 21st century global temperature predictions between GCMs is due to differing adjustment processes, hence work to understand these differences should be a priority.
EMAN2: an extensible image processing suite for electron microscopy.

PubMed

Tang, Guang; Peng, Liwei; Baldwin, Philip R; Mann, Deepinder S; Jiang, Wen; Rees, Ian; Ludtke, Steven J

2007-01-01

EMAN is a scientific image processing package with a particular focus on single particle reconstruction from transmission electron microscopy (TEM) images. It was first released in 1999, and new versions have been released typically 2-3 times each year since that time. EMAN2 has been under development for the last two years, with a completely refactored image processing library, and a wide range of features to make it much more flexible and extensible than EMAN1. The user-level programs are better documented, more straightforward to use, and written in the Python scripting language, so advanced users can modify the programs' behavior without any recompilation. A completely rewritten 3D transformation class simplifies translation between Euler angle standards and symmetry conventions. The core C++ library has over 500 functions for image processing and associated tasks, and it is modular with introspection capabilities, so programmers can add new algorithms with minimal effort and programs can incorporate new capabilities automatically. Finally, a flexible new parallelism system has been designed to address the shortcomings in the rigid system in EMAN1.
Employing machine learning for reliable miRNA target identification in plants

PubMed Central

2011-01-01

Background miRNAs are ~21 nucleotide long small noncoding RNA molecules, formed endogenously in most of the eukaryotes, which mainly control their target genes post transcriptionally by interacting and silencing them. While many tools have been developed for animal miRNA target systems, plant miRNA target identification has witnessed limited development. Most of the plant tools have been centered around exact complementarity match. Very few of them considered other factors like multiple target sites and the role of flanking regions. Result In the present work, a Support Vector Regression (SVR) approach has been implemented for plant miRNA target identification, utilizing position specific dinucleotide density variation information around the target sites, to yield highly reliable results. It has been named p-TAREF (plant-Target Refiner). Performance comparison for p-TAREF was done with other prediction tools for plants with utmost rigor, and p-TAREF was found to perform better in several aspects. Further, p-TAREF was run over the experimentally validated miRNA targets from species like Arabidopsis, Medicago, Rice and Tomato, and detected them accurately, suggesting gross usability of p-TAREF for plant species. Using p-TAREF, target identification was done for the complete Rice transcriptome, supported by expression and degradome based data. miR156 was found to be an important component of the Rice regulatory system, where control of genes associated with growth and transcription looked predominant. The entire methodology has been implemented in a multi-threaded parallel architecture in Java, to enable fast processing for the web-server version as well as the standalone version. This also allows it to run even on a simple desktop computer in concurrent mode. It also provides a facility to gather experimental support for predictions made, through on the spot expression data analysis, in its web-server version. 
Conclusion A machine learning multivariate feature tool has been implemented in parallel and locally installable form, for plant miRNA target identification. The performance was assessed and compared through comprehensive testing and benchmarking, suggesting a reliable performance and gross usability for transcriptome wide plant miRNA target identification. PMID:22206472
Employing machine learning for reliable miRNA target identification in plants.

PubMed

Jha, Ashwani; Shankar, Ravi

2011-12-29

miRNAs are ~21 nucleotide long small noncoding RNA molecules, formed endogenously in most of the eukaryotes, which mainly control their target genes post transcriptionally by interacting and silencing them. While many tools have been developed for animal miRNA target systems, plant miRNA target identification has witnessed limited development. Most of the plant tools have been centered around exact complementarity match. Very few of them considered other factors like multiple target sites and the role of flanking regions. In the present work, a Support Vector Regression (SVR) approach has been implemented for plant miRNA target identification, utilizing position specific dinucleotide density variation information around the target sites, to yield highly reliable results. It has been named p-TAREF (plant-Target Refiner). Performance comparison for p-TAREF was done with other prediction tools for plants with utmost rigor, and p-TAREF was found to perform better in several aspects. Further, p-TAREF was run over the experimentally validated miRNA targets from species like Arabidopsis, Medicago, Rice and Tomato, and detected them accurately, suggesting gross usability of p-TAREF for plant species. Using p-TAREF, target identification was done for the complete Rice transcriptome, supported by expression and degradome based data. miR156 was found to be an important component of the Rice regulatory system, where control of genes associated with growth and transcription looked predominant. The entire methodology has been implemented in a multi-threaded parallel architecture in Java, to enable fast processing for the web-server version as well as the standalone version. This also allows it to run even on a simple desktop computer in concurrent mode. It also provides a facility to gather experimental support for predictions made, through on the spot expression data analysis, in its web-server version. 
A machine learning multivariate feature tool has been implemented in parallel and locally installable form, for plant miRNA target identification. The performance was assessed and compared through comprehensive testing and benchmarking, suggesting a reliable performance and gross usability for transcriptome wide plant miRNA target identification.
T.Node, industrial version of supernode

NASA Astrophysics Data System (ADS)

Flieller, Sylvain

1989-12-01

The Esprit I P1085 "SuperNode" project developed a modular reconfigurable architecture, based on transputers. This highly parallel machine is now marketed by Telmat Informatique under the name T.Node. This paper presents the P1085 project, the architecture of SuperNode, its industrial implementation and its software environment.

PHAST--a program for simulating ground-water flow, solute transport, and multicomponent geochemical reactions

USGS Publications Warehouse

Parkhurst, David L.; Kipp, Kenneth L.; Engesgaard, Peter; Charlton, Scott R.

2004-01-01

The computer program PHAST simulates multi-component, reactive solute transport in three-dimensional saturated ground-water flow systems. PHAST is a versatile ground-water flow and solute-transport simulator with capabilities to model a wide range of equilibrium and kinetic geochemical reactions. The flow and transport calculations are based on a modified version of HST3D that is restricted to constant fluid density and constant temperature. The geochemical reactions are simulated with the geochemical model PHREEQC, which is embedded in PHAST. PHAST is applicable to the study of natural and contaminated ground-water systems at a variety of scales ranging from laboratory experiments to local and regional field scales. PHAST can be used in studies of migration of nutrients, inorganic and organic contaminants, and radionuclides; in projects such as aquifer storage and recovery or engineered remediation; and in investigations of the natural rock-water interactions in aquifers. PHAST is not appropriate for unsaturated-zone flow, multiphase flow, density-dependent flow, or waters with high ionic strengths. A variety of boundary conditions are available in PHAST to simulate flow and transport, including specified-head, flux, and leaky conditions, as well as the special cases of rivers and wells. Chemical reactions in PHAST include (1) homogeneous equilibria using an ion-association thermodynamic model; (2) heterogeneous equilibria between the aqueous solution and minerals, gases, surface complexation sites, ion exchange sites, and solid solutions; and (3) kinetic reactions with rates that are a function of solution composition. The aqueous model (elements, chemical reactions, and equilibrium constants), minerals, gases, exchangers, surfaces, and rate expressions may be defined or modified by the user. A number of options are available to save results of simulations to output files. 
The data may be saved in three formats: a format suitable for viewing with a text editor; a format suitable for exporting to spreadsheets and post-processing programs; or in Hierarchical Data Format (HDF), which is a compressed binary format. Data in the HDF file can be visualized on Windows computers with the program Model Viewer and extracted with the utility program PHASTHDF; both programs are distributed with PHAST. Operator splitting of the flow, transport, and geochemical equations is used to separate the three processes into three sequential calculations. No iterations between transport and reaction calculations are implemented. A three-dimensional Cartesian coordinate system and finite-difference techniques are used for the spatial and temporal discretization of the flow and transport equations. The non-linear chemical equilibrium equations are solved by a Newton-Raphson method, and the kinetic reaction equations are solved by a Runge-Kutta or an implicit method for integrating ordinary differential equations. The PHAST simulator may require large amounts of memory and long Central Processing Unit (CPU) times. To reduce the long CPU times, a parallel version of PHAST has been developed that runs on a multiprocessor computer or on a collection of computers that are networked. The parallel version requires Message Passing Interface, which is currently (2004) freely available. The parallel version is effective in reducing simulation times. This report documents the use of the PHAST simulator, including running the simulator, preparing the input files, selecting the output files, and visualizing the results. It also presents four examples that verify the numerical method and demonstrate the capabilities of the simulator. PHAST requires three input files. Only the flow and transport file is described in detail in this report. 
The other two files, the chemistry data file and the database file, are identical to PHREEQC files and the detailed description of these files is found in the PHREEQC documentation.
Version pressure feedback mechanisms for speculative versioning caches

DOEpatents

Eichenberger, Alexandre E.; Gara, Alan; O'Brien, Kathryn M.; Ohmacht, Martin; Zhuang, Xiaotong

2013-03-12

Mechanisms are provided for controlling version pressure on a speculative versioning cache. Raw version pressure data is collected based on one or more threads accessing cache lines of the speculative versioning cache. One or more statistical measures of version pressure are generated based on the collected raw version pressure data. A determination is made as to whether one or more modifications to an operation of a data processing system are to be performed based on the one or more statistical measures of version pressure, the one or more modifications affecting version pressure exerted on the speculative versioning cache. An operation of the data processing system is modified based on the one or more determined modifications, in response to a determination that one or more modifications to the operation of the data processing system are to be performed, to affect the version pressure exerted on the speculative versioning cache.
Some thoughts about parallel process and psychotherapy supervision: when is a parallel just a parallel?

PubMed

Watkins, C Edward

2012-09-01

In a way not done before, Tracey, Bludworth, and Glidden-Tracey ("Are there parallel processes in psychotherapy supervision: An empirical examination," Psychotherapy, 2011, advance online publication, doi.10.1037/a0026246) have shown us that parallel process in psychotherapy supervision can indeed be rigorously and meaningfully researched, and their groundbreaking investigation provides a nice prototype for future supervision studies to emulate. In what follows, I offer a brief complementary comment to Tracey et al., addressing one matter that seems to be a potentially important conceptual and empirical parallel process consideration: When is a parallel just a parallel? PsycINFO Database Record (c) 2012 APA, all rights reserved.
Description of the NCAR Community Climate Model (CCM3). Technical note

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kiehl, J.T.; Hack, J.J.; Bonan, G.B.

This report presents the details of the governing equations, physical parameterizations, and numerical algorithms defining the version of the NCAR Community Climate Model designated CCM3. The material provides an overview of the major model components, and the way in which they interact as the numerical integration proceeds. This version of the CCM incorporates significant improvements to the physics package, new capabilities such as the incorporation of a slab ocean component, and a number of enhancements to the implementation (e.g., the ability to integrate the model on parallel distributed-memory computational platforms).
Seeing the forest for the trees: Networked workstations as a parallel processing computer

NASA Technical Reports Server (NTRS)

Breen, J. O.; Meleedy, D. M.

1992-01-01

Unlike traditional 'serial' processing computers in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units, thereby performing several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system cannot rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.
Parallel Processing at the High School Level.

ERIC Educational Resources Information Center

Sheary, Kathryn Anne

This study investigated the ability of high school students to cognitively understand and implement parallel processing. Data indicates that most parallel processing is being taught at the university level. Instructional modules on C, Linux, and the parallel processing language, P4, were designed to show that high school students are highly…
Run-time parallelization and scheduling of loops

NASA Technical Reports Server (NTRS)

Saltz, Joel H.; Mirchandaney, Ravi; Crowley, Kay

1991-01-01

Run-time methods are studied to automatically parallelize and schedule iterations of a do loop in certain cases where compile-time information is inadequate. The methods presented involve execution time preprocessing of the loop. At compile-time, these methods set up the framework for performing a loop dependency analysis. At run-time, wavefronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. Symbolic transformation rules are used to produce: inspector procedures that perform execution time preprocessing, and executors or transformed versions of source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run-time reordering of loop indexes can have a significant impact on performance.
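The inspector/executor idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `inspect_wavefronts` and `execute`, and the representation of dependencies as a list of predecessor lists, are assumptions made for illustration. The inspector assigns each iteration to the earliest wavefront in which all of its predecessors have finished; the executor then walks the wavefronts in order, and the iterations inside one wavefront are mutually independent, so they could be dispatched concurrently.

```python
def inspect_wavefronts(deps):
    """Inspector: group loop iterations into wavefronts.

    deps[i] lists the earlier iterations that iteration i depends on
    (hypothetical input format). Returns a list of wavefronts, each a
    list of iteration indices that do not depend on one another.
    """
    level = [0] * len(deps)
    for i, preds in enumerate(deps):
        for p in preds:
            if p >= i:
                raise ValueError("dependencies must point to earlier iterations")
            # An iteration runs one wavefront after its latest predecessor.
            level[i] = max(level[i], level[p] + 1)
    fronts = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        fronts[lv].append(i)
    return fronts


def execute(fronts, body):
    """Executor: run wavefronts in order.

    The inner loop is written serially here; its iterations are
    independent and are the part that could be run in parallel.
    """
    for front in fronts:
        for i in front:
            body(i)


# Iteration 3 needs 1 and 2, which both need 0; iteration 5 needs 4.
deps = [[], [0], [0], [1, 2], [], [4]]
fronts = inspect_wavefronts(deps)
order = []
execute(fronts, order.append)
print(fronts)
```

In this example the six iterations collapse into three wavefronts, so at most three sequential steps are needed instead of six.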
Branson: A Mini-App for Studying Parallel IMC, Version 1.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Long, Alex

This code solves the gray thermal radiative transfer (TRT) equations in parallel using simple opacities and Cartesian meshes. Although Branson solves the TRT equations, it is not designed to model radiation transport: Branson contains simple physics and does not have a multigroup treatment, nor can it use physical material data. The opacities are simple polynomials in temperature, and there is a limited ability to specify complex geometries and sources. Branson was designed only to capture the computational demands of production IMC codes, especially in large parallel runs. It was also intended to foster collaboration with vendors, universities and other DOE partners. Branson is similar in character to the neutron transport proxy-app Quicksilver from LLNL, which was recently open-sourced.
Soft-output decoding algorithms in iterative decoding of turbo codes

NASA Technical Reports Server (NTRS)

Benedetto, S.; Montorsi, G.; Divsalar, D.; Pollara, F.

1996-01-01

In this article, we present two versions of a simplified maximum a posteriori decoding algorithm. The algorithms work in a sliding window form, like the Viterbi algorithm, and can thus be used to decode continuously transmitted sequences obtained by parallel concatenated codes, without requiring code trellis termination. A heuristic explanation is also given of how to embed the maximum a posteriori algorithms into the iterative decoding of parallel concatenated codes (turbo codes). The performances of the two algorithms are compared on the basis of a powerful rate 1/3 parallel concatenated code. Basic circuits to implement the simplified a posteriori decoding algorithm using lookup tables, and two further approximations (linear and threshold), with a very small penalty, to eliminate the need for lookup tables are proposed.
Simulated parallel annealing within a neighborhood for optimization of biomechanical systems.

PubMed

Higginson, J S; Neptune, R R; Anderson, F C

2005-09-01

Optimization problems for biomechanical systems have become extremely complex. Simulated annealing (SA) algorithms have performed well in a variety of test problems and biomechanical applications; however, despite advances in computer speed, convergence to optimal solutions for systems of even moderate complexity has remained prohibitive. The objective of this study was to develop a portable parallel version of a SA algorithm for solving optimization problems in biomechanics. The algorithm for simulated parallel annealing within a neighborhood (SPAN) was designed to minimize interprocessor communication time and closely retain the heuristics of the serial SA algorithm. The computational speed of the SPAN algorithm scaled linearly with the number of processors on different computer platforms for a simple quadratic test problem and for a more complex forward dynamic simulation of human pedaling.
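For readers unfamiliar with the serial baseline that SPAN is said to retain, the following is a generic simulated annealing sketch on a simple quadratic test problem. It is not the SPAN algorithm and not the authors' code: the function names, the geometric cooling schedule, the uniform step proposal, and every parameter value are assumptions chosen only for illustration.

```python
import math
import random


def quadratic(x):
    """Simple quadratic test cost with its minimum at the origin."""
    return sum(xi * xi for xi in x)


def anneal(cost, x0, step=0.5, t0=1.0, cooling=0.95, iters=2000, seed=0):
    """Generic serial simulated annealing (illustrative parameters)."""
    rng = random.Random(seed)
    x, fx = list(x0), cost(x0)
    best, fbest = list(x), fx
    t = t0
    for _ in range(iters):
        # Propose a random neighbour of the current point.
        cand = [xi + rng.uniform(-step, step) for xi in x]
        fc = cost(cand)
        # Metropolis rule: always accept improvements, and accept
        # uphill moves with probability exp(-(fc - fx) / t).
        if fc <= fx or rng.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = list(x), fx
        t *= cooling  # geometric cooling schedule
    return best, fbest


start = [3.0, -2.0]
best, fbest = anneal(quadratic, start)
print(quadratic(start), fbest)
```

A parallel variant would typically distribute the candidate evaluations across processors; how SPAN does this within a neighborhood is described in the paper itself, not here.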
Analysis and optimization of gyrokinetic toroidal simulations on homogenous and heterogenous platforms

DOE PAGES

Ibrahim, Khaled Z.; Madduri, Kamesh; Williams, Samuel; ...

2013-07-18

The Gyrokinetic Toroidal Code (GTC) uses the particle-in-cell method to efficiently simulate plasma microturbulence. This paper presents novel analysis and optimization techniques to enhance the performance of GTC on large-scale machines. We introduce cell access analysis to better manage locality vs. synchronization tradeoffs on CPU and GPU-based architectures. Finally, our optimized hybrid parallel implementation of GTC uses MPI, OpenMP, and NVIDIA CUDA, achieves up to a 2× speedup over the reference Fortran version on multiple parallel systems, and scales efficiently to tens of thousands of cores.
A Parallel Genetic Algorithm to Discover Patterns in Genetic Markers that Indicate Predisposition to Multifactorial Disease

PubMed Central

Rausch, Tobias; Thomas, Alun; Camp, Nicola J.; Cannon-Albright, Lisa A.; Facelli, Julio C.

2008-01-01

This paper describes a novel algorithm to analyze genetic linkage data using pattern recognition techniques and genetic algorithms (GA). The method allows a search for regions of the chromosome that may contain genetic variations that jointly predispose individuals for a particular disease. The method uses correlation analysis, filtering theory and genetic algorithms (GA) to achieve this goal. Because current genome scans use from hundreds to hundreds of thousands of markers, two versions of the method have been implemented. The first is an exhaustive analysis version that can be used to visualize, explore, and analyze small genetic data sets for two marker correlations; the second is a GA version, which uses a parallel implementation allowing searches of higher-order correlations in large data sets. Results on simulated data sets indicate that the method can be informative in the identification of major disease loci and gene-gene interactions in genome-wide linkage data and that further exploration of these techniques is justified. The results presented for both variants of the method show that it can help genetic epidemiologists to identify promising combinations of genetic factors that might predispose to complex disorders. In particular, the correlation analysis of IBD expression patterns might hint to possible gene-gene interactions and the filtering might be a fruitful approach to distinguish true correlation signals from noise. PMID:18547558
Development of the PedsQL™ Epilepsy Module: Focus group and cognitive interviews.

PubMed

Follansbee-Junger, Katherine W; Mann, Krista A; Guilfoyle, Shanna M; Morita, Diego A; Varni, James W; Modi, Avani C

2016-09-01

Youth with epilepsy have impaired health-related quality of life (HRQOL). Existing epilepsy-specific HRQOL measures are limited by not having parallel self- and parent-proxy versions, having a restricted age range, not being inclusive of children with developmental disabilities, or being too lengthy for use in a clinical setting. Generic HRQOL measures do not adequately capture the idiosyncrasies of epilepsy. The purpose of the present study was to develop items and content validity for the PedsQL™ Epilepsy Module. An iterative qualitative process of conducting focus group interviews with families of children with epilepsy, obtaining expert input, and conducting cognitive interviews and debriefing was utilized to develop empirically derived content for the instrument. Eleven health providers with expertise in pediatric epilepsy from across the country provided feedback on the conceptual model and content, including epileptologists, nurse practitioners, social workers, and psychologists. Ten pediatric patients (age 4-16 years) with a diagnosis of epilepsy and 11 parents participated in focus groups. Thirteen pediatric patients (age 5-17 years) and 17 parents participated in cognitive interviews. Focus groups, expert input, and cognitive debriefing resulted in 6 final domains including restrictions, seizure management, cognitive/executive functioning, social, sleep/fatigue, and mood/behavior. Patient self-report versions ranged from 30 to 33 items and parent proxy-report versions ranged from 26 to 33 items, with the toddler and young child versions having fewer items. Standardized qualitative methodology was employed to develop the items and content for the novel PedsQL™ Epilepsy Module. The PedsQL™ Epilepsy Module has the potential to enhance clinical decision-making in pediatric epilepsy by capturing and monitoring important patient-identified contributors to HRQOL. Copyright © 2016 Elsevier Inc. All rights reserved.
Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chin, George; Marquez, Andres; Choudhury, Sutanay

2012-09-01

Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.
Efficient Iterative Methods Applied to the Solution of Transonic Flows

NASA Astrophysics Data System (ADS)

Wissink, Andrew M.; Lyrintzis, Anastasios S.; Chronopoulos, Anthony T.

1996-02-01

We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton's method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested: a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GMRES method. The preconditioner is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-Iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton-GMRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel Thinking Machines CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems.
Open-Source Software for Modeling of Nanoelectronic Devices

NASA Technical Reports Server (NTRS)

Oyafuso, Fabiano; Hua, Hook; Tisdale, Edwin; Hart, Don

2004-01-01

The Nanoelectronic Modeling 3-D (NEMO 3-D) computer program has been upgraded to open-source status through elimination of license-restricted components. The present version functions equivalently to the version reported in "Software for Numerical Modeling of Nanoelectronic Devices" (NPO-30520), NASA Tech Briefs, Vol. 27, No. 11 (November 2003), page 37. To recapitulate: NEMO 3-D performs numerical modeling of the electronic transport and structural properties of a semiconductor device that has overall dimensions of the order of tens of nanometers. The underlying mathematical model represents the quantum-mechanical behavior of the device resolved to the atomistic level of granularity. NEMO 3-D solves the applicable quantum matrix equation on a Beowulf-class cluster computer by use of a parallel-processing matrix vector multiplication algorithm coupled to a Lanczos and/or Rayleigh-Ritz algorithm that solves for eigenvalues. A prior upgrade of NEMO 3-D incorporated a capability for a strain treatment, parameterized for bulk material properties of GaAs and InAs, for two tight-binding submodels. NEMO 3-D has been demonstrated in atomistic analyses of effects of disorder in alloys and, in particular, in bulk In(x)Ga(1-x)As and in In(0.6)Ga(0.4)As quantum dots.
Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm

NASA Technical Reports Server (NTRS)

Povitsky, A.

1998-01-01

In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines. This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty by about a factor of two relative to the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
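The forward and backward steps mentioned in this abstract are the two sweeps of the standard serial Thomas algorithm for a tridiagonal system. The sketch below shows only that serial building block, not the pipelined or reformulated parallel version from the report; the function name `thomas_solve` and the array layout (with `a[0]` and `c[-1]` unused) are assumptions made for illustration.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the serial Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    No pivoting is done, so the matrix is assumed to be
    diagonally dominant or otherwise well behaved.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n

    # Forward step: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Backward step: back-substitute from the last unknown.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


# 4x4 system whose exact solution is [1, 2, 3, 4].
a = [0.0, 1.0, 1.0, 1.0]
b = [4.0, 4.0, 4.0, 4.0]
c = [1.0, 1.0, 1.0, 0.0]
d = [6.0, 12.0, 18.0, 19.0]
print(thomas_solve(a, b, c, d))
```

Because each sweep carries a recurrence along the line, a processor that owns only part of a line must wait for its neighbour, which is the idle time the abstract proposes to fill with other work.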
A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.

PubMed

Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael H F

2018-03-01

Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about better than the fastest sequential algorithm and speed-up goes up to on 64 threads.
A heterogeneous computing accelerated SCE-UA global optimization method using OpenMP, OpenCL, CUDA, and OpenACC.

PubMed

Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Liang, Ke; Hong, Yang

2017-10-01

The shuffled complex evolution optimization developed at the University of Arizona (SCE-UA) has been successfully applied in various kinds of scientific and engineering optimization applications, such as hydrological model parameter calibration, for many years. The algorithm possesses good global optimality, convergence stability and robustness. However, benchmark and real-world applications reveal the poor computational efficiency of the SCE-UA. This research aims at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA is implemented on Intel Xeon multi-core CPU (by using OpenMP and OpenCL) and NVIDIA Tesla many-core GPU (by using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA were tested based on the Griewank benchmark function. Comparison results indicate the parallel SCE-UA significantly improves computational efficiency compared to the original serial version. The OpenCL implementation obtains the best overall acceleration results, albeit with the most complex source code. The parallel SCE-UA has bright prospects to be applied in real-world applications.
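The Griewank benchmark named in this abstract has a standard closed form, f(x) = 1 + sum(x_i^2)/4000 - prod(cos(x_i / sqrt(i))), with a global minimum of 0 at the origin. The sketch below evaluates it for a small population of candidate points; the helper name `evaluate_population` and the sample points are assumptions for illustration and are not taken from the paper. Each evaluation is independent of the others, which is what makes this step a natural target for OpenMP, OpenCL, or CUDA style parallelization.

```python
import math


def griewank(x):
    """Standard Griewank function; global minimum 0 at x = 0."""
    quadratic_term = sum(xi * xi for xi in x) / 4000.0
    cosine_term = math.prod(
        math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, start=1)
    )
    return 1.0 + quadratic_term - cosine_term


def evaluate_population(points):
    """Evaluate every candidate point; each call is independent."""
    return [griewank(p) for p in points]


population = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [-10.0, 5.0, 0.5]]
print(evaluate_population(population))
```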
Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM

NASA Astrophysics Data System (ADS)

de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl

2002-03-01

We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 10^5-10^6) in reasonable CPU times. The performances of our implementation of the parallel code on an Origin 2000 supercomputer are presented and discussed. They exhibit very good speedup behavior and low load unbalancing. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.

ModelMate - A graphical user interface for model analysis

USGS Publications Warehouse

Banta, Edward R.

2011-01-01

ModelMate is a graphical user interface designed to facilitate use of model-analysis programs with models. This initial version of ModelMate supports one model-analysis program, UCODE_2005, and one model software program, MODFLOW-2005. ModelMate can be used to prepare input files for UCODE_2005, run UCODE_2005, and display analysis results. A link to the GW_Chart graphing program facilitates visual interpretation of results. ModelMate includes capabilities for organizing directories used with the parallel-processing capabilities of UCODE_2005 and for maintaining files in those directories to be identical to a set of files in a master directory. ModelMate can be used on its own or in conjunction with ModelMuse, a graphical user interface for MODFLOW-2005 and PHAST.
Bringing Proximate Neighbours into the Study of US Residential Segregation

PubMed Central

Friedman, Samantha

2011-01-01

The race and ethnicity of neighbours are thought to be critical in shaping household mobility underlying residential segregation. However, studies on this topic have used data at the census-tract level of analysis rather than at the proximate-neighbour level. Using a non-publicly available version of the neighbour-cluster sample within the American Housing Survey, this study incorporates data on the race, ethnicity and socioeconomic characteristics of the proximate neighbours of White, Black and Latino households and examines their impact on household residential satisfaction, out- and in-mobility. Results indicate that proximate-neighbour race and ethnicity matter in influencing endpoints of the mobility process and do not necessarily parallel those at the census-tract level. Implications of these findings are discussed as they relate to the study of residential segregation. PMID:21544258
Advancing MODFLOW Applying the Derived Vector Space Method

NASA Astrophysics Data System (ADS)

Herrera, G. S.; Herrera, I.; Lemus-García, M.; Hernandez-Garcia, G. D.

2015-12-01

The most effective domain decomposition methods (DDM) are non-overlapping DDMs. Recently a new approach, the DVS-framework, based on an innovative discretization method that uses a non-overlapping system of nodes (the derived-nodes), was introduced and developed by I. Herrera et al. [1, 2]. Using the DVS-approach, a group of four algorithms, referred to as the 'DVS-algorithms', which fulfill the DDM-paradigm (i.e. the solution of global problems is obtained by resolution of local problems exclusively), has been derived. Such procedures are applicable to any boundary-value problem, or system of such equations, for which a standard discretization method is available, and software with a high degree of parallelization can then be constructed. In a parallel talk at this AGU Fall Meeting, Ismael Herrera will introduce the general DVS methodology. The application of the DVS-algorithms has been demonstrated in the solution of several boundary-value problems of interest in Geophysics. Numerical examples for a single equation, for the cases of symmetric, non-symmetric and indefinite problems, were demonstrated before [1, 2]. For these problems DVS-algorithms exhibited significantly improved numerical performance with respect to standard versions of DDM algorithms. In view of these results, our research group is in the process of applying the DVS method to a widely used simulator for the first time. Here we present the advances in the application of this method to the parallelization of MODFLOW. Efficiency results for a group of tests will be presented. References: [1] I. Herrera, L.M. de la Cruz and A. Rosas-Medina. Non overlapping discretization methods for partial differential equations, Numer Meth Part D E, (2013). [2] I. Herrera and Iván Contreras. "An Innovative Tool for Effectively Applying Highly Parallelized Software To Problems of Elasticity". Geofísica Internacional, 2015 (In press).
Xyce Parallel Electronic Simulator Users' Guide Version 6.8

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. (2) A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. (3) Device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only). (4) Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase (a message-passing parallel implementation), which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.
Solution of the Skyrme-Hartree-Fock-Bogolyubov equations in the Cartesian deformed harmonic-oscillator basis. (VII) HFODD (v2.49t): A new version of the program

NASA Astrophysics Data System (ADS)

Schunck, N.; Dobaczewski, J.; McDonnell, J.; Satuła, W.; Sheikh, J. A.; Staszczak, A.; Stoitsov, M.; Toivanen, P.

2012-01-01

We describe the new version (v2.49t) of the code HFODD which solves the nuclear Skyrme-Hartree-Fock (HF) or Skyrme-Hartree-Fock-Bogolyubov (HFB) problem by using the Cartesian deformed harmonic-oscillator basis. In the new version, we have implemented the following physics features: (i) the isospin mixing and projection, (ii) the finite-temperature formalism for the HFB and HF + BCS methods, (iii) the Lipkin translational energy correction method, (iv) the calculation of the shell correction. A number of specific numerical methods have also been implemented in order to deal with large-scale multi-constraint calculations and hardware limitations: (i) the two-basis method for the HFB method, (ii) the Augmented Lagrangian Method (ALM) for multi-constraint calculations, (iii) the linear constraint method based on the approximation of the RPA matrix for multi-constraint calculations, (iv) an interface with the axial and parity-conserving Skyrme-HFB code HFBTHO, (v) the mixing of the HF or HFB matrix elements instead of the HF fields. Special care has been paid to using the code on massively parallel leadership class computers. For this purpose, the following features are now available with this version: (i) the Message Passing Interface (MPI) framework, (ii) scalable input data routines, (iii) multi-threading via OpenMP pragmas, (iv) parallel diagonalization of the HFB matrix in the simplex-breaking case using the ScaLAPACK library. Finally, several errors of little significance in the previous published version were corrected. New version program summary. Program title: HFODD (v2.49t). Catalogue identifier: ADFL_v3_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADFL_v3_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: GNU General Public Licence v3. No. of lines in distributed program, including test data, etc.: 190 614. No.
of bytes in distributed program, including test data, etc.: 985 898 Distribution format: tar.gz Programming language: FORTRAN-90 Computer: Intel Pentium-III, Intel Xeon, AMD-Athlon, AMD-Opteron, Cray XT4, Cray XT5 Operating system: UNIX, LINUX, Windows XP Has the code been vectorized or parallelized?: Yes, parallelized using MPI RAM: 10 Mwords Word size: The code is written in single-precision for the use on a 64-bit processor. The compiler option -r8 or +autodblpad (or equivalent) has to be used to promote all real and complex single-precision floating-point items to double precision when the code is used on a 32-bit machine. Classification: 17.22 Catalogue identifier of previous version: ADFL_v2_2 Journal reference of previous version: Comput. Phys. Comm. 180 (2009) 2361 External routines: The user must have access to the NAGLIB subroutine f02axe, or LAPACK subroutines zhpev, zhpevx, zheevr, or zheevd, which diagonalize complex hermitian matrices, the LAPACK subroutines dgetri and dgetrf which invert arbitrary real matrices, the LAPACK subroutines dsyevd, dsytrf and dsytri which compute eigenvalues and eigenfunctions of real symmetric matrices, the LINPACK subroutines zgedi and zgeco, which invert arbitrary complex matrices and calculate determinants, the BLAS routines dcopy, dscal, dgeem and dgemv for double-precision linear algebra and zcopy, zdscal, zgeem and zgemv for complex linear algebra, or provide another set of subroutines that can perform such tasks. The BLAS and LAPACK subroutines can be obtained from the Netlib Repository at the University of Tennessee, Knoxville: http://netlib2.cs.utk.edu/. Does the new version supersede the previous version?: Yes Nature of problem: The nuclear mean field and an analysis of its symmetries in realistic cases are the main ingredients of a description of nuclear states. 
Within the Local Density Approximation, or for a zero-range velocity-dependent Skyrme interaction, the nuclear mean field is local and velocity dependent. The locality allows for an effective and fast solution of the self-consistent Hartree-Fock equations, even for heavy nuclei, and for various nucleonic ( n-particle- n-hole) configurations, deformations, excitation energies, or angular momenta. Similarly, Local Density Approximation in the particle-particle channel, which is equivalent to using a zero-range interaction, allows for a simple implementation of pairing effects within the Hartree-Fock-Bogolyubov method. Solution method: The program uses the Cartesian harmonic oscillator basis to expand single-particle or single-quasiparticle wave functions of neutrons and protons interacting by means of the Skyrme effective interaction and zero-range pairing interaction. The expansion coefficients are determined by the iterative diagonalization of the mean-field Hamiltonians or Routhians which depend non-linearly on the local neutron and proton densities. Suitable constraints are used to obtain states corresponding to a given configuration, deformation or angular momentum. The method of solution has been presented in: [J. Dobaczewski, J. Dudek, Comput. Phys. Commun. 102 (1997) 166]. Reasons for new version: Version 2.49s of HFODD provides a number of new options such as the isospin mixing and projection of the Skyrme functional, the finite-temperature HF and HFB formalism and optimized methods to perform multi-constrained calculations. It is also the first version of HFODD to contain threading and parallel capabilities. Summary of revisions: Isospin mixing and projection of the HF states has been implemented. The finite-temperature formalism for the HFB equations has been implemented. The Lipkin translational energy correction method has been implemented. Calculation of the shell correction has been implemented. 
The two-basis method for the solution to the HFB equations has been implemented. The Augmented Lagrangian Method (ALM) for calculations with multiple constraints has been implemented. The linear constraint method based on the cranking approximation of the RPA matrix has been implemented. An interface between HFODD and the axially-symmetric and parity-conserving code HFBTHO has been implemented. The mixing of the matrix elements of the HF or HFB matrix has been implemented. A parallel interface using the MPI library has been implemented. A scalable model for reading input data has been implemented. OpenMP pragmas have been implemented in three subroutines. The diagonalization of the HFB matrix in the simplex-breaking case has been parallelized using the ScaLAPACK library. Several little significant errors of the previous published version were corrected. Running time: In serial mode, running 6 HFB iterations for 152Dy for conserved parity and signature symmetries in a full spherical basis of N=14 shells takes approximately 8 min on an AMD Opteron processor at 2.6 GHz, assuming standard BLAS and LAPACK libraries. As a rule of thumb, runtime for HFB calculations for parity and signature conserved symmetries roughly increases as N, where N is the number of full HO shells. Using custom-built optimized BLAS and LAPACK libraries (such as in the ATLAS implementation) can bring down the execution time by 60%. Using the threaded version of the code with 12 threads and threaded BLAS libraries can bring an additional factor 2 speed-up, so that the same 6 HFB iterations now take of the order of 2 min 30 s.
BOOK REVIEW: Numerical Recipes in C++: The Art of Scientific Computing (2nd edn); Numerical Recipes Example Book (C++) (2nd edn); Numerical Recipes Multi-Language Code CD ROM with LINUX or UNIX Single-Screen License Revised Version

NASA Astrophysics Data System (ADS)

Press, William H.; Teukolsky, Saul A.; Vetterling, William T.; Flannery, Brian P.

2003-05-01

The two Numerical Recipes books are marvellous. The principal book, The Art of Scientific Computing, contains program listings for almost every conceivable requirement, and it also contains a well written discussion of the algorithms and the numerical methods involved. The Example Book provides a complete driving program, with helpful notes, for nearly all the routines in the principal book. The first edition of Numerical Recipes: The Art of Scientific Computing was published in 1986 in two versions, one with programs in Fortran, the other with programs in Pascal. There were subsequent versions with programs in BASIC and in C. The second, enlarged edition was published in 1992, again in two versions, one with programs in Fortran (NR(F)), the other with programs in C (NR(C)). In 1996 the authors produced Numerical Recipes in Fortran 90: The Art of Parallel Scientific Computing as a supplement, called Volume 2, with the original (Fortran) version referred to as Volume 1. Numerical Recipes in C++ (NR(C++)) is another version of the 1992 edition. The numerical recipes are also available on a CD ROM: if you want to use any of the recipes, I would strongly advise you to buy the CD ROM. The CD ROM contains the programs in all the languages. When the first edition was published I bought it, and have also bought copies of the other editions as they have appeared. Anyone involved in scientific computing ought to have a copy of at least one version of Numerical Recipes, and there also ought to be copies in every library. If you already have NR(F), should you buy the NR(C++) and, if not, which version should you buy? In the preface to Volume 2 of NR(F), the authors say 'C and C++ programmers have not been far from our minds as we have written this volume, and we think that you will find that time spent in absorbing its principal lessons will be amply repaid in the future as C and C++ eventually develop standard parallel extensions'. 
In the preface and introduction to NR(C++), the authors point out some of the problems in the use of C++ in scientific computing. I have not found any mention of parallel computing in NR(C++). Fortran has quite a lot going for it. As someone who has used it in most of its versions from Fortran II, I have seen it develop and leave behind other languages promoted by various enthusiasts: who now uses Algol or Pascal? I think it unlikely that C++ will disappear: it was devised as a systems language, and can also be used for other purposes such as scientific computing. It is possible that Fortran will disappear, but Fortran has the strengths that it can develop, that there are extensive Fortran subroutine libraries, and that it has been developed for parallel computing. To argue with programmers as to which is the best language to use is sterile. If you wish to use C++, then buy NR(C++), but you should also look at volume 2 of NR(F). If you are a Fortran programmer, then make sure you have NR(F), volumes 1 and 2. But whichever language you use, make sure you have one version or the other, and the CD ROM. The Example Book provides listings of complete programs to run nearly all the routines in NR, frequently based on cases where an analytical solution is available. It is helpful when developing a new program incorporating an unfamiliar routine to see that routine actually working, and this is what the programs in the Example Book achieve. I started teaching computational physics before Numerical Recipes was published. If I were starting again, I would make heavy use of both The Art of Scientific Computing and of the Example Book. Every computational physics teaching laboratory should have both volumes: the programs in the Example Book are included on the CD ROM, but the extra commentary in the book itself is of considerable value. P Borcherds
THE PLUTO CODE FOR ADAPTIVE MESH COMPUTATIONS IN ASTROPHYSICAL FLUID DYNAMICS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mignone, A.; Tzeferacos, P.; Zanni, C.

We present a description of the adaptive mesh refinement (AMR) implementation of the PLUTO code for solving the equations of classical and special relativistic magnetohydrodynamics (MHD and RMHD). The current release exploits, in addition to the static grid version of the code, the distributed infrastructure of the CHOMBO library for multidimensional parallel computations over block-structured, adaptively refined grids. We employ a conservative finite-volume approach where primary flow quantities are discretized at the cell center in a dimensionally unsplit fashion using the Corner Transport Upwind method. Time stepping relies on a characteristic tracing step where piecewise parabolic method, weighted essentially non-oscillatory, or slope-limited linear interpolation schemes can be handily adopted. A characteristic decomposition-free version of the scheme is also illustrated. The solenoidal condition of the magnetic field is enforced by augmenting the equations with a generalized Lagrange multiplier providing propagation and damping of divergence errors through a mixed hyperbolic/parabolic explicit cleaning step. Among the novel features, we describe an extension of the scheme to include non-ideal dissipative processes, such as viscosity, resistivity, and anisotropic thermal conduction without operator splitting. Finally, we illustrate an efficient treatment of point-local, potentially stiff source terms over hierarchical nested grids by taking advantage of the adaptivity in time. Several multidimensional benchmarks and applications to problems of astrophysical relevance assess the potentiality of the AMR version of PLUTO in resolving flow features separated by large spatial and temporal disparities.
SMMP v. 3.0—Simulating proteins and protein interactions in Python and Fortran

NASA Astrophysics Data System (ADS)

Meinke, Jan H.; Mohanty, Sandipan; Eisenmenger, Frank; Hansmann, Ulrich H. E.

2008-03-01

We describe a revised and updated version of the program package SMMP. SMMP is an open-source FORTRAN package for molecular simulation of proteins within the standard geometry model. It is designed as a simple and inexpensive tool for researchers and students to become familiar with protein simulation techniques. SMMP 3.0 sports a revised API increasing its flexibility, an implementation of the Lund force field, multi-molecule simulations, a parallel implementation of the energy function, Python bindings, and more. Program summary Title of program: SMMP Catalogue identifier: ADOJ_v3_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADOJ_v3_0.html Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html Programming language used: FORTRAN, Python No. of lines in distributed program, including test data, etc.: 52 105 No. of bytes in distributed program, including test data, etc.: 599 150 Distribution format: tar.gz Computer: Platform independent Operating system: OS independent RAM: 2 Mbytes Classification: 3 Does the new version supersede the previous version?: Yes Nature of problem: Molecular mechanics computations and Monte Carlo simulation of proteins. Solution method: Utilizes ECEPP2/3, FLEX, and Lund potentials. Includes Monte Carlo simulation algorithms for canonical, as well as for generalized ensembles. Reasons for new version: API changes and increased functionality. Summary of revisions: Added Lund potential; parameters used in subroutines are now passed as arguments; multi-molecule simulations; parallelized energy calculation for ECEPP; Python bindings. Restrictions: The consumed CPU time increases with the size of protein molecule. Running time: Depends on the size of the simulated molecule.
Psychometric properties of the Haitian Creole version of the Resilience Scale with a sample of adult survivors of the 2010 earthquake.

PubMed

Cénat, Jude Mary; Derivois, Daniel; Hébert, Martine; Eid, Patricia; Mouchenik, Yoram

2015-11-01

Resilience is defined as the ability of people to cope with disasters and significant life adversities. The present paper aims to investigate the underlying structure of the Creole version of the Resilience Scale and its psychometric properties using a sample of adult survivors of the 2010 earthquake. A parallel analysis was conducted to determine the number of factors to extract, and confirmatory factor analysis was performed using a sample of 1355 adult survivors of the 2010 earthquake, drawn from the areas where the earthquake occurred, with an average age of 31.57 (SD=14.42). All participants completed the Creole version of the Resilience Scale (RS), the Impact of Event Scale Revised (IES-R), the Beck Depression Inventory (BDI) and the Social Support Questionnaire (SQQ-6). To facilitate exploratory (EFA) and confirmatory factor analysis (CFA), the sample was divided into two subsamples (subsample 1 for EFA and subsample 2 for CFA). Parallel analysis and confirmatory factor analysis results showed a good-fit 3-factor structure. The Cronbach α coefficient was .79, .74 and .72 for factors 1, 2 and 3, respectively, and the factors were correlated with each other. Construct validity of the Resilience Scale was supported by significant correlations with measures of depression and social support satisfaction, but no correlation was found with the posttraumatic stress disorder measure, except for factor 2. The results reveal a different factorial structure including 25 items of the RS. However, the Haitian Creole version of the RS is a valid and reliable measure for assessing resilience in adults in Haiti. Copyright © 2015 Elsevier Inc. All rights reserved.
Introducing PROFESS 2.0: A parallelized, fully linear scaling program for orbital-free density functional theory calculations

NASA Astrophysics Data System (ADS)

Hung, Linda; Huang, Chen; Shin, Ilgyou; Ho, Gregory S.; Lignères, Vincent L.; Carter, Emily A.

2010-12-01

Orbital-free density functional theory (OFDFT) is a first principles quantum mechanics method to find the ground-state energy of a system by variationally minimizing with respect to the electron density. No orbitals are used in the evaluation of the kinetic energy (unlike Kohn-Sham DFT), and the method scales nearly linearly with the size of the system. The PRinceton Orbital-Free Electronic Structure Software (PROFESS) uses OFDFT to model materials from the atomic scale to the mesoscale. This new version of PROFESS allows the study of larger systems with two significant changes: PROFESS is now parallelized, and the ion-electron and ion-ion terms scale quasilinearly, instead of quadratically as in PROFESS v1 (L. Hung and E.A. Carter, Chem. Phys. Lett. 475 (2009) 163). At the start of a run, PROFESS reads the various input files that describe the geometry of the system (ion positions and cell dimensions), the type of elements (defined by electron-ion pseudopotentials), the actions you want it to perform (minimize with respect to electron density and/or ion positions and/or cell lattice vectors), and the various options for the computation (such as which functionals you want it to use). Based on these inputs, PROFESS sets up a computation and performs the appropriate optimizations. Energies, forces, stresses, material geometries, and electron density configurations are some of the values that can be output throughout the optimization. New version program summary Program Title: PROFESS Catalogue identifier: AEBN_v2_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEBN_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 68 721 No.
of bytes in distributed program, including test data, etc.: 1 708 547 Distribution format: tar.gz Programming language: Fortran 90 Computer: Intel with ifort; AMD Opteron with pathf90 Operating system: Linux Has the code been vectorized or parallelized?: Yes. Parallelization is implemented through domain decomposition using MPI. RAM: Problem dependent, but 2 GB is sufficient for up to 10,000 ions. Classification: 7.3 External routines: FFTW 2.1.5 (http://www.fftw.org) Catalogue identifier of previous version: AEBN_v1_0 Journal reference of previous version: Comput. Phys. Comm. 179 (2008) 839 Does the new version supersede the previous version?: Yes Nature of problem: Given a set of coordinates describing the initial ion positions under periodic boundary conditions, recovers the ground state energy, electron density, ion positions, and cell lattice vectors predicted by orbital-free density functional theory. The computation of all terms is effectively linear scaling. Parallelization is implemented through domain decomposition, and up to ~10,000 ions may be included in the calculation on just a single processor, limited by RAM. For example, when optimizing the geometry of ~50,000 aluminum ions (plus vacuum) on 48 cores, a single iteration of conjugate gradient ion geometry optimization takes ~40 minutes wall time. However, each CG geometry step requires two or more electron density optimizations, so step times will vary. Solution method: Computes energies as described in text; minimizes this energy with respect to the electron density, ion positions, and cell lattice vectors. Reasons for new version: To allow much larger systems to be simulated using PROFESS. Restrictions: PROFESS cannot use nonlocal (such as ultrasoft) pseudopotentials. A variety of local pseudopotential files are available at the Carter group website ( http://www.princeton.edu/mae/people/faculty/carter/homepage/research/localpseudopotentials/). 
Also, due to the current state of the kinetic energy functionals, PROFESS is only reliable for main group metals and some properties of semiconductors. Running time: Problem dependent: the test example provided with the code takes less than a second to run. Timing results for large scale problems are given in the PROFESS paper and Ref. [1].
The source of dual-task limitations: Serial or parallel processing of multiple response selections?

PubMed Central

Marois, René

2014-01-01

Although it is generally recognized that the concurrent performance of two tasks incurs costs, the sources of these dual-task costs remain controversial. The serial bottleneck model suggests that serial postponement of task performance in dual-task conditions results from a central stage of response selection that can only process one task at a time. Cognitive-control models, by contrast, propose that multiple response selections can proceed in parallel, but that serial processing of task performance is predominantly adopted because its processing efficiency is higher than that of parallel processing. In the present study, we empirically tested this proposition by examining whether parallel processing would occur when it was more efficient and financially rewarded. The results indicated that even when parallel processing was more efficient and was incentivized by financial reward, participants still failed to process tasks in parallel. We conclude that central information processing is limited by a serial bottleneck. PMID:23864266
Calibration of hydrological model with programme PEST

NASA Astrophysics Data System (ADS)

Brilly, Mitja; Vidmar, Andrej; Kryžanowski, Andrej; Bezak, Nejc; Šraj, Mojca

2016-04-01

PEST is a tool based on minimization of an objective function related to the root mean square error between the model output and the measurements. We use the "singular value decomposition" section of the PEST control file and the Tikhonov regularization method to estimate model parameters successfully. PEST sometimes fails when the inverse problem is ill-posed, but singular value decomposition (SVD) ensures that PEST maintains numerical stability. The choice of initial parameter values is an important issue in PEST and requires expert knowledge. Thanks to the flexible nature of the PEST software and its ability to be applied to whole catchments at once, calibration performed extremely well across a large number of sub-catchments. The parallel computing version of PEST, called BeoPEST, was used successfully to speed up the calibration process. BeoPEST employs smart slaves and point-to-point communications to transfer data between the master and slave computers. The HBV-light model is a simple multi-tank-type model for simulating precipitation-runoff. It is a conceptual balance model of catchment hydrology which simulates discharge using rainfall, temperature and estimates of potential evaporation. The HBV-light-CLI version allows the user to run HBV-light from the command line. Input and results files are in XML form, which makes it easy to connect the model with other applications such as pre- and post-processing utilities and PEST itself. The procedure was applied to a hydrological model of the Savinja catchment (1852 km²), which consists of twenty-one sub-catchments. Data are processed at an hourly time step.
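
The kind of objective function the abstract above describes can be illustrated with a minimal sketch. This is not PEST's implementation or file format: the function name `regularised_objective`, the weights, and the regularisation weight `mu` are hypothetical, and the Tikhonov term is assumed here to penalise departure from preferred parameter values.

```python
def regularised_objective(observed, simulated, params, preferred,
                          weights=None, mu=1.0):
    """Weighted sum of squared residuals plus a Tikhonov-style penalty.

    All names are illustrative assumptions, not PEST identifiers.
    """
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have equal length")
    if weights is None:
        weights = [1.0] * len(observed)
    # Measurement misfit: how far the model output is from the observations.
    misfit = sum(w * (o - s) ** 2
                 for w, o, s in zip(weights, observed, simulated))
    # Regularisation: pull parameters toward preferred (prior) values,
    # which stabilises an otherwise ill-posed inverse problem.
    penalty = sum((p - q) ** 2 for p, q in zip(params, preferred))
    return misfit + mu * penalty


print(regularised_objective([1.0, 2.0, 3.0], [1.0, 2.0, 5.0],
                            params=[2.0], preferred=[1.0], mu=0.5))
```

A larger `mu` trades fit quality for parameter values closer to the preferred ones; a calibration tool searches for the parameter set that minimises this combined value.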
Testing New Programming Paradigms with NAS Parallel Benchmarks

NASA Technical Reports Server (NTRS)

Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

2000-01-01

Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with increasing complexity of real applications. Technologies have been developing to aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. In recent years, new efforts have been made to define new parallel programming paradigms. The best examples are HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplifying the tedious tasks encountered in writing message passing programs. HPF is independent of memory hierarchy; however, due to the immaturity of compiler technology, its performance is still questionable. Although use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promises portability in a heterogeneous environment and offers the possibility to "compile once and run anywhere." In light of testing these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java-threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. 
Optimization of memory and cache usage was applied to several benchmarks, notably BT and SP, resulting in better sequential performance. In order to overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as lack of support for the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outer-most loop level to achieve the largest granularity. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size in the figure on the next page. These results were obtained on an SGI Origin2000 (195MHz) with MIPSpro-f77 compiler 7.2.1 for OpenMP and MPI codes and PGI pghpf-2.4.3 compiler with MPI interface for HPF programs.
Modifying the Test of Understanding Graphs in Kinematics

ERIC Educational Resources Information Center

Zavala, Genaro; Tejeda, Santa; Barniol, Pablo; Beichner, Robert J.

2017-01-01

In this article, we present several modifications to the Test of Understanding Graphs in Kinematics. The most significant changes are (i) the addition and removal of items to achieve parallelism in the objectives (dimensions) of the test, thus allowing comparisons of students' performance that were not possible with the original version, and (ii)…
Parallel Activation in Bilingual Phonological Processing

ERIC Educational Resources Information Center

Lee, Su-Yeon

2011-01-01

In bilingual language processing, the parallel activation hypothesis suggests that bilinguals activate their two languages simultaneously during language processing. Support for the parallel activation mainly comes from studies of lexical (word-form) processing, with relatively less attention to phonological (sound) processing. According to…
Particle-in-Cell laser-plasma simulation on Xeon Phi coprocessors

NASA Astrophysics Data System (ADS)

Surmin, I. A.; Bastrakov, S. I.; Efimenko, E. S.; Gonoskov, A. A.; Korzhimanov, A. V.; Meyerov, I. B.

2016-05-01

This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for Xeon Phi architecture and present our experience in the porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting without code modification gives performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step optimization techniques, such as improving data locality, enhancing parallelization efficiency and vectorization leading to an overall 4.2 × speedup on CPU and 7.5 × on Xeon Phi compared to the baseline version. The optimized version achieves 16.9 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. For a real problem of laser ion acceleration in targets with surface grating, where a large number of macroparticles per cell is required, the speedup of Xeon Phi compared to CPU is 1.6 ×.
A sampling and classification item selection approach with content balancing.

PubMed

Chen, Pei-Hua

2015-03-01

Existing automated test assembly methods typically employ constrained combinatorial optimization. Constructing forms sequentially based on an optimization approach usually results in unparallel forms and requires heuristic modifications. Methods based on a random search approach have the major advantage of producing parallel forms sequentially without further adjustment. This study incorporated a flexible content-balancing element into the statistical perspective item selection method of the cell-only method (Chen et al. in Educational and Psychological Measurement, 72(6), 933-953, 2012). The new method was compared with a sequential interitem distance weighted deviation model (IID WDM) (Swanson & Stocking in Applied Psychological Measurement, 17(2), 151-166, 1993), a simultaneous IID WDM, and a big-shadow-test mixed integer programming (BST MIP) method to construct multiple parallel forms based on matching a reference form item-by-item. The results showed that the cell-only method with content balancing and the sequential and simultaneous versions of IID WDM yielded results comparable to those obtained using the BST MIP method. The cell-only method with content balancing is computationally less intensive than the sequential and simultaneous versions of IID WDM.
Fast, Parallel and Secure Cryptography Algorithm Using Lorenz's Attractor

NASA Astrophysics Data System (ADS)

Marco, Anderson Gonçalves; Martinez, Alexandre Souto; Bruno, Odemir Martinez

A novel cryptography method based on the Lorenz's attractor chaotic system is presented. The proposed algorithm is secure and fast, making it practical for general use. We introduce the chaotic operation mode, which provides an interaction among the password, message and a chaotic system. It ensures that the algorithm yields a secure codification, even if the nature of the chaotic system is known. The algorithm has been implemented in two versions: one sequential and slow and the other, parallel and fast. Our algorithm assures the integrity of the ciphertext (we know if it has been altered, which is not assured by traditional algorithms) and consequently its authenticity. Numerical experiments are presented, discussed and show the behavior of the method in terms of security and performance. The fast version of the algorithm has a performance comparable to AES, a popular cryptography program used commercially nowadays, but it is more secure, which makes it immediately suitable for general purpose cryptography applications. An internet page has been set up, which enables the readers to test the algorithm and also to try to break into the cipher.
Full-f version of GENE for turbulence in open-field-line systems

NASA Astrophysics Data System (ADS)

Pan, Q.; Told, D.; Shi, E. L.; Hammett, G. W.; Jenko, F.

2018-06-01

Unique properties of plasmas in the tokamak edge, such as large amplitude fluctuations and plasma-wall interactions in the open-field-line regions, require major modifications of existing gyrokinetic codes originally designed for simulating core turbulence. To this end, the global version of the 3D2V gyrokinetic code GENE, so far employing a δf-splitting technique, is extended to simulate electrostatic turbulence in straight open-field-line systems. The major extensions are the inclusion of the velocity-space nonlinearity, the development of a conducting-sheath boundary, and the implementation of the Lenard-Bernstein collision operator. With these developments, the code can be run as a full-f code and can handle particle loss to and reflection from the wall. The extended code is applied to modeling turbulence in the Large Plasma Device (LAPD), with a reduced mass ratio and a much lower collisionality. Similar to turbulence in a tokamak scrape-off layer, LAPD turbulence involves collisions, parallel streaming, cross-field turbulent transport with steep profiles, and particle loss at the parallel boundary.
Cross-Approximate Entropy parallel computation on GPUs for biomedical signal analysis. Application to MEG recordings.

PubMed

Martínez-Zarzuela, Mario; Gómez, Carlos; Díaz-Pernas, Francisco Javier; Fernández, Alberto; Hornero, Roberto

2013-10-01

Cross-Approximate Entropy (Cross-ApEn) is a useful measure to quantify the statistical dissimilarity of two time series. In spite of the advantage of Cross-ApEn over its one-dimensional counterpart (Approximate Entropy), only a few studies have applied it to biomedical signals, mainly due to its high computational cost. In this paper, we propose a fast GPU-based implementation of Cross-ApEn that makes its use feasible over a large amount of multidimensional data. The scheme followed is fully scalable and thus maximizes the use of the GPU regardless of the number of neural signals being processed. The approach consists of processing many trials or epochs simultaneously, regardless of their origin. In the case of MEG data, these trials can come from different input channels or subjects. The proposed implementation achieves an average speedup greater than 250× against a CPU parallel version running on a processor containing six cores. A dataset of 30 subjects containing 148 MEG channels (49 epochs of 1024 samples per channel) can be analyzed using our development in about 30 min. The same processing takes 5 days on six cores and 15 days when running on a single core. The speedup is much larger if compared to a basic sequential Matlab® implementation, which would need 58 days per subject. To our knowledge, this is the first contribution of Cross-ApEn measure computation using GPUs. This study demonstrates that this hardware is, to date, the best option for the signal processing of biomedical data with Cross-ApEn. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
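
To make the computational cost mentioned in the abstract above concrete, here is a minimal serial sketch of Cross-ApEn following its standard textbook definition (template length m, tolerance r, Chebyshev distance). It is not the authors' GPU code: the function name `cross_apen` and the default values are assumptions, and the inputs are assumed to be already standardised.

```python
import math


def cross_apen(u, v, m=2, r=0.2):
    """Cross-ApEn(m, r) of two equal-length series (illustrative sketch)."""
    n = len(u)
    if len(v) != n or n <= m + 1:
        raise ValueError("series must have equal length greater than m + 1")

    def phi(k):
        count = n - k + 1          # number of templates of length k
        total = 0.0
        for i in range(count):
            matches = 0
            for j in range(count):
                # Chebyshev distance between template i of u and j of v.
                dist = max(abs(u[i + t] - v[j + t]) for t in range(k))
                if dist <= r:
                    matches += 1
            if matches == 0:
                # log(0) is undefined; the measure needs a larger r here.
                raise ValueError("no template match for this r")
            total += math.log(matches / count)
        return total / count

    return phi(m) - phi(m + 1)


print(cross_apen([0, 1] * 10, [0, 1] * 10))
```

Every template of one series is compared with every template of the other, so the work grows roughly with the square of the series length for each pair of signals. The comparisons are independent of one another, which is what makes the measure a natural candidate for massively parallel hardware.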

Medical students as learners: transforming the resident-level microskills of teaching into a parallel curriculum for medical students to aid the transition from classroom to OB/GYN clerkship.

PubMed

Amorosa, Jennifer M H; Graham, Mark J; Ratan, Rini B

2012-01-01

The objective of the study was to describe and assess a brief curricular intervention designed to help medical students adopt active learning strategies. Based on student interest, we created a one-hour workshop that focused on seven microskills of learning and presented it to our medical students during their Obstetrics and Gynecology clerkship. The workshop utilized a modified version of the "Five-Step 'Microskills' Model of Clinical Teaching" first described by Neher in 1992 and paralleled the model our residents are taught as part of their "Resident-as-Teacher" curriculum. Students were surveyed at various time points following the workshop to evaluate the perceived usefulness, value, and durability of the skills taught. Immediate postworkshop feedback was favorable with 93% of students expecting to use the skills taught. At the end of the rotation, students reported a significant increase in usage of each microskill via a retrospective pre/postquestionnaire. While response rates at 1, 3, and 6 months after the rotation were moderate, the majority of the students responding stated that they had utilized the microskills. In its pilot year, the Microskills of Learning workshop was a beneficial addition to our clinical clerkship curriculum. By utilizing a parallel curriculum to that of our residents, the workshop mutually enhanced the educational process by encouraging teachers and learners to speak the same language.
Robust Segmentation of Overlapping Cells in Histopathology Specimens Using Parallel Seed Detection and Repulsive Level Set

PubMed Central

Qi, Xin; Xing, Fuyong; Foran, David J.; Yang, Lin

2013-01-01

Automated image analysis of histopathology specimens could potentially provide support for early detection and improved characterization of breast cancer. Automated segmentation of the cells comprising imaged tissue microarrays (TMA) is a prerequisite for any subsequent quantitative analysis. Unfortunately, crowding and overlapping of cells present significant challenges for most traditional segmentation algorithms. In this paper, we propose a novel algorithm which can reliably separate touching cells in hematoxylin stained breast TMA specimens which have been acquired using a standard RGB camera. The algorithm is composed of two steps. It begins with a fast, reliable object center localization approach which utilizes single-path voting followed by mean-shift clustering. Next, the contour of each cell is obtained using a level set algorithm based on an interactive model. We compared the experimental results with those reported in the most current literature. Finally, performance was evaluated by comparing the pixel-wise accuracy provided by human experts with that produced by the new automated segmentation algorithm. The method was systematically tested on 234 image patches exhibiting dense overlap and containing more than 2200 cells. It was also tested on whole slide images including blood smears and tissue microarrays containing thousands of cells. Since the voting step of the seed detection algorithm is well suited for parallelization, a parallel version of the algorithm was implemented using graphic processing units (GPU) which resulted in significant speed-up over the C/C++ implementation. PMID:22167559
Xyce parallel electronic simulator design.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Thornquist, Heidi K.; Rankin, Eric Lamont; Mei, Ting

2010-09-01

This document is the Xyce Circuit Simulator developer guide. Xyce has been designed from the 'ground up' to be a SPICE-compatible, distributed memory parallel circuit simulator. While it is in many respects a research code, Xyce is intended to be a production simulator. As such, having software quality engineering (SQE) procedures in place to ensure a high level of code quality and robustness is essential. Version control, issue tracking, customer support, C++ style guidelines and the Xyce release process are all described. The Xyce Parallel Electronic Simulator has been under development at Sandia since 1999. Historically, Xyce has mostly been funded by ASC, and the original focus of Xyce development has primarily been related to circuits for nuclear weapons. However, this has not been the only focus and it is expected that the project will diversify. Like many ASC projects, Xyce is a group development effort, which involves a number of researchers, engineers, scientists, mathematicians and computer scientists. In addition to diversity of background, it is to be expected on long term projects for there to be a certain amount of staff turnover, as people move on to different projects. As a result, it is very important that the project maintain high software quality standards. The point of this document is to formally document a number of the software quality practices followed by the Xyce team in one place. Also, it is hoped that this document will be a good source of information for new developers.
Fast experiments for structure elucidation of small molecules: Hadamard NMR with multiple receivers.

PubMed

Gierth, Peter; Codina, Anna; Schumann, Frank; Kovacs, Helena; Kupče, Ēriks

2015-11-01

We propose several significant improvements to the PANSY (Parallel NMR SpectroscopY) experiments-PANSY COSY and PANSY-TOCSY. The improved versions of these experiments provide sufficient spectral information for structure elucidation of small organic molecules from just two 2D experiments. The PANSY-TOCSY-Q experiment has been modified to allow for simultaneous acquisition of three different types of NMR spectra-1D C-13 of non-protonated carbon sites, 2D TOCSY and multiplicity edited 2D HETCOR. In addition the J-filtered 2D PANSY-gCOSY experiment records a 2D HH gCOSY spectrum in parallel with a (1) J-filtered HC long-range HETCOR spectrum as well as offers a simplified data processing. In addition to parallel acquisition, further time savings are feasible because of significantly smaller F1 spectral windows as compared to the indirect detection experiments. Use of cryoprobes and multiple receivers can significantly alleviate the sensitivity issues that are usually associated with the so called direct detection experiments. In cases where experiments are sampling limited rather than sensitivity limited further reduction of experiment time is achieved by using Hadamard encoding. In favorable cases the total recording time for the two PANSY experiments can be reduced to just 40 s. The proposed PANSY experiments provide sufficient information to allow the CMCse software package (Bruker) to solve structures of small organic molecules. Copyright © 2015 John Wiley & Sons, Ltd.
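The Hadamard encoding mentioned in the abstract above can be illustrated with a small sketch. It assumes N signals are observed in N scans, each scan combining the signals with the sign pattern of one row of a Hadamard matrix, and that the individual signals are recovered by applying the same matrix and dividing by N. The function names and the toy numbers are hypothetical and are not taken from the paper or from any spectrometer software.

```python
def hadamard(n):
    """Build an n x n Hadamard matrix by the Sylvester construction.

    n must be a power of two (an assumption of this sketch).
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def encode(signals, h):
    """Each scan is the signed sum of all signals, one scan per matrix row."""
    return [sum(sign * s for sign, s in zip(row, signals)) for row in h]


def decode(scans, h):
    """Recover the individual signals using H^T H = n * I."""
    n = len(h)
    return [sum(h[i][j] * scans[i] for i in range(n)) / n for j in range(n)]


if __name__ == "__main__":
    h4 = hadamard(4)
    toy_signals = [3.0, 0.0, 1.5, 2.0]  # hypothetical intensities
    print(decode(encode(toy_signals, h4), h4))
```

Every scan contains contributions from all signals, which is why a Hadamard scheme needs only as many scans as there are signals of interest.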
PPM Receiver Implemented in Software

NASA Technical Reports Server (NTRS)

Gray, Andrew; Kang, Edward; Lay, Norman; Vilnrotter, Victor; Srinivasan, Meera; Lee, Clement

2010-01-01

A computer program has been written as a tool for developing optical pulse-position-modulation (PPM) receivers in which photodetector outputs are fed to analog-to-digital converters (ADCs) and all subsequent signal processing is performed digitally. The program can be used, for example, to simulate an all-digital version of the PPM receiver described in Parallel Processing of Broad-Band PPM Signals (NPO-40711), which appears elsewhere in this issue of NASA Tech Briefs. The program can also be translated into a design for digital PPM receiver hardware. The most notable innovation embodied in the software and the underlying PPM-reception concept is a digital processing subsystem that performs synchronization of PPM time slots, even though the digital processing is, itself, asynchronous in the sense that no attempt is made to synchronize it with the incoming optical signal a priori and there is no feedback to analog signal processing subsystems or ADCs. Functions performed by the software receiver include time-slot synchronization, symbol synchronization, coding preprocessing, and diagnostic functions. The program is written in the MATLAB and Simulink software system. The software receiver is highly parameterized and, hence, programmable: for example, slot- and symbol-synchronization filters have programmable bandwidths.
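As a minimal illustration of pulse-position modulation itself, the following sketch maps each symbol to a single pulse in one of M time slots and demodulates by picking the strongest slot in each symbol period. It assumes one sample per slot and slot and symbol boundaries that are already known, which is exactly what the receiver described above has to estimate; it is not the MATLAB/Simulink program, and the function names are hypothetical.

```python
def ppm_modulate(symbols, m):
    """Map each symbol in range(m) to m slot samples with a single pulse."""
    samples = []
    for s in symbols:
        if not 0 <= s < m:
            raise ValueError("symbol out of range")
        slots = [0.0] * m
        slots[s] = 1.0  # the pulse position carries the symbol value
        samples.extend(slots)
    return samples


def ppm_demodulate(samples, m):
    """Pick the slot with the largest sample in each m-slot symbol period.

    Assumes the samples are already aligned to symbol boundaries.
    """
    if len(samples) % m:
        raise ValueError("sample count must be a multiple of m")
    symbols = []
    for start in range(0, len(samples), m):
        block = samples[start:start + m]
        symbols.append(max(range(m), key=block.__getitem__))
    return symbols


if __name__ == "__main__":
    tx = [0, 3, 1, 2]
    print(ppm_demodulate(ppm_modulate(tx, 4), 4))
```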
New technique for real-time distortion-invariant multiobject recognition and classification

NASA Astrophysics Data System (ADS)

Hong, Rutong; Li, Xiaoshun; Hong, En; Wang, Zuyi; Wei, Hongan

2001-04-01

A real-time hybrid distortion-invariant OPR system was established to perform 3D multiobject distortion-invariant automatic pattern recognition. A wavelet transform technique was used for digital preprocessing of the input scene, to suppress the noisy background and enhance the object to be recognized. A three-layer backpropagation artificial neural network was used in correlation signal post-processing to perform multiobject distortion-invariant recognition and classification. The C-80 and NOA real-time processing ability and the multithread programming technology were used to perform high speed parallel multitask processing and speed up the post-processing rate to ROIs. The reference filter library (RFL) was constructed for the distorted versions of 3D object model images based on measured tolerances for distortion parameters such as rotation, azimuth and scale. The real-time optical correlation recognition testing of this OPR system demonstrates that, using the preprocessing, post-processing, the nonlinear algorithm of optimum filtering, the RFL construction technique and the multithread programming technology, a high probability of recognition and a high recognition rate were obtained for the real-time multiobject distortion-invariant OPR system. The recognition reliability and rate were improved greatly. These techniques are very useful for automatic target recognition.
Global 30m Height Above the Nearest Drainage

NASA Astrophysics Data System (ADS)

Donchyts, Gennadii; Winsemius, Hessel; Schellekens, Jaap; Erickson, Tyler; Gao, Hongkai; Savenije, Hubert; van de Giesen, Nick

2016-04-01

Variability of the Earth surface is the primary characteristic affecting the flow of surface and subsurface water. Digital elevation models, usually represented as height maps above some well-defined vertical datum, are widely used to compute hydrologic parameters such as local flow directions, drainage area, drainage network pattern, and many others. Usually, it requires a significant effort to derive these parameters at a global scale. One hydrological characteristic introduced in the last decade is Height Above the Nearest Drainage (HAND): a digital elevation model normalized using nearest drainage. This parameter has been shown to be useful for many hydrological and more general purpose applications, such as landscape hazard mapping, landform classification, remote sensing and rainfall-runoff modeling. One of the essential characteristics of HAND is its ability to capture heterogeneities in local environments, difficult to measure or model otherwise. While many applications of HAND were published in the academic literature, no studies analyze its variability on a global scale, especially, using higher resolution DEMs, such as the new, one arc-second (approximately 30m) resolution version of SRTM. In this work, we will present the first global version of HAND computed using a mosaic of two DEMs: 30m SRTM and Viewfinderpanorama DEM (90m). The lower resolution DEM was used to cover latitudes above 60 degrees north and below 56 degrees south where SRTM is not available. We compute HAND using the unmodified version of the input DEMs to ensure consistency with the original elevation model. We have parallelized processing by generating a homogenized, equal-area version of HydroBASINS catchments. The resulting catchment boundaries were used to perform processing using 30m resolution DEM. To compute HAND, a new version of D8 local drainage directions as well as flow accumulation were calculated.
The latter was used to estimate river head by incorporating fixed and variable thresholding methods. The resulting HAND dataset was analyzed regarding its spatial variability and to assess the global distribution of the main landform types: valley, ecotone, slope, and plateau. The method used to compute HAND was implemented using PCRaster software, running on Google Compute Engine platform running under Ubuntu Linux. The Google Earth Engine was used to perform mosaicing and clipping of the original DEMs as well as to provide access to the final product. The effort took about three months of computing time on eight core CPU virtual machine.
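The idea behind HAND can be sketched on a toy grid: follow D8 steepest-descent directions from each cell until a drainage cell is reached, then subtract that drainage cell's elevation. This is a simplified illustration, not the authors' PCRaster workflow. In particular, the drainage cells are passed in directly as a hypothetical set, whereas the abstract derives river heads from flow accumulation thresholds, and unit cell spacing is assumed.

```python
import math

# Offsets of the eight neighbours used by the D8 scheme.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]


def d8_downstream(dem, r, c):
    """Return the neighbour with the steepest downward slope, or None."""
    best, best_slope = None, 0.0
    for dr, dc in NEIGHBORS:
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(dem) and 0 <= cc < len(dem[0]):
            # Diagonal neighbours are farther away (unit spacing assumed).
            slope = (dem[r][c] - dem[rr][cc]) / math.hypot(dr, dc)
            if slope > best_slope:
                best, best_slope = (rr, cc), slope
    return best


def hand(dem, drainage):
    """Height above the drainage cell reached along the D8 flow path.

    Cells whose flow path ends in a pit outside `drainage` get None.
    """
    result = [[None] * len(dem[0]) for _ in dem]
    for r in range(len(dem)):
        for c in range(len(dem[0])):
            cell = (r, c)
            # Each step strictly descends, so the walk always terminates.
            while cell is not None and cell not in drainage:
                cell = d8_downstream(dem, *cell)
            if cell is not None:
                result[r][c] = dem[r][c] - dem[cell[0]][cell[1]]
    return result


if __name__ == "__main__":
    toy_dem = [[5, 4, 3],
               [4, 3, 2],
               [3, 2, 1]]
    for row in hand(toy_dem, {(2, 2)}):
        print(row)
```

Because each cell's value depends only on its own flow path, catchments can be processed independently, which is consistent with the catchment-based parallelization described in the abstract.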
Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications

PubMed Central

2014-01-01

Background The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. Its great versatility makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However, the memory and the execution time it requires are of order O(n^3) and O(n^5), respectively, so the algorithm is unaffordable for huge data sets. Results We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different.
According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of the communications between different memories (RAM, Cache, Mass, Virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as a SPMD system together with a Message-Passing Programming Paradigm. Here, we adopt the high-level message-passing system MPI (Message Passing Interface) in the version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of a communication protocol design among the processors involved. Efficient synchronization design is also discussed. Conclusions In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure of Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice-junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach the best values within about 90 processors. Then the parallelization advantage is balanced by the greater cost of non-local communications between the processors.
A similar behaviour is evident on HS3D, but at a greater number of processors, so evidencing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one. PMID:25077818
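The partitioning stage of an SPMD program, in which every process runs the same code on its own slice of the work, can be sketched as a block partition. This is a generic illustration in plain Python rather than MPJ, and it does not reproduce the U-BRAIN decomposition: which items are divided among the processors is not stated in the abstract, so `n_items`, `n_ranks`, and `rank` are hypothetical names.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open index range [start, end) owned by `rank`.

    Items are split as evenly as possible; the first `n_items % n_ranks`
    ranks receive one extra item each.
    """
    if not 0 <= rank < n_ranks:
        raise ValueError("rank out of range")
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size


if __name__ == "__main__":
    # In a real SPMD run each process would evaluate only its own rank.
    for rank in range(3):
        print(rank, block_range(10, 3, rank))
```

With a fixed problem size, adding processes shrinks each block while the communication needed between blocks does not shrink in the same way, which fits the speed-up plateau the abstract reports.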
Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications.

PubMed

D'Angelo, Gianni; Rampone, Salvatore

2014-01-01

The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. Its great versatility makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However, the memory and the execution time it requires are of order O(n^3) and O(n^5), respectively, so the algorithm is unaffordable for huge data sets. We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different.
According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of the communications between different memories (RAM, Cache, Mass, Virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as a SPMD system together with a Message-Passing Programming Paradigm. Here, we adopt the high-level message-passing system MPI (Message Passing Interface) in the version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of a communication protocol design among the processors involved. Efficient synchronization design is also discussed. In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure of Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice-junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach the best values within about 90 processors. Then the parallelization advantage is balanced by the greater cost of non-local communications between the processors.
A similar behaviour is evident on HS3D, but at a greater number of processors, so evidencing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one.
Towards seamless workflows in agile data science

NASA Astrophysics Data System (ADS)

Klump, J. F.; Robertson, J.

2017-12-01

Agile workflows are a response to projects with requirements that may change over time. They prioritise rapid and flexible responses to change, preferring to adapt to changes in requirements rather than predict them before a project starts. This suits the needs of research very well because research is inherently agile in its methodology. The adoption of agile methods has made collaborative data analysis much easier in a research environment fragmented across institutional data stores, HPC, personal and lab computers and more recently cloud environments. Agile workflows use tools that share a common worldview: in an agile environment, there may be more than one valid version of data, code or environment in play at any given time. All of these versions need references and identifiers. For example, a team of developers following the git-flow conventions (github.com/nvie/gitflow) may have several active branches, one for each strand of development. These workflows allow rapid and parallel iteration while maintaining identifiers pointing to individual snapshots of data and code and allowing rapid switching between strands. In contrast, the current focus of versioning in research data management is geared towards managing data for reproducibility and long-term preservation of the record of science. While both are important goals in the persistent curation domain of the institutional research data infrastructure, current tools emphasise planning over adaptation and can introduce unwanted rigidity by insisting on a single valid version or point of truth. In the collaborative curation domain of a research project, things are more fluid. However, there is no equivalent to the "versioning iso-surface" of the git protocol for the management and versioning of research data.
At CSIRO we are developing concepts and tools for the agile management of software code and research data for virtual research environments, based on our experiences of actual data analytics projects in the geosciences. We use code management that allows researchers to interact with the code through tools like Jupyter Notebooks while data are held in an object store. Our aim is an architecture allowing seamless integration of code development, data management, and data processing in virtual research environments.
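The notion of several valid versions in play at once, each with its own identifier, can be sketched with content-derived identifiers and named branches. This is an assumption-laden illustration in the spirit of content-addressed version control, not the CSIRO tooling described above; the `VersionStore` class and its methods are hypothetical.

```python
import hashlib


def snapshot_id(content: bytes) -> str:
    """Derive an identifier from the content itself (SHA-256 hex digest)."""
    return hashlib.sha256(content).hexdigest()


class VersionStore:
    """Toy object store: snapshots keyed by identifier, branches as pointers."""

    def __init__(self):
        self.objects = {}   # identifier -> content
        self.branches = {}  # branch name -> identifier of its latest snapshot

    def commit(self, branch: str, content: bytes) -> str:
        """Store a snapshot and move the branch pointer to it."""
        oid = snapshot_id(content)
        self.objects[oid] = content
        self.branches[branch] = oid
        return oid

    def checkout(self, branch: str) -> bytes:
        """Return the snapshot a branch currently points to."""
        return self.objects[self.branches[branch]]


if __name__ == "__main__":
    store = VersionStore()
    store.commit("cleaning", b"station,temp\nA,1.0\n")
    store.commit("gap-filling", b"station,temp\nA,1.0\nB,2.5\n")
    # Both strands remain valid and individually addressable.
    print(sorted(store.branches))
```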
Synthesizing parallel imaging applications using the CAP (computer-aided parallelization) tool

NASA Astrophysics Data System (ADS)

Gennart, Benoit A.; Mazzariol, Marc; Messerli, Vincent; Hersch, Roger D.

1997-12-01

Imaging applications such as filtering, image transforms and compression/decompression require vast amounts of computing power when applied to large data sets. These applications would potentially benefit from the use of parallel processing. However, dedicated parallel computers are expensive and their processing power per node lags behind that of the most recent commodity components. Furthermore, developing parallel applications remains a difficult task: writing and debugging the application is difficult (deadlocks), programs may not be portable from one parallel architecture to the other, and performance often comes short of expectations. In order to facilitate the development of parallel applications, we propose the CAP computer-aided parallelization tool which enables application programmers to specify at a high-level of abstraction the flow of data between pipelined-parallel operations. In addition, the CAP tool supports the programmer in developing parallel imaging and storage operations. CAP enables combining efficiently parallel storage access routines and image processing sequential operations. This paper shows how processing and I/O intensive imaging applications must be implemented to take advantage of parallelism and pipelining between data access and processing. This paper's contribution is (1) to show how such implementations can be compactly specified in CAP, and (2) to demonstrate that CAP specified applications achieve the performance of custom parallel code. The paper analyzes theoretically the performance of CAP specified applications and demonstrates the accuracy of the theoretical analysis through experimental measurements.
Update on Development of Mesh Generation Algorithms in MeshKit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jain, Rajeev; Vanderzee, Evan; Mahadevan, Vijay

2015-09-30

MeshKit uses a graph-based design for coding all its meshing algorithms, which includes the Reactor Geometry (and mesh) Generation (RGG) algorithms. This report highlights the developmental updates of all the algorithms, results and future work. Parallel versions of algorithms, documentation and performance results are reported. RGG GUI design was updated to incorporate new features requested by the users; boundary layer generation and parallel RGG support were added to the GUI. Key contributions to the release, upgrade and maintenance of other SIGMA1 libraries (CGM and MOAB) were made. Several fundamental meshing algorithms for creating a robust parallel meshing pipeline in MeshKit are under development. Results and current status of automated, open-source and high quality nuclear reactor assembly mesh generation algorithms such as trimesher, quadmesher, interval matching and multi-sweeper are reported.
The Goddard Space Flight Center Program to develop parallel image processing systems

NASA Technical Reports Server (NTRS)

Schaefer, D. H.

1972-01-01

Parallel image processing which is defined as image processing where all points of an image are operated upon simultaneously is discussed. Coherent optical, noncoherent optical, and electronic methods are considered parallel image processing techniques.
Linking Chaotic Advection with Subsurface Biogeochemical Processes

NASA Astrophysics Data System (ADS)

Mays, D. C.; Freedman, V. L.; White, S. K.; Fang, Y.; Neupauer, R.

2017-12-01

This work investigates the extent to which groundwater flow kinematics drive subsurface biogeochemical processes. In terms of groundwater flow kinematics, we consider chaotic advection, whose essential ingredient is stretching and folding of plumes. Chaotic advection is appealing within the context of groundwater remediation because it has been shown to optimize plume spreading in the laminar flows characteristic of aquifers. In terms of subsurface biogeochemical processes, we consider an existing model for microbially-mediated reduction of relatively mobile uranium(VI) to relatively immobile uranium(IV) following injection of acetate into a floodplain aquifer beneath a former uranium mill in Rifle, Colorado. This model has been implemented in the reactive transport code eSTOMP, the massively parallel version of STOMP (Subsurface Transport Over Multiple Phases). This presentation will report preliminary numerical simulations in which the hydraulic boundary conditions in the eSTOMP model are manipulated to simulate chaotic advection resulting from engineered injection and extraction of water through a manifold of wells surrounding the plume of injected acetate. This approach provides an avenue to simulate the impact of chaotic advection within the existing framework of the eSTOMP code.
GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration.

PubMed

Sharp, G C; Kandasamy, N; Singh, H; Folkert, M

2007-10-07

This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup--up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
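The accuracy comparison reported above rests on root-mean-square differences between GPU and CPU results. A minimal sketch of that metric follows, assuming the usual definition over corresponding voxels; the abstract does not spell out the exact formula the authors used, and the function name and sample values are hypothetical.

```python
import math


def rms_difference(a, b):
    """Root-mean-square difference between two equally sized sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    cpu_voxels = [100.0, 102.0, 98.0, 101.0]  # toy values in Hounsfield units
    gpu_voxels = [100.0, 102.0, 98.0, 101.0]
    print(rms_difference(cpu_voxels, gpu_voxels))
```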
The Snow Data System at NASA JPL

NASA Astrophysics Data System (ADS)

Laidlaw, R.; Painter, T. H.; Mattmann, C. A.; Ramirez, P.; Bormann, K.; Brodzik, M. J.; Burgess, A. B.; Rittger, K.; Goodale, C. E.; Joyce, M.; McGibbney, L. J.; Zimdars, P.

2014-12-01

NASA JPL's Snow Data System has a data-processing pipeline powered by Apache OODT, an open source software tool. The pipeline has been running for several years and has successfully generated a significant amount of cryosphere data, including MODIS-based products such as MODSCAG, MODDRFS and MODICE, with historical and near-real time windows and covering regions such as the Arctic, Western US, Alaska, Central Europe, Asia, South America, Australia and New Zealand. The team continues to improve the pipeline, using monitoring tools such as Ganglia to give an overview of operations, and improving fault-tolerance with automated recovery scripts. Several alternative adaptations of the Snow Covered Area and Grain size (SCAG) algorithm are being investigated. These include using VIIRS and Landsat TM/ETM+ satellite data as inputs. Parallel computing techniques are being considered for core SCAG processing, such as using the PyCUDA Python API to utilize multi-core GPU architectures. An experimental version of MODSCAG is also being developed for the Google Earth Engine platform, a cloud-based service.
A Secure Web Application Providing Public Access to High-Performance Data Intensive Scientific Resources - ScalaBLAST Web Application

DOE Office of Scientific and Technical Information (OSTI.GOV)

Curtis, Darren S.; Peterson, Elena S.; Oehmen, Chris S.

2008-05-04

This work presents the ScalaBLAST Web Application (SWA), a web based application implemented using the PHP script language, MySQL DBMS, and Apache web server under a GNU/Linux platform. SWA is an application built as part of the Data Intensive Computer for Complex Biological Systems (DICCBS) project at the Pacific Northwest National Laboratory (PNNL). SWA delivers accelerated throughput of bioinformatics analysis via high-performance computing through a convenient, easy-to-use web interface. This approach greatly enhances emerging fields of study in biology such as ontology-based homology, and multiple whole genome comparisons which, in the absence of a tool like SWA, require a heroic effort to overcome the computational bottleneck associated with genome analysis. The current version of SWA includes a user account management system, a web based user interface, and a backend process that generates the files necessary for the Internet scientific community to submit a ScalaBLAST parallel processing job on a dedicated cluster.
Using EnergyPlus for California Title-24 compliance calculations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huang, Joe; Bourassa, Norman; Buhl, Fred

2006-08-26

For the past decade, the non-residential portion of California's Title-24 building energy standard has relied on DOE-2.1E as the reference computer simulation program for development as well as compliance. However, starting in 2004, the California Energy Commission has been evaluating the possible use of Energy Plus as the reference program in future revisions of Title-24. As part of this evaluation, the authors converted the Alternate Compliance Method (ACM) certification test suite of 150 DOE-2 files to Energy Plus, and made parallel DOE-2 and Energy Plus runs for this extensive set of test cases. A customized version of DOE-2.1E named doe2ep was developed to automate the conversion process. This paper describes this conversion process, including the difficulties in establishing an apples-to-apples comparison between the two programs, and summarizes how the DOE-2 and Energy Plus results compare for the ACM test cases.
Accelerating Virtual High-Throughput Ligand Docking: current technology and case study on a petascale supercomputer.

PubMed

Ellingson, Sally R; Dakshanamurthy, Sivanesan; Brown, Milton; Smith, Jeremy C; Baudry, Jerome

2014-04-25

In this paper we give the current state of high-throughput virtual screening. We describe a case study of using a task-parallel MPI (Message Passing Interface) version of Autodock4 [1], [2] to run a virtual high-throughput screen of one-million compounds on the Jaguar Cray XK6 Supercomputer at Oak Ridge National Laboratory. We include a description of scripts developed to increase the efficiency of the predocking file preparation and postdocking analysis. A detailed tutorial, scripts, and source code for this MPI version of Autodock4 are available online at http://www.bio.utk.edu/baudrylab/autodockmpi.htm.
Initial singularity and pure geometric field theories

NASA Astrophysics Data System (ADS)

Wanas, M. I.; Kamal, Mona M.; Dabash, Tahia F.

2018-01-01

In the present article we use a modified version of the geodesic equation, together with a modified version of the Raychaudhuri equation, to study initial singularities. These modified equations are used to account for the effect of the spin-torsion interaction on the existence of initial singularities in cosmological models. Such models are the results of solutions of the field equations of a class of field theories termed pure geometric. The geometric structure used in this study is an absolute parallelism structure satisfying the cosmological principle. It is shown that the existence of initial singularities is subject to some mathematical (geometric) conditions. The scheme suggested for this study can be easily generalized.

Shape detection of Gaborized outline versions of everyday objects

PubMed Central

Sassi, Michaël; Machilsen, Bart; Wagemans, Johan

2012-01-01

We previously tested the identifiability of six versions of Gaborized outlines of everyday objects, differing in the orientations assigned to elements inside and outside the outline. We found significant differences in identifiability between the versions, and related a number of stimulus metrics to identifiability [Sassi, M., Vancleef, K., Machilsen, B., Panis, S., & Wagemans, J. (2010). Identification of everyday objects on the basis of Gaborized outline versions. i-Perception, 1(3), 121–142]. In this study, after retesting the identifiability of new variants of three of the stimulus versions, we tested their robustness to local orientation jitter in a detection experiment. In general, our results replicated the key findings from the previous study, and allowed us to substantiate our earlier interpretations of the effects of our stimulus metrics and of the performance differences between the different stimulus versions. The results of the detection task revealed a different ranking order of stimulus versions than the identification task. By examining the parallels and differences between the effects of our stimulus metrics in the two tasks, we found evidence for a trade-off between shape detectability and identifiability. The generally simple and smooth shapes that yield the strongest contour integration and most robust detectability tend to lack the distinguishing features necessary for clear-cut identification. Conversely, contours that do contain such identifying features tend to be inherently more complex and, therefore, yield weaker integration and less robust detectability. PMID:23483752
Advanced lithographic filtration and contamination control for 14nm node and beyond semiconductor processes

NASA Astrophysics Data System (ADS)

Varanasi, Rao; Mesawich, Michael; Connor, Patrick; Johnson, Lawrence

2017-03-01

Two versions of a specific 2nm rated filter containing filtration medium and all other components produced from high density polyethylene (HDPE), one subjected to standard cleaning, the other to specialized ultra-cleaning, were evaluated in terms of their cleanliness characteristics, and also defectivity of wafers processed with photoresist filtered through each. With respect to inherent cleanliness, the ultraclean version exhibited a 70% reduction in total metal extractables and 90% reduction in organics extractables compared to the standard clean version. In terms of particulate cleanliness, the ultraclean version achieved stability of effluent particles 30nm and larger in about half the time required by the standard clean version, also exhibiting effluent levels at stability almost 90% lower. In evaluating defectivity of blanket wafers processed with photoresist filtered through either version, initial defect density while using the ultraclean version was about half that observed when the standard clean version was in service, with defectivity also falling more rapidly during subsequent usage of the ultraclean version compared to the standard clean version. Similar behavior was observed for patterned wafers, where the enhanced defect reduction was primarily of bridging defects. The filter evaluation and actual process-oriented results demonstrate the extreme value in using filtration designed to possess the optimal intrinsic characteristics, but with further improvements possible through enhanced cleaning processes.
Serial multiplier arrays for parallel computation

NASA Technical Reports Server (NTRS)

Winters, Kel

1990-01-01

Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal is introduced for this application.
An MPI-IO interface to HPSS

NASA Technical Reports Server (NTRS)

Jones, Terry; Mark, Richard; Martin, Jeanne; May, John; Pierce, Elsie; Stanberry, Linda

1996-01-01

This paper describes an implementation of the proposed MPI-IO (Message Passing Interface - Input/Output) standard for parallel I/O. Our system uses third-party transfer to move data over an external network between the processors where it is used and the I/O devices where it resides. Data travels directly from source to destination, without the need for shuffling it among processors or funneling it through a central node. Our distributed server model lets multiple compute nodes share the burden of coordinating data transfers. The system is built on the High Performance Storage System (HPSS), and a prototype version runs on a Meiko CS-2 parallel computer.
Thread concept for automatic task parallelization in image analysis

NASA Astrophysics Data System (ADS)

Lueckenhaus, Maximilian; Eckstein, Wolfgang

1998-09-01

Parallel processing of image analysis tasks is an essential method to speed up image processing and helps to exploit the full capacity of distributed systems. However, writing parallel code is a difficult and time-consuming process and often leads to an architecture-dependent program that has to be re-implemented when changing the hardware. Therefore it is highly desirable to do the parallelization automatically. For this we have developed a special kind of thread concept for image analysis tasks. Threads derived from one subtask may share objects and run in the same context but may process different threads of execution and work on different data in parallel. In this paper we describe the basics of our thread concept and show how it can be used as the basis of an automatic task parallelization to speed up image processing. We further illustrate the design and implementation of an agent-based system that uses image analysis threads for generating and processing parallel programs by taking into account the available hardware. The tests made with our system prototype show that the thread concept combined with the agent paradigm is suitable to speed up image processing by an automatic parallelization of image analysis tasks.
Parallel multireference configuration interaction calculations on mini-β-carotenes and β-carotene

NASA Astrophysics Data System (ADS)

Kleinschmidt, Martin; Marian, Christel M.; Waletzke, Mirko; Grimme, Stefan

2009-01-01

We present a parallelized version of a direct selecting multireference configuration interaction (MRCI) code [S. Grimme and M. Waletzke, J. Chem. Phys. 111, 5645 (1999)]. The program can be run either in ab initio mode or as semiempirical procedure combined with density functional theory (DFT/MRCI). We have investigated the efficiency of the parallelization in case studies on carotenoids and porphyrins. The performance is found to depend heavily on the cluster architecture. While the speed-up on the older Intel Netburst technology is close to linear for up to 12-16 processes, our results indicate that it is not favorable to use all cores of modern Intel Dual Core or Quad Core processors simultaneously for memory intensive tasks. Due to saturation of the memory bandwidth, we recommend to run less demanding tasks on the latter architectures in parallel to two (Dual Core) or four (Quad Core) MRCI processes per node. The DFT/MRCI branch has been employed to study the low-lying singlet and triplet states of mini-n-β-carotenes (n =3, 5, 7, 9) and β-carotene (n =11) at the geometries of the ground state, the first excited triplet state, and the optically bright singlet state. The order of states depends heavily on the conjugation length and the nuclear geometry. The B1u+ state constitutes the S1 state in the vertical absorption spectrum of mini-3-β-carotene but switches order with the 2 A1g- state upon excited state relaxation. In the longer carotenes, near degeneracy or even root flipping between the B1u+ and B1u- states is observed whereas the 3 A1g- state is found to remain energetically above the optically bright B1u+ state at all nuclear geometries investigated here. The DFT/MRCI method is seen to underestimate the absolute excitation energies of the longer mini-β-carotenes but the energy gaps between the excited states are reproduced well. In addition to singlet data, triplet-triplet absorption energies are presented. 
For β-carotene, where these transition energies are known from experiment, excellent agreement with our calculations is observed.
Parallel multireference configuration interaction calculations on mini-beta-carotenes and beta-carotene.

PubMed

Kleinschmidt, Martin; Marian, Christel M; Waletzke, Mirko; Grimme, Stefan

2009-01-28

We present a parallelized version of a direct selecting multireference configuration interaction (MRCI) code [S. Grimme and M. Waletzke, J. Chem. Phys. 111, 5645 (1999)]. The program can be run either in ab initio mode or as semiempirical procedure combined with density functional theory (DFT/MRCI). We have investigated the efficiency of the parallelization in case studies on carotenoids and porphyrins. The performance is found to depend heavily on the cluster architecture. While the speed-up on the older Intel Netburst technology is close to linear for up to 12-16 processes, our results indicate that it is not favorable to use all cores of modern Intel Dual Core or Quad Core processors simultaneously for memory intensive tasks. Due to saturation of the memory bandwidth, we recommend to run less demanding tasks on the latter architectures in parallel to two (Dual Core) or four (Quad Core) MRCI processes per node. The DFT/MRCI branch has been employed to study the low-lying singlet and triplet states of mini-n-beta-carotenes (n=3, 5, 7, 9) and beta-carotene (n=11) at the geometries of the ground state, the first excited triplet state, and the optically bright singlet state. The order of states depends heavily on the conjugation length and the nuclear geometry. The (1)B(u) (+) state constitutes the S(1) state in the vertical absorption spectrum of mini-3-beta-carotene but switches order with the 2 (1)A(g) (-) state upon excited state relaxation. In the longer carotenes, near degeneracy or even root flipping between the (1)B(u) (+) and (1)B(u) (-) states is observed whereas the 3 (1)A(g) (-) state is found to remain energetically above the optically bright (1)B(u) (+) state at all nuclear geometries investigated here. The DFT/MRCI method is seen to underestimate the absolute excitation energies of the longer mini-beta-carotenes but the energy gaps between the excited states are reproduced well. 
In addition to singlet data, triplet-triplet absorption energies are presented. For beta-carotene, where these transition energies are known from experiment, excellent agreement with our calculations is observed.
Studies in optical parallel processing. [All optical and electro-optic approaches]

NASA Technical Reports Server (NTRS)

Lee, S. H.

1978-01-01

Threshold and A/D devices for converting a gray scale image into a binary one were investigated for all-optical and opto-electronic approaches to parallel processing. Integrated optical logic circuits (IOC) and optical parallel logic devices (OPAL) were studied as an approach to processing optical binary signals. In the IOC logic scheme, a single row of an optical image is coupled into the IOC substrate at a time through an array of optical fibers. Parallel processing is carried out, on each image element of these rows, in the IOC substrate and the resulting output exits via a second array of optical fibers. The OPAL system for parallel processing, which uses a Fabry-Perot interferometer for image thresholding and analog-to-digital conversion, achieves a higher degree of parallel processing than is possible with IOC.
Parallel workflow tools to facilitate human brain MRI post-processing

PubMed Central

Cui, Zaixu; Zhao, Chenxi; Gong, Gaolang

2015-01-01

Multi-modal magnetic resonance imaging (MRI) techniques are widely applied in human brain studies. To obtain specific brain measures of interest from MRI datasets, a number of complex image post-processing steps are typically required. Parallel workflow tools have recently been developed, concatenating individual processing steps and enabling fully automated processing of raw MRI data to obtain the final results. These workflow tools are also designed to make optimal use of available computational resources and to support the parallel processing of different subjects or of independent processing steps for a single subject. Automated, parallel MRI post-processing tools can greatly facilitate relevant brain investigations and are being increasingly applied. In this review, we briefly summarize these parallel workflow tools and discuss relevant issues. PMID:26029043
Cooperative storage of shared files in a parallel computing system with dynamic block size

DOEpatents

Bent, John M.; Faibish, Sorin; Grider, Gary

2015-11-10

Improved techniques are provided for parallel writing of data to a shared object in a parallel computing system. A method is provided for storing data generated by a plurality of parallel processes to a shared object in a parallel computing system. The method is performed by at least one of the processes and comprises: dynamically determining a block size for storing the data; exchanging a determined amount of the data with at least one additional process to achieve a block of the data having the dynamically determined block size; and writing the block of the data having the dynamically determined block size to a file system. The determined block size comprises, e.g., a total amount of the data to be stored divided by the number of parallel processes. The file system comprises, for example, a log structured virtual parallel file system, such as a Parallel Log-Structured File System (PLFS).
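The block-size rule stated in this abstract (total amount of data divided by the number of parallel processes, followed by an exchange so that each process holds one block) can be sketched in Python. This is a single-process simulation under stated assumptions, not the patented method itself: the function name `rebalance` is hypothetical, the buffers are plain `bytes` objects, concatenation stands in for the inter-process exchange, and rounding the block size up is an assumption made here so that the blocks always cover all of the data.

```python
def rebalance(buffers):
    """Redistribute per-process buffers into equal-sized blocks.

    buffers: list of bytes objects, one per simulated process.
    Returns (block_size, blocks); every block has block_size bytes
    except possibly the trailing ones when the total does not
    divide evenly.
    """
    nprocs = len(buffers)
    if nprocs == 0:
        raise ValueError("need at least one process")
    total = sum(len(b) for b in buffers)
    # Total data divided by the number of processes, rounded up
    # (the rounding is an assumption, not taken from the abstract).
    block_size = -(-total // nprocs)
    # Concatenation stands in for the exchange between processes.
    stream = b"".join(buffers)
    blocks = [stream[i * block_size:(i + 1) * block_size]
              for i in range(nprocs)]
    return block_size, blocks


if __name__ == "__main__":
    size, blocks = rebalance([b"aaaaaa", b"bb", b"c"])
    print(size, blocks)
```

Each simulated process ends up with one block to write, which is the property the abstract attributes to the exchange step.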
Development of the GEM-MACH-FireWork System: An Air Quality Model with On-line Wildfire Emissions within the Canadian Operational Air Quality Forecast System

NASA Astrophysics Data System (ADS)

Pavlovic, Radenko; Chen, Jack; Beaulieu, Paul-Andre; Anselmo, David; Gravel, Sylvie; Moran, Mike; Menard, Sylvain; Davignon, Didier

2014-05-01

A wildfire emissions processing system has been developed to incorporate near-real-time emissions from wildfires and large prescribed burns into Environment Canada's real-time GEM-MACH air quality (AQ) forecast system. Since the GEM-MACH forecast domain covers Canada and most of the U.S.A., including Alaska, fire location information is needed for both of these large countries. During AQ model runs, emissions from individual fire sources are injected into elevated model layers based on plume-rise calculations and then transport and chemistry calculations are performed. This "on the fly" approach to the insertion of the fire emissions provides flexibility and efficiency since on-line meteorology is used and computational overhead in emissions pre-processing is reduced. GEM-MACH-FireWork, an experimental wildfire version of GEM-MACH, was run in real-time mode for the summers of 2012 and 2013 in parallel with the normal operational version. 48-hour forecasts were generated every 12 hours (at 00 and 12 UTC). Noticeable improvements in the AQ forecasts for PM2.5 were seen in numerous regions where fire activity was high. Case studies evaluating model performance for specific regions and computed objective scores will be included in this presentation. Using the lessons learned from the last two summers, Environment Canada will continue to work towards the goal of incorporating near-real-time intermittent wildfire emissions into the operational air quality forecast system.
Validation of the Polish version of the Multidimensional Body-Self Relations Questionnaire among women.

PubMed

Brytek-Matera, Anna; Rogoza, Radosław

2015-03-01

In Poland, appropriate means to assess body image are relatively limited. The aim of the study was to evaluate the psychometric properties of the Polish version of the Multidimensional Body-Self Relations Questionnaire (MBSRQ). To do so, a sample of 341 females ranging in age from 18 to 35 years (M = 23.09; SD = 3.14) participated in the present study. Because the original nine-factor model did not fit the data well in confirmatory factor analysis (RMSEA = 0.06; CFI = 0.75), an exploratory approach was employed. Based on parallel analysis and the minimum average partial test, an eight-factor structure of the Polish version of the MBSRQ was distinguished. Exploratory factor analysis revealed a factorial structure similar to the original version. The proposed model was tested using an exploratory structural equation modelling approach, which resulted in good fit (RMSEA = 0.04; CFI = 0.91). In the present study, the internal reliability assessed by McDonald's ω coefficient ranged from 0.66 to 0.91. In conclusion, the Polish version of the MBSRQ is a useful measure for the attitudinal component of body image assessment.
The Case For Prediction-based Best-effort Real-time Systems.

DTIC Science & Technology

1999-01-01

Real-time Systems Peter A. Dinda Loukas Kallivokas January...DISTRIBUTION STATEMENT A Approved for Public Release Distribution Unlimited The Case For Prediction-based Best-effort Real-time Systems Peter...Mellon University Pittsburgh, PA 15213 A version of this paper appeared in the Seventh Workshop on Parallel and Distributed Real-Time Systems
Local Norms and Test Characteristics for Selected Forms of the M.A.A. Placement Test.

ERIC Educational Resources Information Center

Melancon, Janet G.; Thompson, Bruce

The psychometric integrity of selected items from the Mathematics Association of America (MAA) placement tests for college students was investigated. Two alternative and parallel versions of the test were developed (Form A and Form B) for this study. Data for 539 students seeking admission into an undergraduate mathematics curriculum at a private…
Software Techniques for Balancing Computation & Communication in Parallel Systems

DTIC Science & Technology

1994-07-01

[Figure residue: task-mapping display reporting number of tasks, PE load variance, inter-task communication, and network traffic]...confusion. Because past versions for all files were saved and documented within SCCS, software developers were able to roll back to various combinations of
Innovation and the Future of e-Books. Reprints

ERIC Educational Resources Information Center

Warren, John

2009-01-01

The technological development and cultural acceptance of e-books today parallels the state of the printed book in the 15th century. E-books are increasingly available from a variety of distributors and retailers, and work on a myriad of devices, but the majority remain simply digitized versions of print books. Some devices or platforms include…
Testing of DRAINMOD for Forested Watersheds with Non-Pattern Drainage

Treesearch

Devendra M. Amatya; Ge Sun; R. Wayne Skaggs; Carl C. Trettin

2003-01-01

Models like DRAINMOD and its forestry version, DRAINLOB, have been specifically developed as a field scale model for evaluating hydrologic effects of crops (trees), soil, and water management practices for lands with pattern drainage (i.e. with parallel ditches) on relatively flat, high water table soils. These models conduct a water balance between the ditches to...
Validating the Factor Structure of the Self-Report Psychopathy Scale in a Community Sample

ERIC Educational Resources Information Center

Mahmut, Mehmet K.; Menictas, Con; Stevenson, Richard J.; Homewood, Judi

2011-01-01

Currently, there is no standard self-report measure of psychopathy in community-dwelling samples that parallels the most commonly used measure of psychopathy in forensic and clinical samples, the Psychopathy Checklist. A promising instrument is the Self-Report Psychopathy scale (SRP), which was derived from the original version the Psychopathy…
Further Validation of the IDAS: Evidence of Convergent, Discriminant, Criterion, and Incremental Validity

ERIC Educational Resources Information Center

Watson, David; O'Hara, Michael W.; Chmielewski, Michael; McDade-Montez, Elizabeth A.; Koffel, Erin; Naragon, Kristin; Stuart, Scott

2008-01-01

The authors explicated the validity of the Inventory of Depression and Anxiety Symptoms (IDAS; D. Watson et al., 2007) in 2 samples (306 college students and 605 psychiatric patients). The IDAS scales showed strong convergent validity in relation to parallel interview-based scores on the Clinician Rating version of the IDAS; the mean convergent…
Effects of First Occasion Test Experience on Longitudinal Cognitive Change

ERIC Educational Resources Information Center

Salthouse, Timothy A.

2013-01-01

Effects of additional test experience on longitudinal change in 5 cognitive abilities was examined in a sample of healthy adults ranging from 18 to 80 years of age. Participants receiving experience with parallel versions of the cognitive tests on the first occasion had more positive cognitive change an average of 2.5 years later than participants…

Role of Interactions and Correlations on Collective Dynamics of Molecular Motors Along Parallel Filaments

NASA Astrophysics Data System (ADS)

Midha, Tripti; Gupta, Arvind Kumar

2017-11-01

Cytoskeletal motors known as motor proteins are molecules that drive cellular transport along several parallel cytoskeletal filaments and support many biological processes. Experimental evidence suggests that they interact with the nearest molecules of their filament while performing any mechanical work. These interactions modify the microscopic level properties of motor proteins. In this work, a new version of the two-channel totally asymmetric simple exclusion process, which incorporates the intra-channel interactions in a thermodynamically consistent way, is proposed. As the existing approaches for multi-channel systems deviate from analyzing the combined effect of inter and intra-channel interactions, a new approach known as modified vertical cluster mean field is developed. The approach along with Monte Carlo simulations successfully encounters some correlations and computes the complex dynamic properties of the system. Role of symmetry of interactions and inter-channel coupling is observed on the phase diagrams, maximal particle current and its corresponding optimal interaction strength. Surprisingly, for all values of coupling rate and most of the interaction splittings, the optimal interaction strength corresponding to maximal current belongs to the case of weak repulsive interactions. Moreover, for weak interaction splittings and with an increase in the coupling rate, the optimal interaction strength tends towards the known experimental results. The effect of coupling as well as interaction energy is also measured for correlations. They are found to be short-range and weaker for repulsive and weak attractive interactions while they are long-range and stronger for large attractions.
Efficient iterative methods applied to the solution of transonic flows

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wissink, A.M.; Lyrintzis, A.S.; Chronopoulos, A.T.

1996-02-01

We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton's method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested: a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GMRES method. The preconditioner is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-Iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton-GMRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel Thinking Machines CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems. 38 refs., 14 figs., 7 tabs.
Efficiency Analysis of the Parallel Implementation of the SIMPLE Algorithm on Multiprocessor Computers

NASA Astrophysics Data System (ADS)

Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.

2017-12-01

This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
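For readers unfamiliar with how parallelization efficiency is usually quantified, the following Python sketch computes the textbook speed-up and efficiency from measured run times. It is offered as a hedged illustration only: the abstract does not define its "direct" and "indirect" evaluation methods, so this standard definition is an assumption and may not match either of them, and the function name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Return (speedup, efficiency) for a parallel run.

    speedup    = t_serial / t_parallel
    efficiency = speedup / n_procs   (1.0 means ideal scaling)
    """
    if t_serial <= 0 or t_parallel <= 0 or n_procs < 1:
        raise ValueError("times must be positive and n_procs >= 1")
    speedup = t_serial / t_parallel
    return speedup, speedup / n_procs


if __name__ == "__main__":
    # Made-up timings for illustration, not results from the paper.
    s, e = parallel_efficiency(100.0, 12.5, 10)
    print(f"speed-up {s:.1f}, efficiency {e:.0%}")
```

When a serial run is impractical for large grids, the same formula is often applied relative to the smallest feasible processor count instead of a single processor.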
Portability and Cross-Platform Performance of an MPI-Based Parallel Polygon Renderer

NASA Technical Reports Server (NTRS)

Crockett, Thomas W.

1999-01-01

Visualizing the results of computations performed on large-scale parallel computers is a challenging problem, due to the size of the datasets involved. One approach is to perform the visualization and graphics operations in place, exploiting the available parallelism to obtain the necessary rendering performance. Over the past several years, we have been developing algorithms and software to support visualization applications on NASA's parallel supercomputers. Our results have been incorporated into a parallel polygon rendering system called PGL. PGL was initially developed on tightly-coupled distributed-memory message-passing systems, including Intel's iPSC/860 and Paragon, and IBM's SP2. Over the past year, we have ported it to a variety of additional platforms, including the HP Exemplar, SGI Origin2000, Cray T3E, and clusters of Sun workstations. In implementing PGL, we have had two primary goals: cross-platform portability and high performance. Portability is important because (1) our manpower resources are limited, making it difficult to develop and maintain multiple versions of the code, and (2) NASA's complement of parallel computing platforms is diverse and subject to frequent change. Performance is important in delivering adequate rendering rates for complex scenes and ensuring that parallel computing resources are used effectively. Unfortunately, these two goals are often at odds. In this paper we report on our experiences with portability and performance of the PGL polygon renderer across a range of parallel computing platforms.
Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting

This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. (2) A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. (3) Device models that are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only). (4) Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase (a message passing parallel implementation), which allows it to run efficiently on a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.
Efficient multitasking: parallel versus serial processing of multiple tasks

PubMed Central

Fischer, Rico; Plessow, Franziska

2015-01-01

In the context of performance optimizations in multitasking, a central debate has unfolded in multitasking research around whether cognitive processes related to different tasks proceed only sequentially (one at a time), or can operate in parallel (simultaneously). This review features a discussion of theoretical considerations and empirical evidence regarding parallel versus serial task processing in multitasking. In addition, we highlight how methodological differences and theoretical conceptions determine the extent to which parallel processing in multitasking can be detected, to guide their employment in future research. Parallel and serial processing of multiple tasks are not mutually exclusive. Therefore, questions focusing exclusively on either task-processing mode are too simplified. We review empirical evidence and demonstrate that shifting between more parallel and more serial task processing critically depends on the conditions under which multiple tasks are performed. We conclude that efficient multitasking is reflected by the ability of individuals to adjust multitasking performance to environmental demands by flexibly shifting between different processing strategies of multiple task-component scheduling. PMID:26441742
Efficient multitasking: parallel versus serial processing of multiple tasks.

PubMed

Fischer, Rico; Plessow, Franziska

2015-01-01

In the context of performance optimizations in multitasking, a central debate has unfolded in multitasking research around whether cognitive processes related to different tasks proceed only sequentially (one at a time), or can operate in parallel (simultaneously). This review features a discussion of theoretical considerations and empirical evidence regarding parallel versus serial task processing in multitasking. In addition, we highlight how methodological differences and theoretical conceptions determine the extent to which parallel processing in multitasking can be detected, to guide their employment in future research. Parallel and serial processing of multiple tasks are not mutually exclusive. Therefore, questions focusing exclusively on either task-processing mode are too simplified. We review empirical evidence and demonstrate that shifting between more parallel and more serial task processing critically depends on the conditions under which multiple tasks are performed. We conclude that efficient multitasking is reflected by the ability of individuals to adjust multitasking performance to environmental demands by flexibly shifting between different processing strategies of multiple task-component scheduling.
Scaling Optimization of the SIESTA MHD Code

NASA Astrophysics Data System (ADS)

Seal, Sudip; Hirshman, Steven; Perumalla, Kalyan

2013-10-01

SIESTA is a parallel three-dimensional plasma equilibrium code capable of resolving magnetic islands at high spatial resolutions for toroidal plasmas. Originally designed to exploit small-scale parallelism, SIESTA has now been scaled to execute efficiently over several thousands of processors P. This scaling improvement was accomplished with minimal intrusion to the execution flow of the original version. First, the efficiency of the iterative solutions was improved by integrating the parallel tridiagonal block solver code BCYCLIC. Krylov-space generation in GMRES was then accelerated using a customized parallel matrix-vector multiplication algorithm. Novel parallel Hessian generation algorithms were integrated and memory access latencies were dramatically reduced through loop nest optimizations and data layout rearrangement. These optimizations sped up equilibria calculations by factors of 30-50. It is possible to compute solutions with granularity N/P near unity on extremely fine radial meshes (N > 1024 points). Grid separation in SIESTA, which manifests itself primarily in the resonant components of the pressure far from rational surfaces, is strongly suppressed by finer meshes. Large problem sizes of up to 300 K simultaneous non-linear coupled equations have been solved on the NERSC supercomputers. Work supported by U.S. DOE under Contract DE-AC05-00OR22725 with UT-Battelle, LLC.
Participatory ergonomics for psychological factors evaluation in work system design.

PubMed

Wang, Lingyan; Lau, Henry Y K

2012-01-01

It is widely recognized that workers, whose voices need to be heard, should be actively encouraged to participate fully in the early design stages of new ergonomic work systems, which encompass the development and implementation of new tools, workplaces, technologies or organizations. This paper presents a novel participatory strategy to evaluate three key psychological factors, namely mental fatigue, spiritual stress, and emotional satisfaction, in work system design based on a modified version of Participatory Ergonomics (PE). Specifically, it integrates a PE technique with a formulation view by combining the parallel development of PE strategies, frameworks and functions throughout the entire work system design process, so as to bridge the gap between qualitative and quantitative analysis of psychological factors that can have adverse or advantageous effects on workers' physiological and behavioral performance.
Image stack alignment in full-field X-ray absorption spectroscopy using SIFT_PyOCL.

PubMed

Paleo, Pierre; Pouyet, Emeline; Kieffer, Jérôme

2014-03-01

Full-field X-ray absorption spectroscopy experiments allow the acquisition of millions of spectra within minutes. However, the construction of the hyperspectral image requires an image alignment procedure with sub-pixel precision. While the image correlation algorithm has originally been used for image re-alignment using translations, the Scale Invariant Feature Transform (SIFT) algorithm (which is by design robust versus rotation, illumination change, translation and scaling) presents an additional advantage: the alignment can be limited to a region of interest of any arbitrary shape. In this context, a Python module, named SIFT_PyOCL, has been developed. It implements a parallel version of the SIFT algorithm in OpenCL, providing high-speed image registration and alignment both on processors and graphics cards. The performance of the algorithm allows online processing of large datasets.
Pececillo

DOE Office of Scientific and Technical Information (OSTI.GOV)

Carlson, Neil; Jibben, Zechariah; Brady, Peter

2017-06-28

Pececillo is a proxy-app for the open source Truchas metal processing code (LA-CC-15-097). It implements many of the physics models used in Truchas: free-surface, incompressible Navier-Stokes fluid dynamics (e.g., water waves); heat transport, material phase change, view factor thermal radiation; species advection-diffusion; quasi-static, elastic/plastic solid mechanics with contact; electromagnetics (Maxwell's equations). The models are simplified versions that retain the fundamental computational complexity of the Truchas models while omitting many non-essential features and modeling capabilities. The purpose is to expose Truchas algorithms in a greatly simplified context where computer science problems related to parallel performance on advanced architectures can be more easily investigated. While Pececillo is capable of performing simulations representative of typical Truchas metal casting, welding, and additive manufacturing simulations, it lacks many of the modeling capabilities needed for real applications.
Why is happy-sad more difficult? Focal emotional information impairs inhibitory control in children and adults.

PubMed

Kramer, Hannah J; Lagattuta, Kristin Hansen; Sayfan, Liat

2015-02-01

This study compared the relative difficulty of the happy-sad inhibitory control task (say "happy" for the sad face and "sad" for the happy face) against other card tasks that varied by the presence and type (focal vs. peripheral; negative vs. positive) of emotional information in a sample of 4- to 11-year-olds and adults (N = 264). Participants also completed parallel "name games" (direct labeling). All age groups made more errors and took longer to respond to happy-sad compared to other versions, and the relative difficulty of happy-sad increased with age. The happy-sad name game even posed a greater challenge than some opposite games. These data provide insight into the impact of emotions on cognitive processing across a wide age range. PsycINFO Database Record (c) 2015 APA, all rights reserved.
Interferometric scanning optical microscope for surface characterization.

PubMed

Offside, M J; Somekh, M G

1992-11-01

A phase-sensitive scanning optical microscope is described that can measure surface height changes down to 0.1 nm. This is achieved by using two heterodyne Michelson interferometers in parallel. One interferometer probes the sample with a tightly focused beam, and the second has a collimated beam that illuminates a large area of the surface, providing a large-area on-sample reference. This is facilitated by using a specially constructed objective lens that permits the relative areas illuminated by the two probe beams to be varied both arbitrarily and independently, thus ensuring an accurate absolute phase measurement. We subtracted the phase outputs from each interferometer to provide the sample phase information, canceling the phase noise resulting from microphonics in the process. Results from a prototype version of the microscope are presented that demonstrate the advantages of the system over existing techniques.
Asymmetry in the Farley-Buneman dispersion relation caused by parallel electric fields

NASA Astrophysics Data System (ADS)

Forsythe, Victoriya V.; Makarevich, Roman A.

2016-11-01

An implicit assumption utilized in studies of E region plasma waves generated by the Farley-Buneman instability (FBI) is that the FBI dispersion relation and its solutions for the growth rate and phase velocity are perfectly symmetric with respect to the reversal of the wave propagation component parallel to the magnetic field. In the present study, a recently derived general dispersion relation that describes fundamental plasma instabilities in the lower ionosphere including FBI is considered and it is demonstrated that the dispersion relation is symmetric only for background electric fields that are perfectly perpendicular to the magnetic field. It is shown that parallel electric fields result in significant differences between the growth rates and phase velocities for propagation of parallel components of opposite signs. These differences are evaluated using numerical solutions of the general dispersion relation and shown to exhibit an approximately linear relationship with the parallel electric field near the E region peak altitude of 110 km. An analytic expression for the differences is also derived from an approximate version of the dispersion relation, with comparisons between numerical and analytic results agreeing near 110 km. It is further demonstrated that parallel electric fields do not change the overall symmetry when the full 3-D wave propagation vector is reversed, with no symmetry seen when either the perpendicular or parallel component is reversed. The present results indicate that moderate-to-strong parallel electric fields of 0.1-1.0 mV/m can result in experimentally measurable differences between the characteristics of plasma waves with parallel propagation components of opposite polarity.
Programming a hillslope water movement model on the MPP

NASA Technical Reports Server (NTRS)

Devaney, J. E.; Irving, A. R.; Camillo, P. J.; Gurney, R. J.

1987-01-01

A physically based numerical model of heat and moisture flow within a hillslope was developed on a parallel architecture computer, as a precursor to a model of a complete catchment. Moisture flow within a catchment includes evaporation, overland flow, flow in unsaturated soil, and flow in saturated soil. Because of the empirical evidence that moisture flow in unsaturated soil is mainly in the vertical direction, flow in the unsaturated zone can be modeled as a series of one dimensional columns. This initial version of the hillslope model includes evaporation and a single column of one dimensional unsaturated zone flow. This case has already been solved on an IBM 3081 computer and is now being applied to the massively parallel processor architecture so as to make the extension to the one dimensional case easier and to check the problems and benefits of using a parallel architecture machine.
Optimized and parallelized implementation of the electronegativity equalization method and the atom-bond electronegativity equalization method.

PubMed

Vareková, R Svobodová; Koca, J

2006-02-01

The most common way to calculate charge distribution in a molecule is ab initio quantum mechanics (QM). Some faster alternatives to QM have also been developed, the so-called "equalization methods" EEM and ABEEM, which are based on DFT. We have implemented and optimized the EEM and ABEEM methods and created the EEM SOLVER and ABEEM SOLVER programs. It has been found that the most time-consuming part of equalization methods is the reduction of the matrix belonging to the equation system generated by the method. Therefore, for both methods this part was replaced by the parallel algorithm WIRS and implemented within the PVM environment. The parallelized versions of the programs EEM SOLVER and ABEEM SOLVER showed promising results, especially on a single computer with several processors (compact PVM). The implemented programs are available through the Web page http://ncbr.chemi.muni.cz/~n19n/eem_abeem.
Run-time parallelization and scheduling of loops

NASA Technical Reports Server (NTRS)

Saltz, Joel H.; Mirchandaney, Ravi; Crowley, Kay

1990-01-01

Run time methods are studied to automatically parallelize and schedule iterations of a do loop in certain cases, where compile-time information is inadequate. The methods presented involve execution time preprocessing of the loop. At compile-time, these methods set up the framework for performing a loop dependency analysis. At run time, wave fronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. Symbolic transformation rules are used to produce: inspector procedures that perform execution time preprocessing and executors or transformed versions of source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run time reordering of loop indices can have a significant impact on performance. Furthermore, the overheads associated with this type of reordering are amortized when the loop is executed several times with the same dependency structure.
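The inspector/executor split described above can be illustrated with a minimal Python sketch. It is not the authors' code: the loop body `x[i] += b[i] * x[ia[i]]`, the convention that `ia[i] == -1` means "no read", and the restriction `ia[i] < i` are assumptions chosen to keep the example small. The inspector assigns each iteration a wavefront number from the run-time index array; the executor then walks the wavefronts in order, and the iterations inside one wavefront are mutually independent (they are run sequentially here, but could be dispatched concurrently).

```python
from collections import defaultdict


def inspector(ia):
    """Run-time preprocessing: group loop iterations into wavefronts.

    Assumption for this sketch: iteration i reads x[ia[i]] and writes x[i],
    with ia[i] < i, or ia[i] == -1 when the iteration reads nothing.
    """
    wave = [0] * len(ia)
    for i, j in enumerate(ia):
        if j >= 0:
            # i must run after the iteration that produced x[j]
            wave[i] = wave[j] + 1
    fronts = defaultdict(list)
    for i, w in enumerate(wave):
        fronts[w].append(i)
    return [fronts[w] for w in sorted(fronts)]


def executor(x, b, ia, fronts):
    """Execute the loop wavefront by wavefront (reordered iteration space)."""
    x = list(x)
    for front in fronts:
        # Iterations within one front have no dependences on each other.
        for i in front:
            if ia[i] >= 0:
                x[i] += b[i] * x[ia[i]]
    return x


if __name__ == "__main__":
    ia = [-1, 0, 0, 1, -1, 4]
    print(inspector(ia))  # [[0, 4], [1, 2, 5], [3]]
```

Because the wavefronts depend only on `ia`, the inspector's cost can be amortized when the same dependency structure is reused across several executions of the loop, which is the situation the abstract highlights.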
Parallelized multi–graphics processing unit framework for high-speed Gabor-domain optical coherence microscopy

PubMed Central

Tankam, Patrice; Santhanam, Anand P.; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P.

2014-01-01

Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm³ skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing. PMID:24695868
Parallelized multi-graphics processing unit framework for high-speed Gabor-domain optical coherence microscopy.

PubMed

Tankam, Patrice; Santhanam, Anand P; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P

2014-07-01

Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm³ skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing.
BLESS 2: accurate, memory-efficient and fast error correction method.

PubMed

Heo, Yun; Ramachandran, Anand; Hwu, Wen-Mei; Ma, Jian; Chen, Deming

2016-08-01

The most important features of error correction tools for sequencing data are accuracy, memory efficiency and fast runtime. The previous version of BLESS was highly memory-efficient and accurate, but it was too slow to handle reads from large genomes. We have developed a new version of BLESS to improve runtime and accuracy while maintaining a small memory usage. The new version, called BLESS 2, has an error correction algorithm that is more accurate than BLESS, and the algorithm has been parallelized using hybrid MPI and OpenMP programming. BLESS 2 was compared with five top-performing tools, and it was found to be the fastest when it was executed on two computing nodes using MPI, with each node containing twelve cores. Also, BLESS 2 showed at least 11% higher gain while retaining the memory efficiency of the previous version for large genomes. BLESS 2 is freely available at https://sourceforge.net/projects/bless-ec. Contact: dchen@illinois.edu. Supplementary data are available at Bioinformatics online.

A high-speed linear algebra library with automatic parallelism

NASA Technical Reports Server (NTRS)

Boucher, Michael L.

1994-01-01

Parallel or distributed processing is key to getting the highest performance from workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely small even though there are numerous computationally demanding programs that would significantly benefit from application of parallel processing. This paper describes DSSLIB, which is a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.
Neural Parallel Engine: A toolbox for massively parallel neural signal processing.

PubMed

Tam, Wing-Kin; Yang, Zhi

2018-05-01

Large-scale neural recordings provide detailed information on neuronal activities and can help elicit the underlying neural mechanisms of the brain. However, the computational burden is also formidable when we try to process the huge data stream generated by such recordings. In this study, we report the development of Neural Parallel Engine (NPE), a toolbox for massively parallel neural signal processing on graphical processing units (GPUs). It offers a selection of the most commonly used routines in neural signal processing such as spike detection and spike sorting, including advanced algorithms such as exponential-component-power-component (EC-PC) spike detection and binary pursuit spike sorting. We also propose a new method for detecting peaks in parallel through a parallel compact operation. Our toolbox is able to offer a 5× to 110× speedup compared with its CPU counterparts depending on the algorithms. A user-friendly MATLAB interface is provided to allow easy integration of the toolbox into existing workflows. Previous efforts on GPU neural signal processing only focus on a few rudimentary algorithms, are not well-optimized and often do not provide a user-friendly programming interface to fit into existing workflows. There is a strong need for a comprehensive toolbox for massively parallel neural signal processing. A new toolbox for massively parallel neural signal processing has been created. It can offer significant speedup in processing signals from large-scale recordings up to thousands of channels. Copyright © 2018 Elsevier B.V. All rights reserved.
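The idea of detecting peaks through a compact operation can be sketched as follows. This is a NumPy illustration of the general stream-compaction pattern (flag, prefix sum, scatter), not NPE's GPU implementation; the function name, the peak definition (strictly greater than the left neighbor, not less than the right neighbor, above a threshold) and the threshold parameter are assumptions made for the example. On a GPU each of the three steps is a data-parallel primitive.

```python
import numpy as np


def detect_peaks_compact(signal, threshold):
    """Return indices of local maxima above `threshold` via flag/scan/scatter."""
    s = np.asarray(signal, dtype=float)

    # Step 1 (map): every sample decides independently whether it is a peak.
    flags = np.zeros(s.size, dtype=bool)
    flags[1:-1] = (s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:]) & (s[1:-1] > threshold)

    # Step 2 (scan): an exclusive prefix sum gives each flagged sample
    # its slot in the compacted output.
    f = flags.astype(np.int64)
    slots = np.cumsum(f) - f

    # Step 3 (scatter): flagged samples write their index to their slot.
    out = np.empty(int(f.sum()), dtype=np.int64)
    idx = np.arange(s.size)
    out[slots[flags]] = idx[flags]
    return out


if __name__ == "__main__":
    trace = [0, 1, 0, 3, 5, 2, 2, 4, 1]
    print(detect_peaks_compact(trace, threshold=0.5))  # [1 4 7]
```

The compact step matters because the number of peaks is not known in advance; the prefix sum lets every flagged sample compute its output position without coordinating with the others.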
Anatomically constrained neural network models for the categorization of facial expression

NASA Astrophysics Data System (ADS)

McMenamin, Brenton W.; Assadi, Amir H.

2004-12-01

The ability to recognize facial expression in humans is performed with the amygdala, which uses parallel processing streams to identify the expressions quickly and accurately. Additionally, it is possible that a feedback mechanism may play a role in this process as well. Implementing a model with similar parallel structure and feedback mechanisms could be used to improve current facial recognition algorithms for which varied expressions are a source of error. An anatomically constrained artificial neural-network model was created that uses this parallel processing architecture and feedback to categorize facial expressions. The presence of a feedback mechanism was not found to significantly improve performance for models with parallel architecture. However, the use of parallel processing streams significantly improved accuracy over a similar network that did not have parallel architecture. Further investigation is necessary to determine the benefits of using parallel streams and feedback mechanisms in more advanced object recognition tasks.
Anatomically constrained neural network models for the categorization of facial expression

NASA Astrophysics Data System (ADS)

McMenamin, Brenton W.; Assadi, Amir H.

2005-01-01

The ability to recognize facial expression in humans is performed with the amygdala, which uses parallel processing streams to identify the expressions quickly and accurately. Additionally, it is possible that a feedback mechanism may play a role in this process as well. Implementing a model with similar parallel structure and feedback mechanisms could be used to improve current facial recognition algorithms for which varied expressions are a source of error. An anatomically constrained artificial neural-network model was created that uses this parallel processing architecture and feedback to categorize facial expressions. The presence of a feedback mechanism was not found to significantly improve performance for models with parallel architecture. However, the use of parallel processing streams significantly improved accuracy over a similar network that did not have parallel architecture. Further investigation is necessary to determine the benefits of using parallel streams and feedback mechanisms in more advanced object recognition tasks.
Parallel processing data network of master and slave transputers controlled by a serial control network

DOEpatents

Crosetto, D.B.

1996-12-31

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.
Parallel processing data network of master and slave transputers controlled by a serial control network

DOEpatents

Crosetto, Dario B.

1996-01-01

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.
Infrastructure Upgrades to Support Model Longevity and New Applications: The Variable Infiltration Capacity Model Version 5.0 (VIC 5.0)

NASA Astrophysics Data System (ADS)

Nijssen, B.; Hamman, J.; Bohn, T. J.

2015-12-01

The Variable Infiltration Capacity (VIC) model is a macro-scale semi-distributed hydrologic model. VIC development began in the early 1990s and it has been used extensively, applied from basin to global scales. VIC has been applied in many use cases, including the construction of hydrologic data sets, trend analysis, data evaluation and assimilation, forecasting, coupled climate modeling, and climate change impact analysis. Ongoing applications of the VIC model include the University of Washington's drought monitor and forecast systems, and NASA's land data assimilation systems. The development of VIC version 5.0 focused on reconfiguring the legacy VIC source code to support a wider range of modern modeling applications. The VIC source code has been moved to a public GitHub repository to encourage participation by the model development community-at-large. The reconfiguration has separated the physical core of the model from the driver, which is responsible for memory allocation, pre- and post-processing and I/O. VIC 5.0 includes four drivers that use the same physical model core: classic, image, CESM, and Python. The classic driver supports legacy VIC configurations and runs in the traditional time-before-space configuration. The image driver includes a space-before-time configuration, netCDF I/O, and uses MPI for parallel processing. This configuration facilitates the direct coupling of streamflow routing, reservoir, and irrigation processes within VIC. The image driver is the foundation of the CESM driver, which couples VIC to CESM's CPL7 and a prognostic atmosphere. Finally, we have added a Python driver that provides access to the functions and datatypes of VIC's physical core from a Python interface. This presentation demonstrates how reconfiguring legacy source code extends the life and applicability of a research model.
Super and parallel computers and their impact on civil engineering

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kamat, M.P.

1986-01-01

This book presents the papers given at a conference on the use of supercomputers in civil engineering. Topics considered at the conference included solving nonlinear equations on a hypercube, a custom architectured parallel processing system, distributed data processing, algorithms, computer architecture, parallel processing, vector processing, computerized simulation, and cost benefit analysis.
A new version of Stochastic-parallel-gradient-descent algorithm (SPGD) for phase correction of a distorted orbital angular momentum (OAM) beam

NASA Astrophysics Data System (ADS)

Jiao Ling, Lin; Xiaoli, Yin; Huan, Chang; Xiaozhou, Cui; Yi-Lin, Guo; Huan-Yu, Liao; Chun-Yu, Gao; Guohua, Wu; Guang-Yao, Liu; Jin-Kun, Jiang; Qing-Hua, Tian

2018-02-01

Atmospheric turbulence limits the performance of orbital angular momentum-based free-space optical communication (FSO-OAM) systems. In order to compensate for phase distortion induced by atmospheric turbulence, wavefront sensorless adaptive optics (WSAO) has been proposed and studied in recent years. In this paper a new version of SPGD called MZ-SPGD, which combines the Z-SPGD based on the deformable mirror influence function and the M-SPGD based on the Zernike polynomials, is proposed. Numerical simulations show that the hybrid method markedly decreases convergence time while achieving the same compensation effect as Z-SPGD and M-SPGD.
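For readers unfamiliar with SPGD, the basic iteration that such variants build on can be sketched in a few lines of Python. This is the generic textbook form of stochastic parallel gradient descent, not the MZ-SPGD method of the paper; the quadratic toy metric in the usage example, the gain, the perturbation amplitude and the iteration count are all assumptions chosen for illustration. All control channels are perturbed at once with random signs, the resulting change in a scalar performance metric is measured, and the controls are moved along the perturbation in proportion to that change.

```python
import numpy as np


def spgd(metric, u0, gain=0.5, sigma=0.1, iters=500, seed=0):
    """Maximize `metric(u)` with two-sided stochastic parallel gradient descent.

    metric : callable returning a scalar quality measure (higher is better)
    u0     : initial control vector (e.g. actuator or mode coefficients)
    """
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    for _ in range(iters):
        # Perturb every channel in parallel with a random sign.
        delta = sigma * rng.choice([-1.0, 1.0], size=u.shape)
        # Two metric evaluations estimate the directional change.
        dJ = metric(u + delta) - metric(u - delta)
        # Move all channels along the perturbation, scaled by the change.
        u += gain * dJ * delta
    return u


if __name__ == "__main__":
    # Toy stand-in for a beam-quality metric: peaks when u equals `target`.
    target = np.array([0.3, -0.2, 0.5, 0.1])
    u = spgd(lambda v: -np.sum((v - target) ** 2), np.zeros(4))
    print(np.round(u, 3))
```

The choice of basis for `u` (actuator commands versus modal coefficients) is what distinguishes the variants named in the abstract; the update rule itself stays the same.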
Time-Dependent Simulation of Incompressible Flow in a Turbopump Using Overset Grid Approach

NASA Technical Reports Server (NTRS)

Kiris, Cetin; Kwak, Dochan

2001-01-01

This paper reports the progress being made towards complete unsteady turbopump simulation capability by using overset grid systems. A computational model of a turbopump impeller is used as a test case for the performance evaluation of the MPI, hybrid MPI/OpenMP, and MLP versions of the INS3D code. Relative motion of the grid system for rotor-stator interaction was obtained by employing overset grid techniques. Unsteady computations for a turbopump, which contains 114 zones with 34.3 million grid points, are performed on Origin 2000 systems at NASA Ames Research Center. The approach taken for these simulations and the performance of the parallel versions of the code are presented.
Design and Implementation of a Distributed Version of the NASA Engine Performance Program

NASA Technical Reports Server (NTRS)

Cours, Jeffrey T.

1994-01-01

Distributed NEPP is a new version of the NASA Engine Performance Program that runs in parallel on a collection of Unix workstations connected through a network. The program is fault-tolerant, efficient, and shows significant speed-up in a multi-user, heterogeneous environment. This report describes the issues involved in designing distributed NEPP, the algorithms the program uses, and the performance distributed NEPP achieves. It develops an analytical model to predict and measure the performance of the simple distribution, multiple distribution, and fault-tolerant distribution algorithms that distributed NEPP incorporates. Finally, the appendices explain how to use distributed NEPP and document the organization of the program's source code.
Parallel processing architecture for computing inverse differential kinematic equations of the PUMA arm

NASA Technical Reports Server (NTRS)

Hsia, T. C.; Lu, G. Z.; Han, W. H.

1987-01-01

In advanced robot control problems, on-line computation of the inverse Jacobian solution is frequently required. Parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be implemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.
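The computation being accelerated is the solution of the inverse differential kinematic equation, that is, finding joint rates from a desired end-effector velocity. A minimal serial sketch in Python is given below. It uses a hypothetical planar two-link arm rather than the six-joint PUMA, and a general linear solve rather than the systolic pipeline of the paper; the link lengths and function names are assumptions.

```python
import numpy as np


def planar_two_link_jacobian(q, l1=1.0, l2=0.8):
    """Jacobian of the tip position of a planar two-link arm (illustrative only)."""
    q1, q2 = q
    s1, c1 = np.sin(q1), np.cos(q1)
    s12, c12 = np.sin(q1 + q2), np.cos(q1 + q2)
    return np.array([
        [-l1 * s1 - l2 * s12, -l2 * s12],
        [l1 * c1 + l2 * c12, l2 * c12],
    ])


def joint_rates(q, xdot, l1=1.0, l2=0.8):
    """Solve J(q) * qdot = xdot for the joint rates qdot.

    Solving the linear system avoids forming the inverse explicitly;
    the configuration must be non-singular (sin(q2) != 0 for this arm).
    """
    J = planar_two_link_jacobian(q, l1, l2)
    return np.linalg.solve(J, np.asarray(xdot, dtype=float))


if __name__ == "__main__":
    q = (0.4, 0.9)
    print(joint_rates(q, xdot=(0.1, -0.2)))
```

In a control loop this solve is repeated at every sample instant, which is why a fixed-latency hardware pipeline is attractive.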
Performance evaluation of canny edge detection on a tiled multicore architecture

NASA Astrophysics Data System (ADS)

Brethorst, Andrew Z.; Desai, Nehal; Enright, Douglas P.; Scrofano, Ronald

2011-01-01

In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. Included in these strategies are different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP, and it is capable of parallelization across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that, for the same number of threads, programmer-implemented domain decomposition exhibits higher speed-ups than the compiler-managed loop-level parallelism implemented with OpenMP.
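The domain decomposition strategy described above can be sketched in a few lines of Python. This is an illustration only, not the authors' Tile64 code: the 3-point vertical mean filter is a stand-in for a Canny stage, and the names `blur_rows` and `blur_decomposed` are hypothetical. Each strip reads one "halo" row from its neighbors so that the decomposed result matches the whole-image result.

```python
from concurrent.futures import ThreadPoolExecutor


def blur_rows(image, top, bottom):
    """Apply a 3-point vertical mean filter to rows top..bottom-1 (edges clamped)."""
    height = len(image)
    out = []
    for r in range(top, bottom):
        above = image[max(r - 1, 0)]           # halo row from the strip above
        below = image[min(r + 1, height - 1)]  # halo row from the strip below
        out.append([(a + m + b) / 3.0 for a, m, b in zip(above, image[r], below)])
    return out


def blur_decomposed(image, n_parts):
    """Split the image into horizontal strips and filter each strip independently."""
    height = len(image)
    bounds = [(i * height // n_parts, (i + 1) * height // n_parts)
              for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        strips = pool.map(lambda b: blur_rows(image, b[0], b[1]), bounds)
        return [row for strip in strips for row in strip]


# Small synthetic image (assumed data, for illustration only).
image = [[float((r * 7 + c * 3) % 11) for c in range(8)] for r in range(10)]
serial = blur_rows(image, 0, len(image))
decomposed = blur_decomposed(image, 4)
```

Threads are used here only to keep the sketch self-contained; in CPython they show the structure of the decomposition rather than a real speed-up.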
Use of parallel computing in mass processing of laser data

NASA Astrophysics Data System (ADS)

Będkowski, J.; Bratuś, R.; Prochaska, M.; Rzonca, A.

2015-12-01

The first part of the paper includes a description of the rules used to generate the algorithm needed for the purpose of parallel computing and also discusses the origins of the idea of research on the use of graphics processors in large scale processing of laser scanning data. The next part of the paper includes the results of an efficiency assessment performed for an array of different processing options, all of which were substantially accelerated with parallel computing. The processing options were divided into the generation of orthophotos using point clouds, coloring of point clouds, transformations, and the generation of a regular grid, as well as advanced processes such as the detection of planes and edges, point cloud classification, and the analysis of data for the purpose of quality control. Most algorithms had to be formulated from scratch in the context of the requirements of parallel computing. A few of the algorithms were based on existing technology developed by the Dephos Software Company and then adapted to parallel computing in the course of this research study. Processing time was determined for each process employed for a typical quantity of data processed, which helped confirm the high efficiency of the solutions proposed and the applicability of parallel computing to the processing of laser scanning data. The high efficiency of parallel computing yields new opportunities in the creation and organization of processing methods for laser scanning data.
Implementation and evaluation of the Level Set method: Towards efficient and accurate simulation of wet etching for microengineering applications

NASA Astrophysics Data System (ADS)

Montoliu, C.; Ferrando, N.; Gosálvez, M. A.; Cerdá, J.; Colom, R. J.

2013-10-01

The use of atomistic methods, such as the Continuous Cellular Automaton (CCA), is currently regarded as a computationally efficient and experimentally accurate approach for the simulation of anisotropic etching of various substrates in the manufacture of Micro-electro-mechanical Systems (MEMS). However, when the features of the chemical process are modified, a time-consuming calibration process needs to be used to transform the new macroscopic etch rates into a corresponding set of atomistic rates. Furthermore, changing the substrate requires a labor-intensive effort to reclassify most atomistic neighborhoods. In this context, the Level Set (LS) method provides an alternative approach where the macroscopic forces affecting the front evolution are directly applied at the discrete level, thus avoiding the need for reclassification and/or calibration. Correspondingly, we present a fully-operational Sparse Field Method (SFM) implementation of the LS approach, discussing in detail the algorithm and providing a thorough characterization of the computational cost and simulation accuracy, including a comparison to the performance by the most recent CCA model. We conclude that the SFM implementation achieves similar accuracy as the CCA method with less fluctuations in the etch front and requiring roughly 4 times less memory. Although SFM can be up to 2 times slower than CCA for the simulation of anisotropic etchants, it can also be up to 10 times faster than CCA for isotropic etchants. In addition, we present a parallel, GPU-based implementation (gSFM) and compare it to an optimized, multicore CPU version (cSFM), demonstrating that the SFM algorithm can be successfully parallelized and the simulation times consequently reduced, while keeping the accuracy of the simulations. 
Although modern multicore CPUs provide an acceptable option, the massively parallel architecture of modern GPUs is more suitable, as reflected by computational times for gSFM up to 7.4 times faster than for cSFM.
Changing of the Guard: Interpretive Continuity of the 2005 Strong Interest Inventory

ERIC Educational Resources Information Center

Bailey, Donna C.; Larson, Lisa M.; Borgen, Fred H.; Gasser, Courtney E.

2008-01-01

This study is the first to examine the equivalence of the 2005 Strong Interest Inventory with the 1994 Strong. The authors examine the parallel content scales of the two versions for female and male college students separately (n = 622). The scales include the six General Occupational Themes (GOTs), 22 of the 25 Basic Interest Scales (BISs) of the…
Basic Research in Computer Science

DTIC Science & Technology

1993-10-01

...Parallel Programming. ACM, April, 1991. [Laird et al. 90] Laird, J.E., C.B. Congdon, C.B. Altmann, and K. Swedlow. Soar User's Manual: Version 5.2
Domain decomposition methods for the parallel computation of reacting flows

NASA Technical Reports Server (NTRS)

Keyes, David E.

1988-01-01

Domain decomposition is a natural route to parallel computing for partial differential equation solvers. Subdomains of which the original domain of definition is comprised are assigned to independent processors at the price of periodic coordination between processors to compute global parameters and maintain the requisite degree of continuity of the solution at the subdomain interfaces. In the domain-decomposed solution of steady multidimensional systems of PDEs by finite difference methods using a pseudo-transient version of Newton iteration, the only portion of the computation which generally stands in the way of efficient parallelization is the solution of the large, sparse linear systems arising at each Newton step. For some Jacobian matrices drawn from an actual two-dimensional reacting flow problem, comparisons are made between relaxation-based linear solvers and also preconditioned iterative methods of Conjugate Gradient and Chebyshev type, focusing attention on both iteration count and global inner product count. The generalized minimum residual method with block-ILU preconditioning is judged the best serial method among those considered, and parallel numerical experiments on the Encore Multimax demonstrate for it approximately 10-fold speedup on 16 processors.
The Star Wars Scroll Illusion.

PubMed

Shapiro, Arthur G

2015-10-01

The Star Wars Scroll Illusion is a dynamic version of the Leaning Tower Illusion. When two copies of a Star-Wars-like scrolling text are placed side by side (with separate vanishing points), the two scrolls appear to head in different directions even though they are physically parallel in the picture plane. Variations of the illusion are shown with one vanishing point, as well as from an inverted perspective where the scrolls appear to originate in the distance. The demos highlight the conflict between the physical lines in the picture plane and perspective interpretation: With two perspective points, the scrolling texts are parallel to each other in the picture plane but not in perspective interpretation; with one perspective point, the texts are not parallel to each other in the picture plane but are parallel to each other in perspective interpretation. The size of the effect is linearly related to the angle of rotation of the scrolls into the third dimension; the Scroll Illusion is stronger than the Leaning Tower Illusion for rotation angles between 35° and 90°. There is no effect of motion per se on the strength of the illusion.
Performance of a parallel thermal-hydraulics code TEMPEST

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fann, G.I.; Trent, D.S.

The authors describe the parallelization of the Tempest thermal-hydraulics code. The serial version of this code is used for production quality 3-D thermal-hydraulics simulations. Good speedup was obtained with a parallel diagonally preconditioned BiCGStab non-symmetric linear solver, using a spatial domain decomposition approach for the semi-iterative pressure-based and mass-conserved algorithm. The test case used here to illustrate the performance of the BiCGStab solver is a 3-D natural convection problem modeled using finite volume discretization in cylindrical coordinates. The BiCGStab solver replaced the LSOR-ADI method for solving the pressure equation in TEMPEST. BiCGStab also solves the coupled thermal energy equation. Scaling performance for 3 problem sizes (221220 nodes, 358120 nodes, and 701220 nodes) is presented. These problems were run on 2 different parallel machines: IBM-SP and SGI PowerChallenge. The largest problem attains a speedup of 68 on a 128-processor IBM-SP. In real terms, this is over 34 times faster than the fastest serial production time using the LSOR-ADI solver.
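As a worked reading of the figures quoted above, and assuming the usual definitions of speedup and parallel efficiency (the symbols S, p, T_1, T_p and E are introduced here and do not appear in the abstract), the reported speedup corresponds to an efficiency of roughly one half:

```latex
S = \frac{T_1}{T_p}, \qquad
E = \frac{S}{p} = \frac{68}{128} \approx 0.53
```

Here T_1 is the run time on one processor, T_p the run time on p processors, and E the fraction of ideal linear speedup achieved.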

Algorithms and Application of Sparse Matrix Assembly and Equation Solvers for Aeroacoustics

NASA Technical Reports Server (NTRS)

Watson, W. R.; Nguyen, D. T.; Reddy, C. J.; Vatsa, V. N.; Tang, W. H.

2001-01-01

An algorithm for symmetric sparse equation solutions on an unstructured grid is described. Efficient, sequential sparse algorithms for degree-of-freedom reordering, supernodes, symbolic/numerical factorization, and forward backward solution phases are reviewed. Three sparse algorithms for the generation and assembly of symmetric systems of matrix equations are presented. The accuracy and numerical performance of the sequential version of the sparse algorithms are evaluated over the frequency range of interest in a three-dimensional aeroacoustics application. Results show that the solver solutions are accurate using a discretization of 12 points per wavelength. Results also show that the first assembly algorithm is impractical for high-frequency noise calculations. The second and third assembly algorithms have nearly equal performance at low values of source frequencies, but at higher values of source frequencies the third algorithm saves CPU time and RAM. The CPU time and the RAM required by the second and third assembly algorithms are two orders of magnitude smaller than that required by the sparse equation solver. A sequential version of these sparse algorithms can, therefore, be conveniently incorporated into a substructuring for domain decomposition formulation to achieve parallel computation, where different substructures are handled by different parallel processors.
A Performance Comparison of the Parallel Preconditioners for Iterative Methods for Large Sparse Linear Systems Arising from Partial Differential Equations on Structured Grids

NASA Astrophysics Data System (ADS)

Ma, Sangback

In this paper we compare various parallel preconditioners such as Point-SSOR (Symmetric Successive OverRelaxation), ILU(0) (Incomplete LU) in the Wavefront ordering, ILU(0) in the Multi-color ordering, Multi-Color Block SOR (Successive OverRelaxation), SPAI (SParse Approximate Inverse) and pARMS (Parallel Algebraic Recursive Multilevel Solver) for solving large sparse linear systems arising from two-dimensional PDEs (Partial Differential Equations) on structured grids. Point-SSOR is well-known, and ILU(0) is one of the most popular preconditioners, but it is inherently serial. ILU(0) in the Wavefront ordering maximizes the parallelism in the natural order, but the lengths of the wave-fronts are often nonuniform. ILU(0) in the Multi-color ordering is a simple way of achieving a parallelism of the order N, where N is the order of the matrix, but its convergence rate often deteriorates as compared to that of natural ordering. We have chosen the Multi-Color Block SOR preconditioner combined with direct sparse matrix solver, since for the Laplacian matrix the SOR method is known to have a nondeteriorating rate of convergence when used with the Multi-Color ordering. By using block version we expect to minimize the interprocessor communications. SPAI computes the sparse approximate inverse directly by least squares method. Finally, ARMS is a preconditioner recursively exploiting the concept of independent sets and pARMS is the parallel version of ARMS. Experiments were conducted for the Finite Difference and Finite Element discretizations of five two-dimensional PDEs with large meshsizes up to a million on an IBM p595 machine with distributed memory. Our matrices are real positive, i.e., the real parts of their eigenvalues are positive. We have used GMRES(m) as our outer iterative method, so that the convergence of GMRES(m) for our test matrices is mathematically guaranteed. Interprocessor communications were done using MPI (Message Passing Interface) primitives.
The results show that in general ILU(0) in the Multi-Color ordering and ILU(0) in the Wavefront ordering outperform the other methods but for symmetric and nearly symmetric 5-point matrices Multi-Color Block SOR gives the best performance, except for a few cases with a small number of processors.
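The multi-color idea mentioned in this abstract can be illustrated with its simplest case, a two-color (red-black) ordering for the 5-point Laplacian. The sketch below is a plain point SOR solver written for illustration, not the Multi-Color Block SOR preconditioner evaluated in the paper; the function name, grid size, relaxation factor and boundary data are all assumptions. Points of one color depend only on points of the other color, so each half-sweep could be updated in parallel.

```python
def red_black_sor(u, omega=1.5, sweeps=200):
    """In-place red-black SOR for the 5-point Laplace equation on a square grid.

    Boundary values of u are held fixed; only interior points are updated.
    """
    n = len(u)
    for _ in range(sweeps):
        for color in (0, 1):  # 0 = "red" points, 1 = "black" points
            # All points of one color are mutually independent, so this
            # double loop is the part that could run in parallel.
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 == color:
                        avg = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1])
                        u[i][j] += omega * (avg - u[i][j])
    return u


n = 9
# Boundary data u = i + j (a discrete harmonic function); interior starts at 0.
u = [[float(i + j) if i in (0, n - 1) or j in (0, n - 1) else 0.0
      for j in range(n)] for i in range(n)]
red_black_sor(u)
max_error = max(abs(u[i][j] - (i + j)) for i in range(n) for j in range(n))
```

Because the boundary data are linear, the exact discrete solution is u[i][j] = i + j, which gives a simple check on convergence.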
DOVIS: an implementation for high-throughput virtual screening using AutoDock.

PubMed

Zhang, Shuxing; Kumar, Kamal; Jiang, Xiaohui; Wallqvist, Anders; Reifman, Jaques

2008-02-27

Molecular-docking-based virtual screening is an important tool in drug discovery that is used to significantly reduce the number of possible chemical compounds to be investigated. In addition to the selection of a sound docking strategy with appropriate scoring functions, another technical challenge is to in silico screen millions of compounds in a reasonable time. To meet this challenge, it is necessary to use high performance computing (HPC) platforms and techniques. However, the development of an integrated HPC system that makes efficient use of its elements is not trivial. We have developed an application termed DOVIS that uses AutoDock (version 3) as the docking engine and runs in parallel on a Linux cluster. DOVIS can efficiently dock large numbers (millions) of small molecules (ligands) to a receptor, screening 500 to 1,000 compounds per processor per day. Furthermore, in DOVIS, the docking session is fully integrated and automated in that the inputs are specified via a graphical user interface, the calculations are fully integrated with a Linux cluster queuing system for parallel processing, and the results can be visualized and queried. DOVIS removes most of the complexities and organizational problems associated with large-scale high-throughput virtual screening, and provides a convenient and efficient solution for AutoDock users to use this software in a Linux cluster platform.
Design and test of data acquisition systems for the Medipix2 chip based on PC standard interfaces

NASA Astrophysics Data System (ADS)

Fanti, Viviana; Marzeddu, Roberto; Piredda, Giuseppina; Randaccio, Paolo

2005-07-01

We describe two readout systems for hybrid detectors using the Medipix2 single photon counting chip, developed within the Medipix Collaboration. The Medipix2 chip (256×256 pixels, 55 μm pitch) has an active area of about 2 cm² and is bump-bonded to a pixel semiconductor array of silicon or other semiconductor material. The readout systems we are developing are based on two widespread standard PC interfaces: parallel port and USB (Universal Serial Bus) version 1.1. The parallel port is the simplest PC interface, even if slow, and the USB is a serial bus interface present nowadays on all PCs and offering good performance.
IMa2p - Parallel MCMC and inference of ancient demography under the Isolation with Migration (IM) model

PubMed Central

Sethuraman, Arun; Hey, Jody

2015-01-01

IMa2 and related programs are used to study the divergence of closely related species and of populations within species. These methods are based on the sampling of genealogies using MCMC, and they can proceed quite slowly for larger data sets. We describe a parallel implementation, called IMa2p, that provides a nearly linear increase in genealogy sampling rate with the number of processors in use. IMa2p is written in OpenMPI and C++, and scales well for demographic analyses of a large number of loci and populations, which are difficult to study using the serial version of the program. PMID:26059786
The development of a revised version of multi-center molecular Ornstein-Zernike equation

NASA Astrophysics Data System (ADS)

Kido, Kentaro; Yokogawa, Daisuke; Sato, Hirofumi

2012-04-01

Ornstein-Zernike (OZ)-type theory is a powerful tool to obtain the 3-dimensional solvent distribution around a solute molecule. Recently, we proposed the multi-center molecular OZ method, which is suitable for parallel computing of 3D solvation structure. The distribution function in this method consists of two components, namely reference and residue parts. Several types of the function were examined as the reference part to investigate the numerical robustness of the method. As the benchmark, the method is applied to water, benzene in aqueous solution and single-walled carbon nanotube in chloroform solution. The results indicate that full parallelization is achieved by utilizing the newly proposed reference functions.
A Factor Analytic Investigation of the Person-in-Recovery and Provider Versions of the Revised Recovery Self-Assessment (RSA-R).

PubMed

Konkolÿ Thege, Barna; Ham, Elke; Ball, Laura C

2017-12-01

Recovery is understood as living a life with hope, purpose, autonomy, productivity, and community engagement despite a mental illness. The aim of this study was to provide further information on the psychometric properties of the Person-in-Recovery and Provider versions of the Revised Recovery Self-Assessment (RSA-R), a widely used measure of recovery orientation. Data from 654 individuals were analyzed, 519 of whom were treatment providers (63.6% female), while 135 were inpatients (10.4% female) of a Canadian tertiary-level psychiatric hospital. Confirmatory and exploratory techniques were used to investigate the factor structure of both versions of the instrument. Results of the confirmatory factor analyses showed that none of the four theoretically plausible models fit the data well. Principal component analyses could not replicate the structure obtained by the scale developers either and instead resulted in a five-component solution for the Provider and a four-component solution for the Person-in-Recovery version. When considering the results of a parallel analysis, the number of components to retain dropped to two for the Provider version and one for the Person-in-Recovery version. We can conclude that the RSA-R requires further revision to become a psychometrically sound instrument for assessing recovery-oriented practices in an inpatient mental health-care setting.
SCELib3.0: The new revision of SCELib, the parallel computational library of molecular properties in the Single Center Approach

NASA Astrophysics Data System (ADS)

Sanna, N.; Baccarelli, I.; Morelli, G.

2009-12-01

SCELib is a computer program which implements the Single Center Expansion (SCE) method to describe molecular electronic densities and the interaction potentials between a charged projectile (electron or positron) and a target molecular system. The first version (CPC Catalog identifier ADMG_v1_0) was submitted to the CPC Program Library in 2000, and version 2.0 (ADMG_v2_0) was submitted in 2004. We here announce the new release 3.0 which presents additional features with respect to the previous versions aiming at a significant enhancement of its capabilities to deal with larger molecular systems. SCELib 3.0 allows for ab initio effective core potential (ECP) calculations of the molecular wavefunctions to be used in the SCE method in addition to the standard all-electron description of the molecule. The list of supported architectures has been updated and the code has been ported to platforms based on accelerating coprocessors, such as the NVIDIA GPGPU and the new parallel model adopted is able to efficiently run on a mixed many-core computing system. Program summary. Program title: SCELib3.0 Catalogue identifier: ADMG_v3_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADMG_v3_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 2 018 862 No. of bytes in distributed program, including test data, etc.: 4 955 014 Distribution format: tar.gz Programming language: C Compilers used: xlc V8.x, Intel C V10.x, Portland Group V7.x, nvcc V2.x Computer: All SMP platforms based on AIX, Linux and SUNOS operating systems over SPARC, POWER, Intel Itanium2, X86, em64t and Opteron processors Operating system: SUNOS, IBM AIX, Linux RedHat (Enterprise), Linux SuSE (SLES) Has the code been vectorized or parallelized?: Yes. 
1 to 32 (CPU or GPU) used RAM: Up to 32 GB depending on the molecular system and runtime parameters Classification: 16.5 Catalogue identifier of previous version: ADMG_v2_0 Journal reference of previous version: Comput. Phys. Comm. 162 (2004) 51 External routines: CUDA libraries (SDK V2.x). Does the new version supersede the previous version?: Yes Nature of problem: In this set of codes an efficient procedure is implemented to describe the wavefunction and related molecular properties of a polyatomic molecular system within the Single Center of Expansion (SCE) approximation. The resulting SCE wavefunction, electron density, electrostatic and correlation/polarization potentials can then be used in a wide variety of applications, such as electron-molecule scattering calculations, quantum chemistry studies, biomodelling and drug design. Solution method: The polycentre Hartree-Fock solution for a molecule of arbitrary geometry, based on linear combination of Gaussian-Type Orbital (GTO), is expanded over a single center, typically the Center Of Mass (C.O.M.), by means of a Gauss Legendre/Chebyschev quadrature over the θ,φ angular coordinates. The resulting SCE numerical wavefunction is then used to calculate the one-particle electron density, the electrostatic potential and two different models for the correlation/polarization potentials induced by the impinging electron, which have the correct asymptotic behavior for the leading dipole molecular polarizabilities. Reasons for new version: The present release of SCELib allows the study of larger molecular systems with respect to the previous versions by means of theoretical and technological advances, with the first implementation of the code over a many-core computing system. 
Summary of revisions: The major features added with respect to SCELib Version 2.0 are: molecular wavefunctions obtained via the Los Alamos (Hay and Wadt) LAN ECP plus DZ description of the inner-shell electrons (on Na-La, Hf-Bi elements) [1] can now be single-center-expanded; the addition required modifications of: (i) the filtering code readgau, (ii) the main reading function setinp, (iii) the sphint code (including changes to the CalcMO code), (iv) the densty code, (v) the vst code; the classes of platforms supported now include two more architectures based on accelerated coprocessors (Nvidia GSeries GPGPU and ClearSpeed e720 (ClearSpeed version, experimental; initial preliminary porting of the sphint() function not for production runs - see the code documentation for additional detail)). A single-precision representation for real numbers in the SCE mapping of the GTOs (sphint code) has been implemented into the new code; the Ih symmetry point group for the molecular systems has been added to those already allowed in the SCE procedure; the orientation of the molecular axis system for the Cs (planar) symmetry has been changed in accord with the standard orientation adopted by the latest version of the quantum chemistry code (Gaussian C03 [2]), which is used to generate the input multi-centre molecular wavefunctions (z-axis perpendicular to the symmetry plane); the abelian subgroup for the Cs point group has been changed from C1 to Cs; atomic basis functions including g-type GTOs can now be single-center-expanded. Restrictions: Depending on the molecular system under study and on the operating conditions the program may or may not fit into available RAM memory. In this case a feature of the program is to memory map a disk file in order to efficiently access the memory data through a disk device. The parallel GP-GPU implementation limits the number of CPU threads to the number of GPU cores present. 
Running time: The execution time strongly depends on the molecular target description and on the hardware/OS chosen; it is directly proportional to the (r,θ,φ) grid size and to the number of angular basis functions used. Thus, from the program printout of the main arrays memory occupancy, the user can approximately derive the expected computer time needed for a given calculation executed in serial mode. For parallel executions the overall efficiency must be further taken into account, and this depends on the no. of processors used as well as on the parallel architecture chosen, so a simple general law is at present not determinable. References: [1] P.J. Hay, W.R. Wadt, J. Chem. Phys. 82 (1985) 270; W.R. Wadt, P.J. Hay, J. Chem. Phys. 284 (1985); P.J. Hay, W.R. Wadt, J. Chem. Phys. 299 (1985). [2] M.J. Frisch et al., Gaussian 03, revision C.02, Gaussian, Inc., Wallingford, CT, 2004.
Parallelized CCHE2D flow model with CUDA Fortran on Graphics Process Units

USDA-ARS's Scientific Manuscript database

This paper presents the CCHE2D implicit flow model parallelized using CUDA Fortran programming technique on Graphics Processing Units (GPUs). A parallelized implicit Alternating Direction Implicit (ADI) solver using Parallel Cyclic Reduction (PCR) algorithm on GPU is developed and tested. This solve...
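For readers unfamiliar with Parallel Cyclic Reduction, the following is a minimal serial Python sketch of the textbook algorithm for a tridiagonal system. It is not the CCHE2D CUDA Fortran solver, and the function name `pcr_solve` and the test system are assumptions made for illustration. At every step each equation is combined with its neighbors at distance `stride`, and all equations can be updated independently, which is what makes the method attractive on GPUs.

```python
def pcr_solve(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by parallel cyclic reduction.

    a[0] and c[-1] must be 0. Returns the solution as a list of floats.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        new_a, new_b, new_c, new_d = [0.0] * n, list(b), [0.0] * n, list(d)
        # Each i reads only the previous arrays, so all n updates are
        # independent (on a GPU this would be one thread per equation).
        for i in range(n):
            lo, hi = i - stride, i + stride
            if lo >= 0:
                alpha = -a[i] / b[lo]   # eliminates x[lo] from equation i
                new_a[i] = alpha * a[lo]
                new_b[i] += alpha * c[lo]
                new_d[i] += alpha * d[lo]
            if hi < n:
                gamma = -c[i] / b[hi]   # eliminates x[hi] from equation i
                new_c[i] = gamma * c[hi]
                new_b[i] += gamma * a[hi]
                new_d[i] += gamma * d[hi]
        a, b, c, d = new_a, new_b, new_c, new_d
        stride *= 2
    # Every equation is now decoupled: b[i] * x[i] = d[i].
    return [d[i] / b[i] for i in range(n)]


# Diagonally dominant test system with a known solution (assumed data).
x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
a = [0.0, 1.0, 1.0, 1.0, 1.0]
b = [4.0] * 5
c = [1.0, 1.0, 1.0, 1.0, 0.0]
d = [a[i] * (x_true[i - 1] if i > 0 else 0.0) + b[i] * x_true[i]
     + c[i] * (x_true[i + 1] if i < 4 else 0.0) for i in range(5)]
x = pcr_solve(a, b, c, d)
```

The number of reduction steps grows as the base-2 logarithm of the system size, at the cost of more total arithmetic than the serial Thomas algorithm.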
Massively parallel information processing systems for space applications

NASA Technical Reports Server (NTRS)

Schaefer, D. H.

1979-01-01

NASA is developing massively parallel systems for ultra high speed processing of digital image data collected by satellite borne instrumentation. Such systems contain thousands of processing elements. Work is underway on the design and fabrication of the 'Massively Parallel Processor', a ground computer containing 16,384 processing elements arranged in a 128 x 128 array. This computer uses existing technology. Advanced work includes the development of semiconductor chips containing thousands of feedthrough paths. Massively parallel image analog to digital conversion technology is also being developed. The goal is to provide compact computers suitable for real-time onboard processing of images.
Parallel log structured file system collective buffering to achieve a compact representation of scientific and/or dimensional data

DOEpatents

Grider, Gary A.; Poole, Stephen W.

2015-09-01

Collective buffering and data pattern solutions are provided for storage, retrieval, and/or analysis of data in a collective parallel processing environment. For example, a method can be provided for data storage in a collective parallel processing environment. The method comprises receiving data to be written for a plurality of collective processes within a collective parallel processing environment, extracting a data pattern for the data to be written for the plurality of collective processes, generating a representation describing the data pattern, and saving the data and the representation.
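One way to picture "extracting a data pattern" and "generating a representation" is a regular strided write, where every process writes a block of the same size at an evenly spaced offset. The sketch below is a hypothetical illustration of that single case, not the patented method; the function names and the dictionary layout are assumptions.

```python
def extract_pattern(writes):
    """Return a compact description of evenly strided, equal-sized writes.

    writes is a list of (offset, length) pairs in process order. If they do
    not follow one regular stride, None is returned and the caller would
    keep the full list instead.
    """
    if len(writes) < 2:
        return None
    start, length = writes[0]
    stride = writes[1][0] - start
    for k, (offset, size) in enumerate(writes):
        if size != length or offset != start + k * stride:
            return None
    return {"start": start, "stride": stride, "length": length,
            "count": len(writes)}


def expand_pattern(pattern):
    """Rebuild the full (offset, length) list from the compact description."""
    return [(pattern["start"] + k * pattern["stride"], pattern["length"])
            for k in range(pattern["count"])]


# Eight processes, each writing 1024 bytes at a 4096-byte stride (assumed data).
writes = [(rank * 4096, 1024) for rank in range(8)]
pattern = extract_pattern(writes)
irregular = extract_pattern([(0, 10), (10, 10), (35, 10)])
```

Four integers stand in for the whole list here, which is the kind of compact representation the title refers to.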
schwimmbad: A uniform interface to parallel processing pools in Python

NASA Astrophysics Data System (ADS)

Price-Whelan, Adrian M.; Foreman-Mackey, Daniel

2017-09-01

Many scientific and computing problems require doing some calculation on all elements of some data set. If the calculations can be executed in parallel (i.e. without any communication between calculations), these problems are said to be perfectly parallel. On computers with multiple processing cores, these tasks can be distributed and executed in parallel to greatly improve performance. A common paradigm for handling these distributed computing problems is to use a processing "pool": the "tasks" (the data) are passed in bulk to the pool, and the pool handles distributing the tasks to a number of worker processes when available. schwimmbad provides a uniform interface to parallel processing pools and enables switching easily between local development (e.g., serial processing or with multiprocessing) and deployment on a cluster or supercomputer (via, e.g., MPI or JobLib).
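The pool paradigm described above can be sketched with the Python standard library alone. The classes `LocalSerialPool` and `LocalThreadPool` below are hypothetical stand-ins written for this illustration; they are not schwimmbad's own classes. The point is only that user code calls `pool.map(...)` and does not need to know which kind of pool it was given.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalSerialPool:
    """Runs every task in the calling thread (handy for debugging)."""

    def map(self, func, tasks):
        return [func(task) for task in tasks]


class LocalThreadPool:
    """Distributes tasks over a fixed number of worker threads."""

    def __init__(self, workers=4):
        self.workers = workers

    def map(self, func, tasks):
        with ThreadPoolExecutor(max_workers=self.workers) as executor:
            return list(executor.map(func, tasks))


def square(x):
    """A perfectly parallel task: no communication with other tasks."""
    return x * x


def run(pool, tasks):
    # The analysis code is identical regardless of the pool type.
    return pool.map(square, tasks)


tasks = list(range(10))
serial_result = run(LocalSerialPool(), tasks)
threaded_result = run(LocalThreadPool(workers=3), tasks)
```

Swapping the pool object is the only change needed to move between the two execution modes, which mirrors the switch between local development and cluster deployment that the abstract describes.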
Survey of computer vision technology for UVA navigation

NASA Astrophysics Data System (ADS)

Xie, Bo; Fan, Xiang; Li, Sijian

2017-11-01

Navigation based on computer vision technology, which offers strong independence, high precision and low susceptibility to electrical interference, has attracted more and more attention in the field of UAV navigation research. Early navigation projects based on computer vision technology were mainly applied to autonomous ground robots. In recent years, visual navigation systems have been widely applied to unmanned machines, deep space detectors and underwater robots. This has further stimulated research on integrated navigation algorithms based on computer vision technology. In China, with the development of many types of UAV and, after two lunar explorations, the start of the third phase of that project, there has been significant progress in the study of visual navigation. The paper reviews the development of navigation based on computer vision technology in the field of UAV navigation research and concludes that visual navigation is mainly applied to the following three aspects. (1) Acquisition of UAV navigation parameters. The parameters, including UAV attitude, position and velocity, can be obtained from the relationship between the sensor images and the carrier's attitude, the relationship between instantly matched images and the reference images, and the relationship between the carrier's velocity and the characteristics of sequential images. (2) Autonomous obstacle avoidance. There are many ways to achieve obstacle avoidance in UAV navigation. The methods based on computer vision technology, including feature matching, template matching, image frames and so on, are mainly introduced. (3) Target tracking and positioning. Using the obtained images, the UAV position is calculated with the optical flow method, the MeanShift algorithm, the CamShift algorithm, Kalman filtering and the particle filter algorithm. The paper also describes three kinds of mainstream visual system. (1) High speed visual system. 
It uses a parallel structure, with which image detection and processing are carried out at high speed. The system is applied to rapid response systems. (2) The visual system of a distributed network. Several discrete image data acquisition sensors in different locations transmit image data to the node processor to increase the sampling rate. (3) The visual system combined with an observer. The system combines image sensors with external observers to make up for the lack of visual equipment. To some degree, these systems overcome shortcomings of early visual systems, including low frequency, low processing efficiency and strong noise. In the end, the difficulties of navigation based on computer vision technology in practical application are briefly discussed. (1) Due to the huge workload of image operations, the real-time performance of the system is poor. (2) Due to the large environmental impact, the anti-interference ability of the system is poor. (3) Because it can work only in particular environments, the system has poor adaptability.
Parallel Signal Processing and System Simulation using aCe

NASA Technical Reports Server (NTRS)

Dorband, John E.; Aburdene, Maurice F.

2003-01-01

Recently, networked and cluster computation have become very popular for both signal processing and system simulation. A new language is ideally suited for parallel signal processing applications and system simulation since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, the new C based parallel language (ace C) for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures by providing them with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we will focus on some fundamental features of ace C and present a signal processing application (FFT).
Parallel processing in finite element structural analysis

NASA Technical Reports Server (NTRS)

Noor, Ahmed K.

1987-01-01

A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on parallel numerical algorithms, performance evaluation of machines and algorithms, and parallelism in finite element computations. A computational strategy is proposed for maximizing the degree of parallelism at different levels of the finite element analysis process including: 1) formulation level (through the use of mixed finite element models); 2) analysis level (through additive decomposition of the different arrays in the governing equations into the contributions to a symmetrized response plus correction terms); 3) numerical algorithm level (through the use of operator splitting techniques and application of iterative processes); and 4) implementation level (through the effective combination of vectorization, multitasking and microtasking, whenever available).
Connectionism, parallel constraint satisfaction processes, and gestalt principles: (re) introducing cognitive dynamics to social psychology.

PubMed

Read, S J; Vanman, E J; Miller, L C

1997-01-01

We argue that recent work in connectionist modeling, in particular the parallel constraint satisfaction processes that are central to many of these models, has great importance for understanding issues of both historical and current concern for social psychologists. We first provide a brief description of connectionist modeling, with particular emphasis on parallel constraint satisfaction processes. Second, we examine the tremendous similarities between parallel constraint satisfaction processes and the Gestalt principles that were the foundation for much of modern social psychology. We propose that parallel constraint satisfaction processes provide a computational implementation of the principles of Gestalt psychology that were central to the work of such seminal social psychologists as Asch, Festinger, Heider, and Lewin. Third, we then describe how parallel constraint satisfaction processes have been applied to three areas that were key to the beginnings of modern social psychology and remain central today: impression formation and causal reasoning, cognitive consistency (balance and cognitive dissonance), and goal-directed behavior. We conclude by discussing implications of parallel constraint satisfaction principles for a number of broader issues in social psychology, such as the dynamics of social thought and the integration of social information within the narrow time frame of social interaction.
Using Parallel Processing for Problem Solving.

DTIC Science & Technology

1979-12-01

are the basic parallel processing primitive. Different goals of the system can be pursued in parallel by placing them in separate activities... Language primitives are provided for manipulating running activities. Viewpoints are a generalization of context...
Reading Pictures for Story Comprehension Requires Mental Imagery Skills

PubMed Central

Boerma, Inouk E.; Mol, Suzanne E.; Jolles, Jelle

2016-01-01

We examined the role of mental imagery skills on story comprehension in 150 fifth graders (10- to 12-year-olds), when reading a narrative book chapter with alternating words and pictures (i.e., text blocks were alternated by one- or two-page picture spreads). A parallel group design was used, in which we compared our experimental book version, in which pictures were used to replace parts of the corresponding text, to two control versions, i.e., a text-only version and a version with the full story text and all pictures. Analyses showed an interaction between mental imagery and book version: children with higher mental imagery skills outperformed children with lower mental imagery skills on story comprehension after reading the experimental narrative. This was not the case for both control conditions. This suggests that children’s mental imagery skills significantly contributed to the mental representation of the story that they created, by successfully integrating information from both words and pictures. The results emphasize the importance of mental imagery skills for explaining individual variability in reading development. Implications for educational practice are that we should find effective ways to instruct children how to “read” pictures and how to develop and use their mental imagery skills. This will probably contribute to their mental models and therefore their story comprehension. PMID:27822194
Real-time implementations of image segmentation algorithms on shared memory multicore architecture: a survey (Conference Presentation)

NASA Astrophysics Data System (ADS)

Akil, Mohamed

2017-05-01

Real-time processing is becoming increasingly important in many image processing applications. Image segmentation is one of the most fundamental tasks in image analysis, and as a consequence many different approaches to it have been proposed. The watershed transform is a well-known image segmentation tool, but it is a very data-intensive task. To accelerate watershed algorithms and achieve real-time processing, parallel architectures and programming models for multicore computing have been developed. This paper surveys approaches to the parallel implementation of sequential watershed algorithms on multicore general-purpose CPUs: homogeneous multicore processors with shared memory. To achieve an efficient parallel implementation, it is necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we compare various parallelizations of sequential watershed algorithms on shared memory multicore architecture. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare OpenMP (an application programming interface for multiprocessing) with Pthreads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
MOPITT V4 processing error

Atmospheric Science Data Center

2013-08-06

... error Version 4 The MOPITT/NCAR team discovered a data processing error that affects all V4 products previously ... products are available on News and Status on the MOPITT team web site . Data Product New V4 Product Version(s) ...

Image Processing Using a Parallel Architecture.

DTIC Science & Technology

1987-12-01

ENG/87D-25 Abstract This study developed a set of low level image processing tools on a parallel computer that allows concurrent processing of images...environment, the set of tools offers a significant reduction in the time required to perform some commonly used image processing operations...step toward developing these systems, a structured set of image processing tools was implemented using a parallel computer. More important than
Logarithmic Superdiffusion in Two Dimensional Driven Lattice Gases

NASA Astrophysics Data System (ADS)

Krug, J.; Neiss, R. A.; Schadschneider, A.; Schmidt, J.

2018-03-01

The spreading of density fluctuations in two-dimensional driven diffusive systems is marginally anomalous. Mode coupling theory predicts that the diffusivity in the direction of the drive diverges with time as (ln t)^{2/3} with a prefactor depending on the macroscopic current-density relation and the diffusion tensor of the fluctuating hydrodynamic field equation. Here we present the first numerical verification of this behavior for a particular version of the two-dimensional asymmetric exclusion process. Particles jump strictly asymmetrically along one of the lattice directions and symmetrically along the other, and an anisotropy parameter p governs the ratio between the two rates. Using a novel massively parallel coupling algorithm that strongly reduces the fluctuations in the numerical estimate of the two-point correlation function, we are able to accurately determine the exponent of the logarithmic correction. In addition, the variation of the prefactor with p provides a stringent test of mode coupling theory.
MapReduce implementation of a hybrid spectral library-database search method for large-scale peptide identification.

PubMed

Kalyanaraman, Ananth; Cannon, William R; Latt, Benjamin; Baxter, Douglas J

2011-11-01

A MapReduce-based implementation called MR-MSPolygraph for parallelizing peptide identification from mass spectrometry data is presented. The underlying serial method, MSPolygraph, uses a novel hybrid approach to match an experimental spectrum against a combination of a protein sequence database and a spectral library. Our MapReduce implementation can run on any Hadoop cluster environment. Experimental results demonstrate that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours, for processing tens of thousands of experimental spectra. Speedup and other related performance studies are also reported on a 400-core Hadoop cluster using spectral datasets from environmental microbial communities as inputs. The source code along with user documentation are available on http://compbio.eecs.wsu.edu/MR-MSPolygraph. ananth@eecs.wsu.edu; william.cannon@pnnl.gov. Supplementary data are available at Bioinformatics online.
Celiac disease: progress towards diagnosis and definition of pathogenic mechanisms.

PubMed

Rossi, Mauro; Bot, Adrian

2011-08-01

The current issue of the International Reviews of Immunology is dedicated entirely to Celiac Disease (CD). Recent development of additional biomarkers and diagnostics resulted in a sharp revision of the prevalence of this condition, with a previously unrecognized subclinical occurrence in the adult population. This was paralleled by groundbreaking progress in understanding its molecular pathogenesis: while gluten-derived peptides activate the innate immunity, post-translationally modified gluten elicits an adaptive immunity. These arms amplify each other, resulting in a self-perpetuating autoimmune condition, influenced by disturbances of the gut flora and mucus chemistry. The process evolves dramatically in a subset of patients with vulnerable immune homeostasis (e.g., Treg cells), explaining the progressive, aggravating syndrome in the clinically overt version of CD. In-depth understanding of the pathogenesis of CD thus creates the premises for developing novel, more accurate animal models that should support a rational development of new prophylactic and therapeutic interventions.
SDAV Viz July Progress Update: LANL

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sewell, Christopher Meyer

2012-07-30

SDAV Viz July Progress Update: (1) VPIC (Vector Particle in Cell) Kinetic Plasma Simulation Code - (a) Implemented first version of an in-situ adapter based on Paraview CoProcessing Library, (b) Three pipelines: vtkDataSetMapper, vtkContourFilter, vtkPistonContour, (c) Next, resolve issue at boundaries of processor domains; add more advanced viz/analysis pipelines; (2) Halo finding/merger trees - (a) Summer student Wathsala W. from University of Utah is working on data-parallel halo finder algorithm using PISTON, (b) Timo Bremer (LLNL), Valerio Pascucci (Utah), George Zagaris (Kitware), and LANL people are interested in using merger trees for tracking the evolution of halos in cosmo simulations; discussed possible overlap with work by Salman Habib and Katrin Heitmann (Argonne) during their visit to LANL 7/11; (3) PISTON integration in ParaView - Now available from ParaView github.
Indoor air quality analysis based on Hadoop

NASA Astrophysics Data System (ADS)

Tuo, Wang; Yunhua, Sun; Song, Tian; Liang, Yu; Weihong, Cui

2014-03-01

The air of the office environment is our research object. Data on temperature, humidity, and concentrations of carbon dioxide, carbon monoxide, and ammonia are collected every one to eight seconds by the sensor monitoring system, and all the data are stored in the HBase database of the Hadoop platform. Using HBase's column-oriented storage and versioning (a time column is added automatically), time-series data sets are built on the primary key (row key) and timestamp. The parallel computing programming model MapReduce is used to process the millions of records collected by the sensors. By analysing the trend of the parameter values at different times of the same day and at the same time on different dates, the impact of human and other factors on the room microenvironment is assessed according to the movement of the office staff. Finally, an effective way to improve indoor air quality is proposed.
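The MapReduce analysis described in this abstract (comparing parameter values by time of day across dates) can be illustrated with a small in-memory sketch. This is only an illustration of the map, shuffle, and reduce steps in plain Python, not the authors' Hadoop job; the record layout, the sample values, and the function names are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sensor records: (ISO timestamp, parameter name, value).
READINGS = [
    ("2013-05-06T09:00:02", "co2", 600.0),
    ("2013-05-06T09:00:07", "co2", 620.0),
    ("2013-05-06T14:30:00", "co2", 900.0),
    ("2013-05-07T09:15:00", "co2", 640.0),
    ("2013-05-07T09:15:00", "temperature", 22.5),
]


def map_reading(record):
    """Map step: emit ((parameter, hour of day), value) for one reading."""
    timestamp, parameter, value = record
    hour = datetime.fromisoformat(timestamp).hour
    return (parameter, hour), value


def shuffle(pairs):
    """Group mapped values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_mean(values):
    """Reduce step: average all values that share a key."""
    return sum(values) / len(values)


def hourly_means(records):
    """Mean value per (parameter, hour of day), pooled over all dates."""
    groups = shuffle(map_reading(r) for r in records)
    return {key: reduce_mean(values) for key, values in groups.items()}
```

Calling `hourly_means(READINGS)` pools the 09:00 readings from both days under the key `("co2", 9)`, which is the "same time on different dates" comparison the abstract mentions.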
Cellular automaton model for molecular traffic jams

NASA Astrophysics Data System (ADS)

Belitsky, V.; Schütz, G. M.

2011-07-01

We consider the time evolution of an exactly solvable cellular automaton with random initial conditions both in the large-scale hydrodynamic limit and on the microscopic level. This model is a version of the totally asymmetric simple exclusion process with sublattice parallel update and thus may serve as a model for studying traffic jams in systems of self-driven particles. We study the emergence of shocks from the microscopic dynamics of the model. In particular, we introduce shock measures whose time evolution we can compute explicitly, both in the thermodynamic limit and for open boundaries where a boundary-induced phase transition driven by the motion of a shock occurs. The motion of the shock, which results from the collective dynamics of the exclusion particles, is a random walk with an internal degree of freedom that determines the jump direction. This type of hopping dynamics is reminiscent of some transport phenomena in biological systems.
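A sublattice parallel update of the kind named in this abstract can be sketched as follows. This is a generic illustration, not the exactly solvable model of the paper: the ring (periodic) boundary, the even lattice size, and the hop probability `p` are assumptions made here, and the function names are hypothetical.

```python
import random


def half_step(lattice, start, p, rng):
    """Update the disjoint site pairs (i, i+1) for i = start, start+2, ...

    A particle (1) at site i hops to site i+1 with probability p if that
    site is empty (0). The lattice is treated as a ring of even length,
    so the pairs in one half-step never overlap.
    """
    n = len(lattice)
    new = list(lattice)
    for i in range(start, n, 2):
        j = (i + 1) % n
        if lattice[i] == 1 and lattice[j] == 0 and rng.random() < p:
            new[i], new[j] = 0, 1
    return new


def full_step(lattice, p=1.0, rng=random):
    """One time step: update the even-start pairs, then the odd-start pairs."""
    if len(lattice) % 2:
        raise ValueError("this sketch assumes an even number of sites")
    return half_step(half_step(lattice, 0, p, rng), 1, p, rng)
```

With `p = 1.0` a free particle advances two sites per full step, because it is moved once in each half-step, while a particle directly behind another must wait until the site ahead is vacated. That blocking is the basic mechanism behind jams in exclusion processes.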
A Parallel Genetic Algorithm for Automated Electronic Circuit Design

NASA Technical Reports Server (NTRS)

Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris

2000-01-01

Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office Ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that is guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and 0s, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.). 
Because of dependency issues in the GA, it is possible to have idle processors. However, as long as the load at each processing node is similar, the processors are kept busy nearly all of the time. In applying GAs to circuit design, a suitable genetic representation is that of a circuit-construction program. We discuss one such circuit-construction programming language and show how evolution can generate useful analog circuit designs. This language has the desirable property that virtually all sets of combinations of primitives result in valid circuit graphs. Our system allows circuit size (number of devices), circuit topology, and device values to be evolved. Using a parallel genetic algorithm and circuit simulation software, we present experimental results as applied to three analog filter and two amplifier design tasks. For example, a figure shows an 85 dB amplifier design evolved by our system, and another figure shows the performance of that circuit (gain and frequency response). In all tasks, our system is able to generate circuits that achieve the target specifications.
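The processor farm scheme described in this abstract, in which worker nodes only score candidates while the master applies the genetic operators, can be sketched in a few lines. This is a toy illustration rather than the authors' system: the bit-counting fitness stands in for a circuit simulation, a thread pool stands in for cluster nodes, and truncation selection with elitism replaces the probabilistic parent selection the abstract mentions. All names and parameter values are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(bits):
    """Toy fitness (number of 1s); a stand-in for a costly circuit simulation."""
    return sum(bits)


def evolve(pop_size=20, length=16, generations=30, workers=4, seed=0):
    """Run a small GA in which a worker pool does nothing but score candidates."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Farm out fitness evaluation; results come back in input order.
            scores = list(pool.map(fitness, population))
            # Everything below runs on the master only.
            ranked = [ind for _, ind in sorted(zip(scores, population),
                                               key=lambda pair: pair[0],
                                               reverse=True)]
            parents = ranked[: pop_size // 2]
            children = [ranked[0]]  # elitism: carry the best candidate over
            while len(children) < pop_size:
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, length)  # one-point crossover
                child = a[:cut] + b[cut:]
                if rng.random() < 0.2:
                    child[rng.randrange(length)] ^= 1  # flip one random bit
                children.append(child)
            population = children
        scores = list(pool.map(fitness, population))
    best = max(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]
```

Because the master waits for every score before breeding the next generation, workers sit idle whenever evaluation times differ, which is the dependency issue the abstract notes. For CPU-bound fitness functions in Python, a process pool or MPI workers would be the more realistic choice than threads.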
ParaBTM: A Parallel Processing Framework for Biomedical Text Mining on Supercomputers.

PubMed

Xing, Yuting; Wu, Chengkun; Yang, Xi; Wang, Wei; Zhu, En; Yin, Jianping

2018-04-27

A prevailing way of extracting valuable information from biomedical literature is to apply text mining methods on unstructured texts. However, the massive amount of literature that needs to be analyzed poses a big data challenge to the processing efficiency of text mining. In this paper, we address this challenge by introducing parallel processing on a supercomputer. We developed paraBTM, a runnable framework that enables parallel text mining on the Tianhe-2 supercomputer. It employs a low-cost yet effective load balancing strategy to maximize the efficiency of parallel processing. We evaluated the performance of paraBTM on several datasets, utilizing three types of named entity recognition tasks as demonstration. Results show that, in most cases, the processing efficiency can be greatly improved with parallel processing, and the proposed load balancing strategy is simple and effective. In addition, our framework can be readily applied to other tasks of biomedical text mining besides NER.
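The abstract does not say which load balancing strategy paraBTM uses. As an illustration of what a low-cost strategy can look like, the sketch below applies a common greedy heuristic: sort tasks by estimated cost, largest first, and give each one to the currently least-loaded worker. Treating document size as the cost estimate, and every name in the code, are assumptions made for this example.

```python
import heapq


def balance(task_costs, n_workers):
    """Greedy assignment: largest task first, always to the least-loaded worker.

    task_costs maps a task name to an estimated cost (for example, document
    size). Returns a dict mapping worker id to its list of task names.
    """
    heap = [(0, worker) for worker in range(n_workers)]  # (load, worker id)
    assignment = {worker: [] for worker in range(n_workers)}
    ordered = sorted(task_costs.items(), key=lambda item: item[1], reverse=True)
    for task, cost in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + cost, worker))
    return assignment


def loads(task_costs, assignment):
    """Total estimated cost per worker for a given assignment."""
    return {worker: sum(task_costs[task] for task in tasks)
            for worker, tasks in assignment.items()}
```

The heuristic needs only one sort and one heap operation per task, so its overhead is small compared with the text mining work itself.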
Design of a dataway processor for a parallel image signal processing system

NASA Astrophysics Data System (ADS)

Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu

1995-04-01

Recently, demands for high-speed signal processing have been increasing, especially in the fields of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called 'dataway processor' designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates at 8-bit parallel in a full duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel. Therefore, sufficient throughput is available for high-speed digital video signals. The processor is designed in a top-down fashion using a CAD system called 'PARTHENON.' The hardware is fabricated using 0.5-micrometer CMOS technology, and its size is about 200 K gates.
TIMEDELN: A programme for the detection and parametrization of overlapping resonances using the time-delay method

NASA Astrophysics Data System (ADS)

Little, Duncan A.; Tennyson, Jonathan; Plummer, Martin; Noble, Clifford J.; Sunderland, Andrew G.

2017-06-01

TIMEDELN implements the time-delay method of determining resonance parameters from the characteristic Lorentzian form displayed by the largest eigenvalues of the time-delay matrix. TIMEDELN constructs the time-delay matrix from input K-matrices and analyses its eigenvalues. This new version implements multi-resonance fitting and may be run serially or as a high performance parallel code with three levels of parallelism. TIMEDELN takes K-matrices from a scattering calculation, either read from a file or calculated on a dynamically adjusted grid, and calculates the time-delay matrix. This is then diagonalized, with the largest eigenvalue representing the longest time-delay experienced by the scattering particle. A resonance shows up as a characteristic Lorentzian form in the time-delay: the programme searches the time-delay eigenvalues for maxima and traces resonances when they pass through different eigenvalues, separating overlapping resonances. It also performs the fitting of the calculated data to the Lorentzian form and outputs resonance positions and widths. Any remaining overlapping resonances can be fitted jointly. The branching ratios of decay into the open channels can also be found. The programme may be run serially or in parallel with three levels of parallelism. The parallel code modules are abstracted from the main physics code and can be used independently.
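The Lorentzian signature this abstract refers to can be seen in a one-channel toy case. The sketch below is not TIMEDELN's algorithm: it assumes a single channel, units with hbar = 1, a hypothetical isolated resonance at energy 1.0 with width 0.1, and it estimates the width from the peak height alone instead of fitting. For one channel the time delay is Q = -i S* dS/dE, with S = (1 + iK)/(1 - iK).

```python
def s_matrix(energy, e_res=1.0, width=0.1):
    """One-channel S-matrix for a hypothetical isolated resonance.

    Equal to (1 + iK) / (1 - iK) with K = (width / 2) / (e_res - energy),
    written in a form that stays finite at the pole of K.
    """
    x = e_res - energy
    return complex(x, width / 2.0) / complex(x, -width / 2.0)


def time_delay(energy, h=1e-6, **params):
    """Time delay Q = -i S* dS/dE (hbar = 1), using a central difference."""
    ds = (s_matrix(energy + h, **params) - s_matrix(energy - h, **params)) / (2.0 * h)
    return (-1j * s_matrix(energy, **params).conjugate() * ds).real


def find_resonance(energies, **params):
    """Return (position, width) from the highest time-delay value on a grid.

    For a Lorentzian Q(E) = width / ((E - e_res)**2 + width**2 / 4) the
    peak height is 4 / width, so the width is read off the maximum.
    """
    delays = [time_delay(e, **params) for e in energies]
    peak = max(range(len(energies)), key=delays.__getitem__)
    return energies[peak], 4.0 / delays[peak]
```

Scanning a grid such as `[0.5 + 0.001 * i for i in range(1001)]` with `find_resonance` recovers a position close to 1.0 and a width close to 0.1. In the multichannel case described in the abstract, Q becomes a matrix and its eigenvalues are analysed in the same way.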
Search asymmetries: parallel processing of uncertain sensory information.

PubMed

Vincent, Benjamin T

2011-08-01

What is the mechanism underlying search phenomena such as search asymmetry? Two-stage models such as Feature Integration Theory and Guided Search propose parallel pre-attentive processing followed by serial post-attentive processing. They claim search asymmetry effects are indicative of finding pairs of features, one processed in parallel, the other in serial. An alternative proposal is that a 1-stage parallel process is responsible, and search asymmetries occur when one stimulus has greater internal uncertainty associated with it than another. While the latter account is simpler, only a few studies have set out to empirically test its quantitative predictions, and many researchers still subscribe to the 2-stage account. This paper examines three separate parallel models (Bayesian optimal observer, max rule, and a heuristic decision rule). All three parallel models can account for search asymmetry effects and I conclude that either people can optimally utilise the uncertain sensory data available to them, or are able to select heuristic decision rules which approximate optimal performance. Copyright © 2011 Elsevier Ltd. All rights reserved.
A Revised Spanish/English Oral Proficiency Test, 1974 Field Test Results. Research and Development Memorandum No. 134.

ERIC Educational Resources Information Center

Ramirez, Arnulfo G.; Politzer, Robert L.

A revised Spanish/English oral-proficiency test battery was administered to 40 Spanish-surnamed pupils equally divided by sex at grade levels 1, 3, 5, and 7. The test battery included parallel Spanish and English versions of: (1) a 12-item vocabulary pretest, (2) a 32-item vocabulary-by-domain test consisting of four sections--home, neighborhood,…
Superquantile/CVaR Risk Measures: Second-Order Theory

DTIC Science & Technology

2014-07-17

order version of quantile regression. Keywords: superquantiles, conditional value-at-risk, second-order superquantiles, mixed superquantiles... quantile regression... second-order superquantiles is in the domain of generalized regression. We laid out in [16] a parallel methodology to that of quantile regression
Exploring Time Perspective in Greek Young Adults: Validation of the Zimbardo Time Perspective Inventory and Relationships with Mental Health Indicators

ERIC Educational Resources Information Center

Anagnostopoulos, Fotios; Griva, Fay

2012-01-01

In this article we examine the factorial structure of the Greek version of the Zimbardo Time Perspective Inventory (ZTPI; Zimbardo and Boyd in "J Personal Soc Psychol" 77:1271-1288, 1999), in a sample of 337 university students, using principal axis factoring (PAF) with oblique rotation, and its dimensionality using parallel analysis.…
Some Problems and Solutions in Transferring Ecosystem Simulation Codes to Supercomputers

NASA Technical Reports Server (NTRS)

Skiles, J. W.; Schulbach, C. H.

1994-01-01

Many computer codes for the simulation of ecological systems have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Recent recognition of ecosystem science as a High Performance Computing and Communications Program Grand Challenge area emphasizes supercomputers (both parallel and distributed systems) as the next set of tools for ecological simulation. Transferring ecosystem simulation codes to such systems is not a matter of simply compiling and executing existing code on the supercomputer since there are significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers. To more appropriately match the application to the architecture (necessary to achieve reasonable performance), the parallelism (if it exists) of the original application must be exploited. We discuss our work in transferring a general grassland simulation model (developed on a VAX in the FORTRAN computer programming language) to a Cray Y-MP. We show the Cray shared-memory vector-architecture, and discuss our rationale for selecting the Cray. We describe porting the model to the Cray and executing and verifying a baseline version, and we discuss the changes we made to exploit the parallelism in the application and to improve code execution. As a result, the Cray executed the model 30 times faster than the VAX 11/785 and 10 times faster than a Sun 4 workstation. We achieved an additional speed-up of approximately 30 percent over the original Cray run by using the compiler's vectorizing capabilities and the machine's ability to put subroutines and functions "in-line" in the code. With the modifications, the code still runs at only about 5% of the Cray's peak speed because it makes ineffective use of the vector processing capabilities of the Cray. We conclude with a discussion and future plans.
77 FR 47573 - Approval and Promulgation of Implementation Plans; Mississippi; 110(a)(2)(E)(ii) Infrastructure...

Federal Register 2010, 2011, 2012, 2013, 2014

2012-08-09

... Mississippi Department of Environmental Quality (MDEQ), on July 13, 2012, for parallel processing. This... of Contents I. What is parallel processing? II. Background III. What elements are required under... Executive Order Reviews I. What is parallel processing? Consistent with EPA regulations found at 40 CFR Part...
Double Take: Parallel Processing by the Cerebral Hemispheres Reduces Attentional Blink

ERIC Educational Resources Information Center

Scalf, Paige E.; Banich, Marie T.; Kramer, Arthur F.; Narechania, Kunjan; Simon, Clarissa D.

2007-01-01

Recent data have shown that parallel processing by the cerebral hemispheres can expand the capacity of visual working memory for spatial locations (J. F. Delvenne, 2005) and attentional tracking (G. A. Alvarez & P. Cavanagh, 2005). Evidence that parallel processing by the cerebral hemispheres can improve item identification has remained elusive.…
DEMAID - A DESIGN MANAGER'S AID FOR INTELLIGENT DECOMPOSITION (SUN VERSION)

NASA Technical Reports Server (NTRS)

Rogers, J. L.

1994-01-01

Many engineering systems are large and multi-disciplinary. Before the design of new complex systems such as large space platforms can begin, the possible interactions among subsystems and their parts must be determined. Once this is completed the proposed system can be decomposed to identify its hierarchical structure. DeMAID (A Design Manager's Aid for Intelligent Decomposition) is a knowledge-based system for ordering the sequence of modules and identifying a possible multilevel structure for the design problem. DeMAID displays the modules in an N x N matrix format (called a design structure matrix) where a module is any process that requires input and generates an output. (Modules which generate an output but do not require an input, such as an initialization process, are also acceptable.) Although DeMAID requires an investment of time to generate and refine the list of modules for input, it could save a considerable amount of money and time in the total design process, particularly in new design problems where the ordering of the modules has not been defined. The decomposition of a complex design system into subsystems requires the judgement of the design manager. DeMAID reorders and groups the modules based on the links (interactions) among the modules, helping the design manager make decomposition decisions early in the design cycle. The modules are grouped into circuits (the subsystems) and displayed in an N x N matrix format. Feedback links, which indicate an iterative process, are minimized and only occur within a subsystem. Since there are no feedback links among the circuits, the circuits can be displayed in a multilevel format. Thus, a large amount of information is reduced to one or two displays which are stored for later retrieval and modification. The design manager and leaders of the design teams then have a visual display of the design problem and the intricate interactions among the different modules. 
The design manager could save a substantial amount of time if circuits on the same level of the multilevel structure are executed in parallel. DeMAID estimates the time savings based on the number of available processors. In addition to decomposing the system into subsystems, DeMAID examines the dependencies of a problem with independent variables and dependent functions. A dependency matrix is created to show the relationship. DeMAID is based on knowledge base techniques to provide flexibility and ease in adding new capabilities. Although DeMAID was originally written for design problems, it has proven to be very general in solving any problem which contains modules (processes) which take an input and generate an output. For example, one group is applying DeMAID to gain understanding of the data flow of a very large computer program. In this example, the modules are the subroutines of the program. The design manager begins the design of a system by determining the level of modules which need to be ordered. The level is the "granularity" of the problem. For example, the design manager may wish to examine disciplines (a coarse model), analysis programs, or the data level (a fine model). Once the system is divided into these modules, each module's input and output is determined, creating a data file for input to the main program. DeMAID is executed through a system of menus. The user has the choice to plan, schedule, display the N x N matrix, display the multilevel organization, or examine the dependency matrix. The main program calls a subroutine which reads a rule file and a data file, asserts facts into the knowledge base, and executes the inference engine of the artificial intelligence/expert systems program, CLIPS (C Language Integrated Production System). To determine the effects of changes in the design process, DeMAID includes a trace effects feature. There are two methods available to trace the effects of a change in the design process. 
The first method traces forward through the outputs to determine the effects of an output with respect to a change in a particular input. The second method traces backward to determine what modules must be re-executed if the output of a module must be recomputed. DeMAID is available in three machine versions: a Macintosh version which is written in Symantec's Think C 3.01, a Sun version, and an SGI IRIS version, both of which are written in C. The Macintosh version requires system software 6.0.2 or later and CLIPS 4.3. The source code for the Macintosh version will not compile under version 4.0 of Think C; however, a sample executable is provided on the distribution media. QuickDraw is required for plotting. The Sun version requires GKS 4.1 graphics libraries, OpenWindows 3, and CLIPS 4.3. The SGI IRIS version requires CLIPS 4.3. Since DeMAID is not compatible with CLIPS 5.0 or later, the source code for CLIPS 4.3 is included on the distribution media; however, the documentation for CLIPS 4.3 is not included in the documentation package for DeMAID. It is available from COSMIC separately as the documentation for MSC-21208. The standard distribution medium for the Macintosh version of DeMAID is a set of four 3.5-inch 800K Macintosh format diskettes. The standard distribution medium for the Sun version of DeMAID is a 0.25-inch streaming magnetic tape cartridge (QIC-24) in UNIX tar format. The standard distribution medium for the IRIS version is a 0.25-inch IRIX-compatible streaming magnetic tape cartridge in UNIX tar format. All versions include sample input. DeMAID was originally developed for use on VAX VMS computers in 1989. The Macintosh version of DeMAID was released in 1991 and updated in 1992. The Sun version of DeMAID was released in 1992 and updated in 1993. The SGI IRIS version was released in 1993.
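The two trace-effects methods mentioned in the abstract above can be pictured as graph reachability in opposite directions. The following is a minimal sketch under that assumption, not DeMAID's CLIPS implementation; the module names and the functions `trace_forward` and `trace_backward` are hypothetical.

```python
from collections import deque

# Hypothetical example: each module maps to the modules that consume its output.
links = {
    "init": {"aero", "loads"},
    "aero": {"structures"},
    "structures": {"aero", "weights"},
    "loads": {"weights"},
    "weights": set(),
}

def _reachable(start, edges):
    """Breadth-first search; returns every module reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        m = queue.popleft()
        for n in edges.get(m, ()):
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return sorted(seen - {start})

def trace_forward(links, module):
    """Modules whose results are affected when `module` changes."""
    return _reachable(module, links)

def trace_backward(links, module):
    """Modules that feed `module`, directly or indirectly."""
    reverse = {}
    for m, targets in links.items():
        for n in targets:
            reverse.setdefault(n, set()).add(m)
    return _reachable(module, reverse)

print(trace_forward(links, "loads"))    # downstream of loads
print(trace_backward(links, "weights")) # upstream of weights
```

With this data, a change in `loads` reaches only `weights`, while recomputing `weights` depends on every other module, because the `aero`/`structures` feedback loop sits upstream of it.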
DEMAID - A DESIGN MANAGER'S AID FOR INTELLIGENT DECOMPOSITION (SGI IRIS VERSION)

NASA Technical Reports Server (NTRS)

Rogers, J. L.

1994-01-01

Many engineering systems are large and multi-disciplinary. Before the design of new complex systems such as large space platforms can begin, the possible interactions among subsystems and their parts must be determined. Once this is completed the proposed system can be decomposed to identify its hierarchical structure. DeMAID (A Design Manager's Aid for Intelligent Decomposition) is a knowledge-based system for ordering the sequence of modules and identifying a possible multilevel structure for the design problem. DeMAID displays the modules in an N x N matrix format (called a design structure matrix) where a module is any process that requires input and generates an output. (Modules which generate an output but do not require an input, such as an initialization process, are also acceptable.) Although DeMAID requires an investment of time to generate and refine the list of modules for input, it could save a considerable amount of money and time in the total design process, particularly in new design problems where the ordering of the modules has not been defined. The decomposition of a complex design system into subsystems requires the judgement of the design manager. DeMAID reorders and groups the modules based on the links (interactions) among the modules, helping the design manager make decomposition decisions early in the design cycle. The modules are grouped into circuits (the subsystems) and displayed in an N x N matrix format. Feedback links, which indicate an iterative process, are minimized and only occur within a subsystem. Since there are no feedback links among the circuits, the circuits can be displayed in a multilevel format. Thus, a large amount of information is reduced to one or two displays which are stored for later retrieval and modification. The design manager and leaders of the design teams then have a visual display of the design problem and the intricate interactions among the different modules. 
The design manager could save a substantial amount of time if circuits on the same level of the multilevel structure are executed in parallel. DeMAID estimates the time savings based on the number of available processors. In addition to decomposing the system into subsystems, DeMAID examines the dependencies of a problem with independent variables and dependent functions. A dependency matrix is created to show the relationship. DeMAID is based on knowledge base techniques to provide flexibility and ease in adding new capabilities. Although DeMAID was originally written for design problems, it has proven to be very general in solving any problem which contains modules (processes) which take an input and generate an output. For example, one group is applying DeMAID to gain understanding of the data flow of a very large computer program. In this example, the modules are the subroutines of the program. The design manager begins the design of a system by determining the level of modules which need to be ordered. The level is the "granularity" of the problem. For example, the design manager may wish to examine disciplines (a coarse model), analysis programs, or the data level (a fine model). Once the system is divided into these modules, each module's input and output is determined, creating a data file for input to the main program. DeMAID is executed through a system of menus. The user has the choice to plan, schedule, display the N x N matrix, display the multilevel organization, or examine the dependency matrix. The main program calls a subroutine which reads a rule file and a data file, asserts facts into the knowledge base, and executes the inference engine of the artificial intelligence/expert systems program, CLIPS (C Language Integrated Production System). To determine the effects of changes in the design process, DeMAID includes a trace effects feature. There are two methods available to trace the effects of a change in the design process. 
The first method traces forward through the outputs to determine the effects of an output with respect to a change in a particular input. The second method traces backward to determine what modules must be re-executed if the output of a module must be recomputed. DeMAID is available in three machine versions: a Macintosh version which is written in Symantec's Think C 3.01, a Sun version, and an SGI IRIS version, both of which are written in C. The Macintosh version requires system software 6.0.2 or later and CLIPS 4.3. The source code for the Macintosh version will not compile under version 4.0 of Think C; however, a sample executable is provided on the distribution media. QuickDraw is required for plotting. The Sun version requires GKS 4.1 graphics libraries, OpenWindows 3, and CLIPS 4.3. The SGI IRIS version requires CLIPS 4.3. Since DeMAID is not compatible with CLIPS 5.0 or later, the source code for CLIPS 4.3 is included on the distribution media; however, the documentation for CLIPS 4.3 is not included in the documentation package for DeMAID. It is available from COSMIC separately as the documentation for MSC-21208. The standard distribution medium for the Macintosh version of DeMAID is a set of four 3.5-inch 800K Macintosh format diskettes. The standard distribution medium for the Sun version of DeMAID is a 0.25-inch streaming magnetic tape cartridge (QIC-24) in UNIX tar format. The standard distribution medium for the IRIS version is a 0.25-inch IRIX-compatible streaming magnetic tape cartridge in UNIX tar format. All versions include sample input. DeMAID was originally developed for use on VAX VMS computers in 1989. The Macintosh version of DeMAID was released in 1991 and updated in 1992. The Sun version of DeMAID was released in 1992 and updated in 1993. The SGI IRIS version was released in 1993.
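The abstract above says that DeMAID estimates the time saved by running same-level circuits in parallel, but it does not give the formula. The sketch below shows one plausible way such an estimate could be made, assuming each level must finish before the next starts and circuits within a level are assigned greedily (longest first) to the least-loaded processor. The durations and the functions `level_makespan` and `estimate_savings` are hypothetical.

```python
import heapq

def level_makespan(durations, processors):
    """Greedy longest-first schedule of one level; returns its elapsed time."""
    if processors < 1:
        raise ValueError("at least one processor is required")
    loads = [0.0] * processors  # min-heap of per-processor busy time
    for d in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + d)
    return max(loads)

def estimate_savings(levels, processors):
    """Return (sequential time, parallel time, time saved)."""
    sequential = sum(sum(level) for level in levels)
    parallel = sum(level_makespan(level, processors) for level in levels)
    return sequential, parallel, sequential - parallel

# Hypothetical circuit run times (hours), one inner list per level.
levels = [[2.0], [5.0, 3.0, 1.0], [4.0]]

for p in (1, 2, 3):
    print(p, estimate_savings(levels, p))
```

Only the middle level has more than one circuit, so a second processor shortens that level from 9 hours to 5, and a third processor adds nothing because the 5-hour circuit already dominates.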
