Sample records for the OpenMP programming model

  1. Support of Multidimensional Parallelism in the OpenMP Programming Model

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele

    2003-01-01

    OpenMP is the current standard for shared-memory programming. While providing ease of parallel programming, the OpenMP programming model also has limitations which often affect the scalability of applications. Examples of such limitations are work distribution and point-to-point synchronization among threads. We propose extensions to the OpenMP programming model which allow the user to easily distribute the work in multiple dimensions and synchronize the workflow among the threads. The proposed extensions include four new constructs and the associated runtime library. They do not require changes to the source code and can be implemented based on the existing OpenMP standard. We illustrate the concept in a prototype translator and test it with benchmark codes and a cloud modeling code.
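
    The paper's proposed constructs were never part of a released OpenMP specification, so no official syntax is shown here. As a rough approximation of multidimensional work distribution in standard OpenMP, the collapse clause (added later, in OpenMP 3.0) distributes a multidimensional iteration space across a thread team; a minimal C++ sketch:

      #include <vector>

      // Hedged sketch: standard OpenMP 'collapse' distributing a 2-D
      // iteration space, approximating the multidimensional work
      // distribution the paper's proposed constructs aim at.
      void update(std::vector<double>& grid, int nx, int ny) {
        #pragma omp parallel for collapse(2) schedule(static)
        for (int i = 0; i < nx; ++i)
          for (int j = 0; j < ny; ++j)
            grid[i * ny + j] *= 0.5;   // placeholder per-cell work
      }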

  2. A Programming Model Performance Study Using the NAS Parallel Benchmarks

    DOE PAGES

    Shan, Hongzhang; Blagojević, Filip; Min, Seung-Jai; ...

    2010-01-01

    Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models, MPI, OpenMP and PGAS, to understand their performance and memory usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other ways to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an InfiniBand cluster. Our results show that in general the three programming models exhibit very similar performance characteristics. In a few cases, OpenMP is significantly faster because it explicitly avoids communication. For these particular cases, we were able to re-write the UPC versions and achieve performance equal to OpenMP. OpenMP was also the most advantageous in terms of memory usage. We also compare performance differences between the two Cray systems, which have quad-core and hex-core processors, and show that at scale the performance is almost always slower on the hex-core system because of increased contention for network resources.

  3. Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets.

    PubMed

    Shrimankar, D D; Sathe, S R

    2016-01-01

    Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputers often consist of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, OpenMP programs cannot scale beyond a single SMP node, whereas MPI programs can span multiple SMP nodes at the cost of internode communication overhead. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that communication overhead is significant even in OpenMP loop execution and increases with the number of participating cores. We also present a communication model to approximate the overhead from communication in OpenMP loops. Our results hold across a wide variety of input data files. We have developed our own load balancing and cache optimization techniques for the message-passing model. Our experimental results show that these techniques give optimum performance of our parallel algorithm for various input parameters, such as sequence size and tile size, on a wide variety of multicore architectures.
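
    To make the tiling idea concrete, the sketch below (not the authors' code) parallelizes the anti-diagonal waves of a tiled dynamic-programming alignment matrix with OpenMP: tiles on the same wave are independent, since each tile depends only on its west, north, and north-west neighbors. process_tile is a hypothetical callback for the sequential per-tile work.

      #include <algorithm>

      // Hedged sketch of wavefront tile parallelism for pairwise alignment.
      // process_tile(ti, tj) is a hypothetical per-tile DP routine.
      void align_tiled(int n_tiles, void (*process_tile)(int ti, int tj)) {
        for (int wave = 0; wave < 2 * n_tiles - 1; ++wave) {
          int lo = std::max(0, wave - n_tiles + 1);
          int hi = std::min(wave, n_tiles - 1);
          #pragma omp parallel for schedule(dynamic)
          for (int ti = lo; ti <= hi; ++ti)
            process_tile(ti, wave - ti);   // independent tiles on this wave
        }
      }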

  4. Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets

    PubMed Central

    Shrimankar, D. D.; Sathe, S. R.

    2016-01-01

    Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputers often consist of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, OpenMP programs cannot scale beyond a single SMP node, whereas MPI programs can span multiple SMP nodes at the cost of internode communication overhead. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that communication overhead is significant even in OpenMP loop execution and increases with the number of participating cores. We also present a communication model to approximate the overhead from communication in OpenMP loops. Our results hold across a wide variety of input data files. We have developed our own load balancing and cache optimization techniques for the message-passing model. Our experimental results show that these techniques give optimum performance of our parallel algorithm for various input parameters, such as sequence size and tile size, on a wide variety of multicore architectures. PMID:27932868

  5. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barbara Chapman

    OpenMP was not well recognized at the beginning of the project, around year 2003, because of its limited use in DoE production applications and the immature hardware support for an efficient implementation. Yet in recent years, it has been gradually adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We have observed this trend and worked diligently to improve our OpenMP compiler and runtimes, as well as to work with the OpenMP standard organization to make sure OpenMP evolves in a direction close to DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed memory systems by exploring different programming models, language extensions, compiler optimizations, as well as runtime library support.

  6. A ROSE-based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liao, C; Quinlan, D; Panas, T

    2010-01-25

    OpenMP is a popular and evolving programming model for shared-memory platforms. It relies on compilers for optimal performance and to target modern hardware architectures. A variety of extensible and robust research compilers are key to OpenMP's sustainable success in the future. In this paper, we present our efforts to build an OpenMP 3.0 research compiler for C, C++, and Fortran, using the ROSE source-to-source compiler framework. Our goal is to support OpenMP research for ourselves and others. We have extended ROSE's internal representation to handle all of the OpenMP 3.0 constructs and facilitate their manipulation. Since OpenMP research is often complicated by the tight coupling of the compiler translations and the runtime system, we present a set of rules to define a common OpenMP runtime library (XOMP) on top of multiple runtime libraries. These rules additionally define how to build a set of translations targeting XOMP. Our work demonstrates how to reuse OpenMP translations across different runtime libraries. This work simplifies OpenMP research by decoupling the problematic dependence between the compiler translations and the runtime libraries. We present an evaluation of our work by demonstrating an analysis tool for OpenMP correctness. We also show how XOMP can be defined using both GOMP and Omni and present comparative performance results against other OpenMP compilers.

  7. Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores

    NASA Astrophysics Data System (ADS)

    Kegel, Philipp; Schellmann, Maraike; Gorlatch, Sergei

    We compare two parallel programming approaches for multi-core systems: the well-known OpenMP and the recently introduced Threading Building Blocks (TBB) library by Intel®. The comparison is made using the parallelization of a real-world numerical algorithm for medical imaging. We develop several parallel implementations, and compare them w.r.t. programming effort, programming style and abstraction, and runtime performance. We show that TBB requires a considerable program re-design, whereas with OpenMP simple compiler directives are sufficient. While TBB appears to be less appropriate for parallelizing existing implementations, it fosters a good programming style and higher abstraction level for newly developed parallel programs. Our experimental measurements on a dual quad-core system demonstrate that OpenMP slightly outperforms TBB in our implementation.
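
    For illustration (not code from the paper), the same loop in both models shows the style difference the authors describe: OpenMP annotates the existing loop, while TBB restructures it around a lambda and a range object.

      #include <tbb/blocked_range.h>
      #include <tbb/parallel_for.h>
      #include <vector>

      // OpenMP: a directive on the existing sequential loop.
      void scale_openmp(std::vector<float>& v) {
        #pragma omp parallel for
        for (long i = 0; i < (long)v.size(); ++i) v[i] *= 2.0f;
      }

      // TBB: the loop body becomes a lambda over a blocked_range.
      void scale_tbb(std::vector<float>& v) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
                          [&](const tbb::blocked_range<size_t>& r) {
                            for (size_t i = r.begin(); i != r.end(); ++i)
                              v[i] *= 2.0f;
                          });
      }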

  8. The Automatic Parallelisation of Scientific Application Codes Using a Computer Aided Parallelisation Toolkit

    NASA Technical Reports Server (NTRS)

    Ierotheou, C.; Johnson, S.; Leggett, P.; Cross, M.; Evans, E.; Jin, Hao-Qiang; Frumkin, M.; Yan, J.; Biegel, Bryan (Technical Monitor)

    2001-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. Historically, the lack of a programming standard for using directives and the rather limited scalability have affected the take-up of this programming model. Significant progress has since been made in hardware and software technologies, and as a result the performance of directive-based parallel programs has also improved. The introduction of OpenMP, an industrial standard for shared-memory programming with directives, has also addressed the issue of portability. In this study, we have extended the computer aided parallelization toolkit (developed at the University of Greenwich) to automatically generate OpenMP-based parallel programs with nominal user assistance. We outline the way in which loop types are categorized and how efficient OpenMP directives can be defined and placed using the in-depth interprocedural analysis that is carried out by the toolkit. We also discuss the application of the toolkit to the NAS Parallel Benchmarks and a number of real-world application codes. This work not only demonstrates the great potential of using the toolkit to quickly parallelize serial programs but also the good performance achievable on up to 300 processors for hybrid message passing and directive-based parallelizations.

  9. OpenMP 4.5 Validation and Verification Suite

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pophale, Swaroop S; Bernholdt, David E; Hernandez, Oscar R

    2017-12-15

    OpenMP, a directive-based programming API, introduces directives for accelerator devices that programmers are starting to use more frequently in production codes. To make sure OpenMP directives work correctly across architectures, it is critical to have a mechanism that tests for an implementation's conformance to the OpenMP standard. This testing process can uncover ambiguities in the OpenMP specification, which helps compiler developers and users make better use of the standard. We fill this gap with our validation and verification test suite, which focuses on the offload directives available in OpenMP 4.5.
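
    A minimal sketch of the kind of OpenMP 4.5 offload construct such a suite exercises (illustrative only, not an actual suite test):

      #include <vector>

      // SAXPY offloaded to a device with explicit map clauses.
      void saxpy(float a, std::vector<float>& x, const std::vector<float>& y) {
        float* xp = x.data();
        const float* yp = y.data();
        long n = (long)x.size();
        #pragma omp target teams distribute parallel for \
                map(tofrom: xp[0:n]) map(to: yp[0:n])
        for (long i = 0; i < n; ++i)
          xp[i] = a * xp[i] + yp[i];
      }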

  10. Characterizing Task-Based OpenMP Programs

    PubMed Central

    Muddukrishna, Ananya; Jonsson, Peter A.; Brorsson, Mats

    2015-01-01

    Programmers struggle to understand the performance of task-based OpenMP programs since profiling tools only report thread-based performance. Performance tuning also requires task-based performance information in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing exposed task parallelism and per-task instruction profiles of benchmarks in the widely-used Barcelona OpenMP Tasks Suite. By using our method to characterize task-based performance, programmers can tune performance faster and understand performance tradeoffs more effectively than with existing tools. PMID:25860023
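
    A minimal example of the task-based style such profiling targets (illustrative, not from the Barcelona suite): recursive Fibonacci with task creation, a cutoff to bound task overhead, and taskwait for synchronization.

      #include <cstdio>

      long fib(long n) {
        if (n < 2) return n;
        long a, b;
        #pragma omp task shared(a) if(n > 20)   // cutoff limits task overhead
        a = fib(n - 1);
        b = fib(n - 2);                          // done by encountering thread
        #pragma omp taskwait
        return a + b;
      }

      int main() {
        long r;
        #pragma omp parallel
        #pragma omp single
        r = fib(30);
        std::printf("fib(30) = %ld\n", r);
      }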

  11. Toward Enhancing OpenMP's Work-Sharing Directives

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chapman, B M; Huang, L; Jin, H

    2006-05-17

    OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, including thread subteams and thread topologies. Thus, we identify language features that improve OpenMP application performance on emerging and large-scale platforms while preserving ease of programming.

  12. Improvement and speed optimization of numerical tsunami modelling program using OpenMP technology

    NASA Astrophysics Data System (ADS)

    Chernov, A.; Zaytsev, A.; Yalciner, A.; Kurkin, A.

    2009-04-01

    Currently, the basic problem of tsunami modeling is the low speed of calculations, which is unacceptable for operational warning services. Existing algorithms for the numerical modeling of the hydrodynamic processes of tsunami waves were developed without taking advantage of modern computer facilities. Calculations can be accelerated considerably by using parallel algorithms. We discuss here a new approach to parallelizing tsunami modeling code using OpenMP technology (for multiprocessor systems with shared memory). Nowadays, multiprocessor systems are easily accessible to everyone, and the cost of using such systems is much lower than the cost of clusters. This also enables programmers to apply multithreaded algorithms on researchers' desktop computers. Another important advantage of this approach is the shared-memory mechanism: there is no need to send data over slow networks (for example, Ethernet). All memory is common to all computing processes, which yields almost linear scalability of the program. In the new version of NAMI DANCE, OpenMP multithreading provides an 80% gain in speed over the single-threaded version on a dual-processor unit, and a 320% gain was attained on a quad-core PC. Thus, it was possible to considerably reduce calculation time on scientific workstations (desktops) without completely changing the program and user interfaces. Further modernization of the algorithms for preparing initial data and processing results using OpenMP looks reasonable. The final version of NAMI DANCE with the increased computational speed can be used not only for research purposes but also in real-time Tsunami Warning Systems.

  13. Testing New Programming Paradigms with NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

    2000-01-01

    Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also in the increasing complexity of real applications. Technologies have been developed that aim at scaling to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. In recent years, new efforts have been made to define new parallel programming paradigms. The best examples are HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplifying the tedious tasks encountered in writing message passing programs. HPF is independent of the memory hierarchy; however, due to the immaturity of compiler technology, its performance is still questionable. Although the use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promises portability in a heterogeneous environment and offers the possibility to "compile once and run anywhere." To test these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage was applied to several benchmarks, notably BT and SP, resulting in better sequential performance. To overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as the lack of support for the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outermost loop level to achieve the largest granularity. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size. These results were obtained on an SGI Origin2000 (195 MHz) with the MIPSpro f77 compiler 7.2.1 for the OpenMP and MPI codes and the PGI pghpf 2.4.3 compiler with MPI interface for the HPF programs.

  14. Experiences using OpenMP based on Compiler Directed Software DSM on a PC Cluster

    NASA Technical Reports Server (NTRS)

    Hess, Matthias; Jost, Gabriele; Mueller, Matthias; Ruehle, Roland

    2003-01-01

    In this work we report on our experiences running OpenMP programs on a commodity cluster of PCs running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS Parallel Benchmarks that have been automatically parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.

  15. Argobots: A Lightweight Low-Level Threading and Tasking Framework

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Seo, Sangmin; Amer, Abdelhalim; Balaji, Pavan

    In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures, or are not sufficiently powerful or flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with a rich set of controls to allow specialization by the user or high-level programming model. We describe the design, implementation, and optimization of Argobots and present integrations with three example high-level models: OpenMP, MPI, and a co-located I/O service. Evaluations show that (1) Argobots outperforms existing generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency hiding capabilities; and (4) the I/O service with Argobots reduces interference with co-located applications, achieving performance competitive with that of the Pthreads version.

  16. Parallelisation of the Model-Based Iterative Reconstruction Algorithm DIRA

    PubMed

    Örtenberg, A; Magnusson, M; Sandborg, M; Alm Carlsson, G; Malusek, A

    2016-06-01

    New paradigms for parallel programming have been devised to simplify software development on multi-core processors and many-core graphical processing units (GPU). Despite their obvious benefits, the parallelisation of existing computer programs is not an easy task. In this work, the use of the Open Multiprocessing (OpenMP) and Open Computing Language (OpenCL) frameworks is considered for the parallelisation of the model-based iterative reconstruction algorithm DIRA with the aim to significantly shorten the code's execution time. Selected routines were parallelised using the OpenMP and OpenCL libraries; some routines were converted from MATLAB to C and optimised. Parallelisation of the code with OpenMP was easy and resulted in an overall speedup of 15 on a 16-core computer. Parallelisation with OpenCL was more difficult owing to differences between the central processing unit and GPU architectures. The resulting speedup was substantially lower than the theoretical peak performance of the GPU; the cause was explained.

  17. Archer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Atzeni, Simone; Ahn, Dong; Gopalakrishnan, Ganesh

    2017-01-12

    Archer is built on top of the LLVM/Clang compilers that support OpenMP. It applies static and dynamic analysis techniques to detect data races in OpenMP programs while incurring very low runtime and memory overhead. Static analyses identify data-race-free OpenMP regions and exclude them from runtime analysis, which is performed by the ThreadSanitizer included in LLVM/Clang.
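
    A hedged example (not from Archer's documentation) of the kind of defect it reports: an unsynchronized update to a shared variable inside a parallel loop. With an Archer-enabled Clang, compiling with OpenMP and ThreadSanitizer instrumentation (e.g. -fopenmp -fsanitize=thread; exact invocation may vary by installation) flags the race at runtime.

      #include <vector>

      int main() {
        std::vector<int> v(1000, 1);
        int sum = 0;
        #pragma omp parallel for        // RACE: unsynchronized writes to sum
        for (int i = 0; i < (int)v.size(); ++i)
          sum += v[i];                  // fix: add reduction(+:sum) above
        return sum == 1000 ? 0 : 1;     // result varies across runs
      }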

  18. What Multilevel Parallel Programs do when you are not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors, parallel programming techniques have evolved to support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation-specific behavior of the operating system (OS) or architectural issues. Rewriting a large scientific application to employ a new programming paradigm is usually a time-consuming and error-prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse grained parallelization and OpenMP [9] for fine grained loop level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. The second paradigm under consideration is MLP, which was developed by Taft. The approach is similar to MPI/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.
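
    A generic skeleton (not from the study) of the hybrid MPI/OpenMP pattern described above: MPI ranks own private address spaces and exchange messages, while OpenMP parallelizes loops inside each rank.

      #include <mpi.h>
      #include <cstdio>

      int main(int argc, char** argv) {
        int provided, rank;
        // FUNNELED: only the master thread makes MPI calls.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)   // fine-grained level
        for (int i = 0; i < 1000000; ++i)
          local += 1e-6;                              // per-rank loop work

        double global = 0.0;                          // coarse-grained level
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("global = %f\n", global);
        MPI_Finalize();
      }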

  19. Experiences Using OpenMP Based on Compiler Directed Software DSM on a PC Cluster

    NASA Technical Reports Server (NTRS)

    Hess, Matthias; Jost, Gabriele; Mueller, Matthias; Ruehle, Roland; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this work we report on our experiences running OpenMP programs on a commodity cluster of PCs (personal computers) running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS (NASA Advanced Supercomputing) Parallel Benchmarks that have been automatically parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.

  20. The Research of the Parallel Computing Development from the Angle of Cloud Computing

    NASA Astrophysics Data System (ADS)

    Peng, Zhensheng; Gong, Qingge; Duan, Yanyu; Wang, Yun

    2017-10-01

    Cloud computing is the development of parallel computing, distributed computing and grid computing. The development of cloud computing brings parallel computing into people's lives. Firstly, this paper expounds the concept of cloud computing and introduces several traditional parallel programming models. Secondly, it analyzes and studies the principles, advantages and disadvantages of OpenMP, MPI and MapReduce respectively. Finally, it compares the MPI and OpenMP models with MapReduce from the angle of cloud computing. The results of this paper are intended to provide a reference for the development of parallel computing.

  1. Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Hao-Qiang; anMey, Dieter; Hatay, Ferhat F.

    2003-01-01

    Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming paradigm is the best will depend on the nature of the given problem, the hardware components of the cluster, the network, and the available software. In this study we compare the performance of different implementations of the same CFD benchmark application, using the same numerical algorithm but employing different programming paradigms.

  2. Argobots: A Lightweight Low-Level Threading and Tasking Framework

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Seo, Sangmin; Amer, Abdelhalim; Balaji, Pavan

    In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures, or are not sufficiently powerful or flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with a rich set of controls to allow specialization by end users or high-level programming models. We describe the design, implementation, and performance characterization of Argobots and present integrations with three high-level models: OpenMP, MPI, and colocated I/O services. Evaluations show that (1) Argobots, while providing richer capabilities, is competitive with existing simpler generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency-hiding capabilities; and (4) I/O services with Argobots reduce interference with colocated applications while achieving performance competitive with that of a Pthreads approach.

  3. Argobots: A Lightweight Low-Level Threading and Tasking Framework

    DOE PAGES

    Seo, Sangmin; Amer, Abdelhalim; Balaji, Pavan; ...

    2017-10-24

    In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures, or are not sufficiently powerful or flexible. In this article, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with a rich set of controls to allow specialization by the user or high-level programming model. Here, we describe the design, implementation, and optimization of Argobots and present integrations with three example high-level models: OpenMP, MPI, and a co-located I/O service. Evaluations show that (1) Argobots outperforms existing generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency hiding capabilities; and (4) the I/O service with Argobots reduces interference with co-located applications, achieving performance competitive with that of the Pthreads version.

  4. Argobots: A Lightweight Low-Level Threading and Tasking Framework

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Seo, Sangmin; Amer, Abdelhalim; Balaji, Pavan

    In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures, or are not sufficiently powerful or flexible. In this article, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with a rich set of controls to allow specialization by the user or high-level programming model. Here, we describe the design, implementation, and optimization of Argobots and present integrations with three example high-level models: OpenMP, MPI, and a co-located I/O service. Evaluations show that (1) Argobots outperforms existing generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency hiding capabilities; and (4) the I/O service with Argobots reduces interference with co-located applications, achieving performance competitive with that of the Pthreads version.

  5. Automatic Generation of OpenMP Directives and Its Application to Computational Fluid Dynamics Codes

    NASA Technical Reports Server (NTRS)

    Yan, Jerry; Jin, Haoqiang; Frumkin, Michael; Yan, Jerry (Technical Monitor)

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress has been made in hardware and software technologies, the performance of parallel programs with compiler directives has demonstrated large improvements. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate OpenMP-based parallel programs with nominal user assistance. We outline techniques used in the implementation of the tool and discuss the application of this tool to the NAS Parallel Benchmarks and several computational fluid dynamics codes. This work demonstrates the great potential of using the tool to quickly port parallel programs and to achieve good performance that exceeds that of some commercial tools.

  6. Automatic Generation of Directive-Based Parallel Programs for Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Yan, Jerry; Frumkin, Michael

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress has been made in hardware and software technologies, the performance of parallel programs with compiler directives has demonstrated large improvements. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. Due to its ease of programming and its good performance, the technique has become very popular. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate directive-based OpenMP parallel programs. We outline techniques used in the implementation of the tool and present test results on the NAS Parallel Benchmarks and ARC3D, a CFD application. This work demonstrates the great potential of using computer-aided tools to quickly port parallel programs and also achieve good performance.

  7. Early Experiences Writing Performance Portable OpenMP 4 Codes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joubert, Wayne; Hernandez, Oscar R

    In this paper, we evaluate the recently available directives in OpenMP 4 to parallelize a computational kernel using both the traditional shared memory approach and the newer accelerator targeting capabilities. In addition, we explore various transformations that attempt to increase application performance portability, and examine the expressiveness and performance implications of using these approaches. For example, we want to understand whether the target map directives in OpenMP 4 improve data locality when mapped to a shared memory system, as opposed to the first touch policy approach of traditional OpenMP. To that end, we use recent Cray and Intel compilers to measure the performance variations of a simple application kernel when executed on the OLCF's Titan supercomputer with NVIDIA GPUs and the Beacon system with Intel Xeon Phi accelerators attached. To better understand these trade-offs, we compare our results from traditional OpenMP shared memory implementations to the newer accelerator programming model when it is used to target both the CPU and an attached heterogeneous device. We believe the results and lessons learned as presented in this paper will be useful to the larger user community by providing guidelines that can assist programmers in the development of performance portable code.
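
    For reference, the first touch policy mentioned above is commonly exploited with the following idiom (a sketch, not the paper's kernel): initialize data with the same loop schedule that later computes on it, so pages land on the NUMA node of the thread that uses them.

      // First-touch placement: pages are placed on the NUMA node of the
      // thread that first writes them, so use the compute schedule here.
      void first_touch_init(double* a, double* b, long n) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; }
      }

      void compute(double* a, const double* b, long n) {
        #pragma omp parallel for schedule(static)   // same schedule -> local pages
        for (long i = 0; i < n; ++i) a[i] += 2.0 * b[i];
      }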

  8. OpenMP performance for benchmark 2D shallow water equations using LBM

    NASA Astrophysics Data System (ADS)

    Sabri, Khairul; Rabbani, Hasbi; Gunawan, Putu Harry

    2018-03-01

    Shallow water equations, commonly referred to as Saint-Venant equations, are used to model fluid phenomena. These equations can be solved numerically using several methods, such as the Lattice Boltzmann method (LBM), SIMPLE-like methods, the finite difference method, Godunov-type methods, and the finite volume method. In this paper, the shallow water equations are approximated using the LBM (known as LABSWE) and simulated in parallel using OpenMP. To evaluate the performance of the parallel algorithm with 2 and 4 threads, ten different grid sizes Lx and Ly are elaborated. The results show that using the OpenMP platform, the computational time for solving LABSWE can be decreased. For instance, using a grid size of 1000 × 500, the computation times observed with 2 and 4 threads are 93.54 s and 333.243 s, respectively.

  9. MPI, HPF or OpenMP: A Study with the NAS Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Frumkin, Michael; Hribar, Michelle; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)

    1999-01-01

    Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but the task can be simplified by high level languages and, even better, automated by parallelizing tools and compilers. The definition of the HPF (High Performance Fortran, based on the data parallel model) and OpenMP (based on the shared memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to languages like Fortran and simplify many tedious tasks encountered in writing message passing programs. In our study we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. A comparison of their performance with the MPI implementation and the pros and cons of the different approaches will be discussed, along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study, the potential of applying some of the techniques to realistic aerospace applications will be presented.

  10. MPI, HPF or OpenMP: A Study with the NAS Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Hribar, M.; Waheed, A.; Yan, J.; Saini, Subhash (Technical Monitor)

    1999-01-01

    Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but this task can be simplified by high level languages and, even better, automated by parallelizing tools and compilers. The definition of the HPF (High Performance Fortran, based on the data parallel model) and OpenMP (based on the shared memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to languages like Fortran and simplify many tedious tasks encountered in writing message passing programs. In our study, we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. A comparison of their performance with the MPI implementation and the pros and cons of the different approaches will be discussed, along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study, the potential of applying some of the techniques to realistic aerospace applications will be presented.

  11. Parallelization of interpolation, solar radiation and water flow simulation modules in GRASS GIS using OpenMP

    NASA Astrophysics Data System (ADS)

    Hofierka, Jaroslav; Lacko, Michal; Zubal, Stanislav

    2017-10-01

    In this paper, we describe the parallelization of three complex and computationally intensive modules of GRASS GIS using the OpenMP application programming interface for multi-core computers. These include the v.surf.rst module for spatial interpolation, the r.sun module for solar radiation modeling and the r.sim.water module for water flow simulation. We briefly describe the functionality of the modules and parallelization approaches used in the modules. Our approach includes the analysis of the module's functionality, identification of source code segments suitable for parallelization and proper application of OpenMP parallelization code to create efficient threads processing the subtasks. We document the efficiency of the solutions using the airborne laser scanning data representing land surface in the test area and derived high-resolution digital terrain model grids. We discuss the performance speed-up and parallelization efficiency depending on the number of processor threads. The study showed a substantial increase in computation speeds on a standard multi-core computer while maintaining the accuracy of results in comparison to the output from original modules. The presented parallelization approach showed the simplicity and efficiency of the parallelization of open-source GRASS GIS modules using OpenMP, leading to an increased performance of this geospatial software on standard multi-core computers.

  12. Performance of hybrid programming models for multiscale cardiac simulations: preparing for petascale computation.

    PubMed

    Pope, Bernard J; Fitch, Blake G; Pitman, Michael C; Rice, John J; Reumann, Matthias

    2011-10-01

    Future multiscale and multiphysics models that support research into human disease, translational medical science, and treatment can utilize the power of high-performance computing (HPC) systems. We anticipate that computationally efficient multiscale models will require the use of sophisticated hybrid programming models, mixing distributed message-passing processes [e.g., the message-passing interface (MPI)] with multithreading (e.g., OpenMP, Pthreads). The objective of this study is to compare the performance of such hybrid programming models when applied to the simulation of a realistic physiological multiscale model of the heart. Our results show that the hybrid models perform favorably when compared to an implementation using MPI alone and, furthermore, that OpenMP in combination with MPI provides a satisfactory compromise between performance and code complexity. Having the ability to use threads within MPI processes enables the sophisticated use of all processor cores for both computation and communication phases. Considering that HPC systems in 2012 will have two orders of magnitude more cores than what was used in this study, we believe that faster than real-time multiscale cardiac simulations can be achieved on these systems.

  13. The OpenMP Implementation of NAS Parallel Benchmarks and its Performance

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry

    1999-01-01

    As the new ccNUMA architecture became popular in recent years, parallel programming with compiler directives on these machines has evolved to accommodate new needs. In this study, we examine the effectiveness of OpenMP directives for parallelizing the NAS Parallel Benchmarks. Implementation details will be discussed and performance will be compared with the MPI implementation. We have demonstrated that OpenMP can achieve very good results for parallelization on a shared memory system, but effective use of memory and cache is very important.

  14. Effective Vectorization with OpenMP 4.5

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huber, Joseph N.; Hernandez, Oscar R.; Lopez, Matthew Graham

    This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor's potential performance that, if SIMD is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C/C++ and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT). We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.
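
    A minimal illustration of the OpenMP SIMD directives the paper covers: 'omp simd' requests vectorization of the loop, while clauses such as reduction and simdlen convey facts the compiler cannot always infer on its own.

      #include <vector>

      float dot(const std::vector<float>& a, const std::vector<float>& b) {
        float s = 0.0f;
        #pragma omp simd reduction(+:s) simdlen(8)   // 8 lanes is a hint
        for (long i = 0; i < (long)a.size(); ++i)
          s += a[i] * b[i];
        return s;
      }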

  15. Lattice QCD simulations using the OpenACC platform

    NASA Astrophysics Data System (ADS)

    Majumdar, Pushan

    2016-10-01

    In this article we will explore the OpenACC platform for programming Graphics Processing Units (GPUs). The OpenACC platform offers a directive based programming model for GPUs which avoids the detailed data flow control and memory management necessary in a CUDA programming environment. In the OpenACC model, programs can be written in high level languages with OpenMP like directives. We present some examples of QCD simulation codes using OpenACC and discuss their performance on the Fermi and Kepler GPUs.
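
    A hedged sketch of the directive style the article describes (not its QCD code): an OpenACC parallel loop with explicit data clauses, written here in C++ for illustration.

      #include <vector>

      // AXPY with OpenACC: the loop is offloaded, data movement is stated
      // declaratively rather than via explicit CUDA memory management.
      void axpy(float a, std::vector<float>& x, const std::vector<float>& y) {
        float* xp = x.data();
        const float* yp = y.data();
        long n = (long)x.size();
        #pragma acc parallel loop copy(xp[0:n]) copyin(yp[0:n])
        for (long i = 0; i < n; ++i)
          xp[i] = a * xp[i] + yp[i];
      }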

  16. Characterizing and Mitigating Work Time Inflation in Task Parallel Programs

    DOE PAGES

    Olivier, Stephen L.; de Supinski, Bronis R.; Schulz, Martin; ...

    2013-01-01

    Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.

  17. Optics Program Modified for Multithreaded Parallel Computing

    NASA Technical Reports Server (NTRS)

    Lou, John; Bedding, Dave; Basinger, Scott

    2006-01-01

    A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application programming interface that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS, based on pthreads [POSIX Threads, where "POSIX" signifies a portable operating system interface for UNIX]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.

  18. Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mishra, Alok; Li, Lingda; Kong, Martin

    Here, the latest OpenMP standard offers automatic device offloading capabilities which facilitate GPU programming. Despite this, there remain many challenges. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for a unified memory space. In such systems, the CPU and GPU can access each other's memory transparently; that is, the data movement is managed automatically by the underlying system software and hardware. Memory oversubscription is also possible in these systems. However, there is a significant lack of knowledge about how this mechanism will perform and how programmers should use it. We have modified several benchmark codes in the Rodinia benchmark suite to study the behavior of OpenMP accelerator extensions and have used them to explore the impact of unified memory in an OpenMP context. We moreover modified the open source LLVM compiler to allow OpenMP programs to exploit unified memory. The results of our evaluation reveal that, while the performance of unified memory is comparable with that of normal GPU offloading for benchmarks with little data reuse, it suffers from significant overhead when GPU memory is oversubscribed for benchmarks with large amounts of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.
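
    The paper relied on a modified LLVM; OpenMP 5.0 later standardized this behavior via the requires directive. A minimal sketch assuming an OpenMP 5.0 compiler with unified memory support:

      #include <vector>

      // With unified shared memory, host pointers can be dereferenced on
      // the device and no map clauses are needed.
      #pragma omp requires unified_shared_memory

      void scale(std::vector<double>& v, double a) {
        double* p = v.data();
        long n = (long)v.size();
        #pragma omp target teams distribute parallel for   // no map()
        for (long i = 0; i < n; ++i)
          p[i] *= a;
      }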

  19. Innovative Language-Based & Object-Oriented Structured AMR Using Fortran 90 and OpenMP

    NASA Technical Reports Server (NTRS)

    Norton, C.; Balsara, D.

    1999-01-01

    Parallel adaptive mesh refinement (AMR) is an important numerical technique that leads to the efficient solution of many physical and engineering problems. In this paper, we describe how AMR programming can be performed in an object-oriented way using the modern aspects of Fortran 90 combined with the parallelization features of OpenMP.

  20. Parameters analysis of a porous medium model for treatment with hyperthermia using OpenMP

    NASA Astrophysics Data System (ADS)

    Freitas Reis, Ruy; dos Santos Loureiro, Felipe; Lobosco, Marcelo

    2015-09-01

    Cancer is the second leading cause of death in the world, so treatments have been developed to address this global health problem. Hyperthermia is not a new technique, but its use in cancer treatment is still at an early stage of development. This treatment is based on overheating the target area to a threshold temperature that causes cancerous cell necrosis and apoptosis. To simulate this phenomenon using magnetic nanoparticles in an under-skin cancer treatment, a three-dimensional porous medium model was adopted. This study presents a sensitivity analysis of the model parameters, such as the porosity and blood velocity. To ensure a second-order solution approach, a 7-point centered finite difference method was used for spatial discretization, while a predictor-corrector method was used for time evolution. Due to the massive computations required to find the solution of a three-dimensional model, this paper also presents a first attempt to improve performance using OpenMP, a parallel programming API.
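
    A hedged sketch (not the paper's code) of the kernel shape implied above: a 7-point centered-difference update over a 3-D grid, parallelized with collapsed OpenMP loops; c0 and c1 stand in for the model's physical coefficients.

      #include <vector>

      void stencil7(const std::vector<double>& in, std::vector<double>& out,
                    int nx, int ny, int nz, double c0, double c1) {
        auto idx = [=](int i, int j, int k) { return (i * ny + j) * nz + k; };
        #pragma omp parallel for collapse(2) schedule(static)
        for (int i = 1; i < nx - 1; ++i)
          for (int j = 1; j < ny - 1; ++j)
            for (int k = 1; k < nz - 1; ++k)       // inner loop stays serial
              out[idx(i,j,k)] = c0 * in[idx(i,j,k)]
                  + c1 * (in[idx(i-1,j,k)] + in[idx(i+1,j,k)]
                        + in[idx(i,j-1,k)] + in[idx(i,j+1,k)]
                        + in[idx(i,j,k-1)] + in[idx(i,j,k+1)]);
      }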

  1. A feasibility study on porting the community land model onto accelerators using OpenACC

    DOE PAGES

    Wang, Dali; Wu, Wei; Winkler, Frank; ...

    2014-01-01

    As environmental models (such as the Accelerated Climate Model for Energy (ACME), the Parallel Reactive Flow and Transport Model (PFLOTRAN), the Arctic Terrestrial Simulator (ATS), etc.) have become more and more complicated, we face enormous challenges in porting those applications onto hybrid computing architectures. OpenACC appears to be a very promising technology; therefore, we have conducted a feasibility analysis of porting the Community Land Model (CLM), a terrestrial ecosystem model within the Community Earth System Model (CESM). Specifically, we used an automatic function testing platform to extract a small computing kernel out of CLM, applied this kernel within the actual CLM dataflow procedure, and investigated strategies for data parallelization and the benefit of the data movement provided by the current implementation of OpenACC. Even though it is a non-intensive kernel, on a single 16-core computing node, the performance (based on the actual computation time using one GPU) of the OpenACC implementation is 2.3 times faster than that of the OpenMP implementation using a single OpenMP thread, but 2.8 times slower than the performance of the OpenMP implementation using 16 threads. On multiple nodes, the MPI+OpenACC implementation demonstrated very good scalability on up to 128 GPUs on 128 computing nodes. This study also provides useful information for us to look into the potential benefits of the "deep copy" capability and "routine" feature of the OpenACC standards. In conclusion, we believe that our experience with the environmental model CLM can benefit many other scientific research programs interested in porting their large scale scientific codes onto high-end computers empowered by hybrid computing architectures using OpenACC.

  2. Numerical modeling of exciton-polariton Bose-Einstein condensate in a microcavity

    NASA Astrophysics Data System (ADS)

    Voronych, Oksana; Buraczewski, Adam; Matuszewski, Michał; Stobińska, Magdalena

    2017-06-01

    A novel, optimized numerical method for modeling an exciton-polariton superfluid in a semiconductor microcavity is proposed. Exciton-polaritons are spin-carrying quasiparticles formed from photons strongly coupled to excitons. They possess unique properties, interesting from the point of view of fundamental research as well as numerous potential applications. However, their numerical modeling is challenging due to the structure of the nonlinear differential equations describing their evolution. In this paper, we propose to solve the equations with a modified Runge-Kutta method of 4th order, further optimized for efficient computations. The algorithms were implemented in the form of C++ programs fitted for parallel environments and utilizing vector instructions. The programs form the EPCGP suite, which has been used for theoretical investigation of exciton-polaritons.

    Program summary:
    Catalogue identifier: AFBQ_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AFBQ_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: BSD-3
    No. of lines in distributed program, including test data, etc.: 2157
    No. of bytes in distributed program, including test data, etc.: 498994
    Distribution format: tar.gz
    Programming language: C++ with OpenMP extensions (main numerical program), Python (helper scripts)
    Computer: Modern PC (tested on AMD and Intel processors), HP BL2x220
    Operating system: Unix/Linux and Windows
    Has the code been vectorized or parallelized?: Yes (OpenMP)
    RAM: 200 MB for a single run
    Classification: 7, 7.7
    Nature of problem: An exciton-polariton superfluid is a novel, interesting physical system allowing investigation of high temperature Bose-Einstein condensation of exciton-polaritons, quasiparticles carrying spin. They have attracted a lot of attention due to their unique properties and potential applications in polariton-based optoelectronic integrated circuits. This is an out-of-equilibrium quantum system confined within a semiconductor microcavity. It is described by a set of nonlinear differential equations similar in spirit to the Gross-Pitaevskii (GP) equation, but their unique properties do not allow standard GP solving frameworks to be utilized. Finding an accurate and efficient numerical algorithm, as well as developing optimized numerical software, is necessary for effective theoretical investigation of exciton-polaritons.
    Solution method: A Runge-Kutta method of 4th order was employed to solve the set of differential equations describing exciton-polariton superfluids. The method was fitted to the exciton-polariton equations and further optimized. The C++ programs utilize OpenMP extensions and vector operations in order to fully utilize the computer hardware.
    Running time: 6 h for 100 ps of evolution, depending on the values of parameters

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hornung, Richard D.; Jones, Holger E.

    The RAJA Performance Suite is designed to evaluate performance of the RAJA performance portability library on a wide variety of important high performance computing (HPC) algorithmic kernels. These kernels assess compiler optimizations and various parallel programming model backends accessible through RAJA, such as OpenMP, CUDA, etc. The initial version of the suite contains 25 computational kernels, each of which appears in 6 variants: Baseline Sequential, RAJA Sequential, Baseline OpenMP, RAJA OpenMP, Baseline CUDA, RAJA CUDA. All variants of each kernel perform essentially the same mathematical operations and the loop body code for each kernel is identical across all variants. There are a few kernels, such as those that contain reduction operations, that require CUDA-specific coding for their CUDA variants. Actual computer instructions executed, and how they run in parallel, differ depending on the parallel programming model backend used and which optimizations are performed by the compiler used to build the Performance Suite executable. The Suite will be used primarily by RAJA developers to perform regular assessments of RAJA performance across a range of hardware platforms and compilers as RAJA features are being developed. It will also be used by LLNL hardware and software vendor partners for defining requirements for future computing platform procurements and acceptance testing. In particular, the RAJA Performance Suite will be used for compiler acceptance testing of the upcoming CORAL Sierra machine (initial LLNL delivery expected in late 2017/early 2018) and the CORAL-2 procurement. The Suite will also be used to generate concise source code reproducers of compiler and runtime issues we uncover so that we may provide them to relevant vendors to be fixed.
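
    The variant structure described above can be sketched as follows. The daxpy-style kernel is illustrative only (it is not one of the suite's 25 kernels), and the sketch assumes a RAJA build with OpenMP support.

      #include <RAJA/RAJA.hpp>

      // Baseline OpenMP variant: a plain directive on a plain loop.
      void daxpy_base(double* y, const double* x, double a, int n) {
          #pragma omp parallel for
          for (int i = 0; i < n; ++i) y[i] += a * x[i];
      }

      // RAJA variant: the loop body is unchanged; only the execution policy
      // selects the programming-model backend (OpenMP here, CUDA elsewhere).
      void daxpy_raja(double* y, const double* x, double a, int n) {
          RAJA::forall<RAJA::omp_parallel_for_exec>(
              RAJA::RangeSegment(0, n),
              [=](RAJA::Index_type i) { y[i] += a * x[i]; });
      }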

  4. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shipman, Galen M.

    These are the slides for a presentation on programming models in HPC, at the Los Alamos National Laboratory's Parallel Computing Summer School. The following topics are covered: Flynn's Taxonomy of computer architectures; single instruction single data; single instruction multiple data; multiple instruction multiple data; address space organization; definition of Trinity (Intel Xeon-Phi is a MIMD architecture); single program multiple data; multiple program multiple data; ExMatEx workflow overview; definition of a programming model, programming languages, runtime systems; programming model and environments; MPI (Message Passing Interface); OpenMP; Kokkos (Performance Portable Thread-Parallel Programming Model); Kokkos abstractions, patterns, policies, and spaces; RAJA, a systematic approach to node-level portability and tuning; overview of the Legion Programming Model; mapping tasks and data to hardware resources; interoperability: supporting task-level models; Legion S3D execution and performance details; workflow, integration of external resources into the programming model.

  5. Hierarchical resilience with lightweight threads.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wheeler, Kyle Bruce

    2011-10-01

    This paper proposes a methodology for providing robustness and resilience for a highly threaded distributed- and shared-memory environment based on well-defined inputs and outputs to lightweight tasks. These inputs and outputs form a failure 'barrier', allowing tasks to be restarted or duplicated as necessary. These barriers must be expanded based on task behavior, such as communication between tasks, but do not prohibit any given behavior. One trend in high-performance computing codes is toward self-contained functions that mimic functional programming. Software designers are moving toward a model of software design where their core functions are specified in side-effect-free or low-side-effect ways, wherein the inputs and outputs of the functions are well-defined. This provides the ability to copy the inputs to wherever they need to be - whether that's the other side of the PCI bus or the other side of the network - do work on that input using local memory, and then copy the outputs back (as needed). This design pattern is popular among new distributed threading environment designs. Such designs include the Barcelona STARS system, distributed OpenMP systems, the Habanero-C and Habanero-Java systems from Vivek Sarkar at Rice University, the HPX/ParalleX model from LSU, as well as our own Scalable Parallel Runtime effort (SPR) and the Trilinos stateless kernels. This design pattern is also shared by CUDA and several OpenMP extensions for GPU-type accelerators (e.g. the PGI OpenMP extensions).
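
    A minimal sketch of the well-defined inputs/outputs pattern described above, expressed with standard OpenMP tasks (illustrative; not the SPR runtime's actual API): each task reads only its explicit input and writes a single output slot, which is what would let a runtime restart or duplicate it after a failure.

      #include <cstddef>
      #include <vector>

      // Side-effect-free core function: all inputs and outputs are explicit.
      static double kernel(double x) { return x * x; }

      void run_tasks(const std::vector<double>& in, std::vector<double>& out) {
          #pragma omp parallel
          #pragma omp single
          for (std::size_t i = 0; i < in.size(); ++i) {
              #pragma omp task firstprivate(i) shared(in, out)
              out[i] = kernel(in[i]);   // input read once, one slot written
          }                             // implicit barrier completes all tasks
      }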

  6. A portable approach for PIC on emerging architectures

    NASA Astrophysics Data System (ADS)

    Decyk, Viktor

    2016-03-01

    A portable approach for designing Particle-in-Cell (PIC) algorithms on emerging exascale computers is based on the recognition that three distinct programming paradigms are needed. They are: low level vector (SIMD) processing, middle level shared memory parallel programming, and high level distributed memory programming. In addition, there is a memory hierarchy associated with each level. Such algorithms can be initially developed using vectorizing compilers, OpenMP, and MPI. This is the approach recommended by Intel for the Phi processor. These algorithms can then be translated and possibly specialized to other programming models and languages, as needed. For example, the vector processing and shared memory programming might be done with CUDA instead of vectorizing compilers and OpenMP, but generally the algorithm itself is not greatly changed. The UCLA PICKSC web site at http://www.idre.ucla.edu/ contains example open source skeleton codes (mini-apps) illustrating each of these three programming models, individually and in combination. Fortran 2003 now supports abstract data types, and design patterns can be used to support a variety of implementations within the same code base. Fortran 2003 also supports interoperability with C, so that implementations in C languages are also easy to use. Finally, main codes can be translated into dynamic environments such as Python, while still taking advantage of high performing compiled languages. Parallel languages are still evolving, with interesting developments in Co-Array Fortran, UPC, and OpenACC, among others, and these can also be supported within the same software architecture. Work supported by NSF and DOE grants.
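
    The three levels can be sketched in a single trivial particle push (hypothetical names; the real skeleton codes are on the PICKSC site): MPI owns the domain decomposition, while a combined OpenMP construct covers the thread and SIMD levels.

      #include <mpi.h>
      #include <vector>

      void push_particles(std::vector<double>& x, const std::vector<double>& v,
                          double dt, MPI_Comm comm) {
          int rank = 0;
          MPI_Comm_rank(comm, &rank);      // level 3: distributed memory (MPI)

          const long n = static_cast<long>(x.size());
          #pragma omp parallel for simd    // levels 2 and 1: threads + SIMD lanes
          for (long j = 0; j < n; ++j)
              x[j] += v[j] * dt;

          // Particles that crossed the local domain boundary would now be
          // exchanged with neighboring ranks via MPI point-to-point calls.
      }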

  7. Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

    NASA Astrophysics Data System (ADS)

    Hristov, Ivan; Goranov, Goran; Hristova, Radoslava

    2018-02-01

    We consider a standing-wave simulation, interesting from a computational point of view, obtained by solving coupled 2D perturbed sine-Gordon equations. We present an OpenMP realization which exploits both thread and SIMD levels of parallelism. We test the OpenMP program on two energy-equivalent Intel architectures: 2× Xeon E5-2695 v2 processors (code-named "Ivy Bridge-EP") in the HybriLIT cluster, and a Xeon Phi 7250 processor (code-named "Knights Landing", KNL). The results show 2 times better performance on the KNL processor.
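
    The two levels of parallelism described above follow a common pattern for explicit stencil updates, sketched below (illustrative only, not the authors' code): rows are distributed across threads and each row is vectorized with the simd directive.

      #include <cmath>

      void sine_gordon_step(double* unew, const double* u, const double* uold,
                            int nx, int ny, double c, double dt, double h) {
          const double r = c * dt * dt / (h * h);
          #pragma omp parallel for           // thread level: rows
          for (int i = 1; i < ny - 1; ++i) {
              #pragma omp simd               // SIMD level: points within a row
              for (int j = 1; j < nx - 1; ++j) {
                  const int k = i * nx + j;
                  const double lap =
                      u[k - 1] + u[k + 1] + u[k - nx] + u[k + nx] - 4.0 * u[k];
                  unew[k] = 2.0 * u[k] - uold[k] + r * lap
                            - dt * dt * std::sin(u[k]);   // sine-Gordon term
              }
          }
      }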

  8. Automatic Multilevel Parallelization Using OpenMP

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this paper we describe the extension of the CAPO (CAPtools (Computer Aided Parallelization Toolkit) OpenMP) parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.
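
    Directive nesting of the kind CAPO generates can be sketched with standard OpenMP alone (the NanosCompiler thread-group extensions are not shown, and the zone loop below is hypothetical): the outer level parallelizes across zones, the inner level across each zone's grid loop.

      #include <omp.h>

      void process_zones(double* zones[], int nzones, int npoints) {
          omp_set_max_active_levels(2);        // enable nested parallelism
          #pragma omp parallel for num_threads(2)
          for (int z = 0; z < nzones; ++z) {
              #pragma omp parallel for num_threads(4)
              for (int i = 0; i < npoints; ++i)
                  zones[z][i] *= 0.5;          // placeholder per-point work
          }
      }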

  9. Multilevel Parallelization of AutoDock 4.2.

    PubMed

    Norgan, Andrew P; Coffman, Paul K; Kocher, Jean-Pierre A; Katzmann, David J; Sosa, Carlos P

    2011-04-28

    Virtual (computational) screening is an increasingly important tool for drug discovery. AutoDock is a popular open-source application for performing molecular docking, the prediction of ligand-receptor interactions. AutoDock is a serial application, though several previous efforts have parallelized various aspects of the program. In this paper, we report on a multi-level parallelization of AutoDock 4.2 (mpAD4). Using MPI and OpenMP, AutoDock 4.2 was parallelized for use on MPI-enabled systems and to multithread the execution of individual docking jobs. In addition, code was implemented to reduce input/output (I/O) traffic by reusing grid maps at each node from docking to docking. Performance of mpAD4 was examined on two multiprocessor computers. Using MPI with OpenMP multithreading, mpAD4 scales nearly linearly on the multiprocessor systems tested. In situations where I/O is limiting, reuse of grid maps reduces both system I/O and overall screening time. Multithreading of AutoDock's Lamarckian Genetic Algorithm with OpenMP increases the speed of execution of individual docking jobs, and when combined with MPI parallelization can significantly reduce the execution time of virtual screens. This work is significant in that mpAD4 speeds the execution of certain molecular docking workloads and allows the user to optimize the degree of system-level (MPI) and node-level (OpenMP) parallelization to best fit both workloads and computational resources.
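
    The two-level decomposition described above can be sketched as follows (hypothetical function names, not mpAD4's actual code): MPI spreads independent docking jobs across ranks in round-robin order, and OpenMP multithreads the runs inside each job.

      #include <mpi.h>

      // Hypothetical stub for one independent genetic-algorithm docking run.
      void dock_one_run(int /*job*/, int /*run*/) { /* GA search (stub) */ }

      void run_screen(int njobs, int nruns, MPI_Comm comm) {
          int rank = 0, size = 1;
          MPI_Comm_rank(comm, &rank);
          MPI_Comm_size(comm, &size);
          for (int job = rank; job < njobs; job += size) {  // system level: MPI
              #pragma omp parallel for schedule(dynamic)    // node level: OpenMP
              for (int run = 0; run < nruns; ++run)
                  dock_one_run(job, run);
          }
      }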

  10. Fast Acceleration of 2D Wave Propagation Simulations Using Modern Computational Accelerators

    PubMed Central

    Wang, Wei; Xu, Lifan; Cavazos, John; Huang, Howie H.; Kay, Matthew

    2014-01-01

    Recent developments in modern computational accelerators like Graphics Processing Units (GPUs) and coprocessors provide great opportunities for making scientific applications run faster than ever before. However, efficient parallelization of scientific code using new programming tools like CUDA requires a high level of expertise that is not available to many scientists. This, plus the fact that parallelized code is usually not portable to different architectures, creates major challenges for exploiting the full capabilities of modern computational accelerators. In this work, we sought to overcome these challenges by studying how to achieve both automated parallelization using OpenACC and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as an Intel Many Integrated Core (MIC) coprocessor to reduce the run time of wave propagation simulations. We used a well-established 2D cardiac action potential model as a specific case-study. To the best of our knowledge, we are the first to study auto-parallelization of 2D cardiac wave propagation simulations using OpenACC. Our results identify several approaches that provide substantial speedups. The OpenACC-generated GPU code achieved a substantial speedup over the sequential implementation and required the addition of only a few OpenACC pragmas to the code. An OpenCL implementation provided GPU speedups over both the sequential implementation and a parallelized OpenMP implementation. An OpenMP implementation on the Intel MIC coprocessor provided speedups with only a few code changes to the sequential implementation. We highlight that OpenACC provides an automatic, efficient, and portable approach to achieve parallelization of 2D cardiac wave simulations on GPUs. Our approach of using OpenACC, OpenCL, and OpenMP to parallelize this particular model on modern computational accelerators should be applicable to other computational models of wave propagation in multi-dimensional media. PMID:24497950

  11. Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Jost, Gabriele

    2004-01-01

    In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms and discuss OpenMP implementation issues which affect the performance of multi-level parallel applications.

  12. Petascale computation performance of lightweight multiscale cardiac models using hybrid programming models.

    PubMed

    Pope, Bernard J; Fitch, Blake G; Pitman, Michael C; Rice, John J; Reumann, Matthias

    2011-01-01

    Future multiscale and multiphysics models must use the power of high performance computing (HPC) systems to enable research into human disease, translational medical science, and treatment. Previously we showed that computationally efficient multiscale models will require the use of sophisticated hybrid programming models, mixing distributed message passing processes (e.g. the message passing interface (MPI)) with multithreading (e.g. OpenMP, POSIX pthreads). The objective of this work is to compare the performance of such hybrid programming models when applied to the simulation of a lightweight multiscale cardiac model. Our results show that the hybrid models do not perform favourably when compared to an implementation using only MPI, which is in contrast to our results using complex physiological models. Thus, with regard to lightweight multiscale cardiac models, the user may not need to increase programming complexity by using a hybrid programming approach. However, considering that model complexity will increase, as will HPC system size in both node count and number of cores per node, it is still foreseeable that we will achieve faster-than-real-time multiscale cardiac simulations on these systems using hybrid programming models.

  13. A numerical differentiation library exploiting parallel architectures

    NASA Astrophysics Data System (ADS)

    Voglis, C.; Hadjidoukas, P. E.; Lagaris, I. E.; Papageorgiou, D. G.

    2009-08-01

    We present a software library for numerically estimating first and second order partial derivatives of a function by finite differencing. Various truncation schemes are offered resulting in corresponding formulas that are accurate to order O(h), O(h^2), and O(h^4), h being the differencing step. The derivatives are calculated via forward, backward and central differences. Care has been taken that only feasible points are used in the case where bound constraints are imposed on the variables. The Hessian may be approximated either from function or from gradient values. There are three versions of the software: a sequential version, an OpenMP version for shared memory architectures and an MPI version for distributed systems (clusters). The parallel versions exploit the multiprocessing capability offered by computer clusters, as well as modern multi-core systems and due to the independent character of the derivative computation, the speedup scales almost linearly with the number of available processors/cores. Program summary Program title: NDL (Numerical Differentiation Library) Catalogue identifier: AEDG_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEDG_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 73 030 No. of bytes in distributed program, including test data, etc.: 630 876 Distribution format: tar.gz Programming language: ANSI FORTRAN-77, ANSI C, MPI, OPENMP Computer: Distributed systems (clusters), shared memory systems Operating system: Linux, Solaris Has the code been vectorised or parallelized?: Yes RAM: The library uses O(N) internal storage, N being the dimension of the problem Classification: 4.9, 4.14, 6.5 Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, etc. The parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems. Solution method: Finite differencing is used with carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries. Restrictions: The library uses only double precision arithmetic. Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed. Running time: Running time depends on the function's complexity. The test run took 15 ms for the serial distribution, 0.6 s for the OpenMP and 4.2 s for the MPI parallel distribution on 2 processors.
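
    Because each component of a central-difference gradient needs two independent function evaluations, the loop parallelizes trivially, which is why the speedup scales almost linearly. A minimal sketch follows (not NDL's actual interface; it assumes the user function f is thread-safe).

      #include <functional>
      #include <vector>

      // Central-difference gradient, accurate to O(h^2). firstprivate(x)
      // gives each thread its own copy of x, so perturbing component i is
      // safe to do concurrently.
      std::vector<double> gradient(
          const std::function<double(const std::vector<double>&)>& f,
          std::vector<double> x, double h) {
          const int n = static_cast<int>(x.size());
          std::vector<double> g(n);
          #pragma omp parallel for firstprivate(x)
          for (int i = 0; i < n; ++i) {
              x[i] += h;        const double fp = f(x);
              x[i] -= 2.0 * h;  const double fm = f(x);
              x[i] += h;        // restore the thread-private copy
              g[i] = (fp - fm) / (2.0 * h);
          }
          return g;
      }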

  14. Data Race Benchmark Collection

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liao, Chunhua; Lin, Pei-Hung; Asplund, Joshua

    2017-03-21

    This project is a benchmark suite of OpenMP parallel codes that have been checked for data races. The programs are marked to show which do and do not have races. This allows them to be leveraged while testing and developing race detection tools.
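
    For illustration, the following shows the kind of defect such a suite marks, together with its race-free counterpart (a generic example, not one of the collection's benchmarks).

      // DATA RACE: all threads update 'sum' without protection.
      double racy_sum(const double* a, int n) {
          double sum = 0.0;
          #pragma omp parallel for
          for (int i = 0; i < n; ++i) sum += a[i];
          return sum;
      }

      // Race-free version: the reduction clause gives each thread a private
      // partial sum and combines them at the end of the loop.
      double fixed_sum(const double* a, int n) {
          double sum = 0.0;
          #pragma omp parallel for reduction(+ : sum)
          for (int i = 0; i < n; ++i) sum += a[i];
          return sum;
      }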

  15. Automatic Multilevel Parallelization Using OpenMP

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.

  16. Computer-Aided Parallelizer and Optimizer

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO is currently integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Because the OpenMP standard is widely supported, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops at the outermost level, construction and optimization of parallel regions around those loops, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
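
    The clauses mentioned above take the following form; since CAPO itself emits Fortran directives, this C++ analogue is purely illustrative of automatically identified private, reduction, and shared variables.

      // 't' is loop-local scratch -> private; 's' accumulates -> reduction;
      // the arrays are read/written by all threads -> shared.
      void smooth(double* out, const double* in, int n, double& norm) {
          double t;
          double s = 0.0;
          #pragma omp parallel for private(t) reduction(+ : s) shared(out, in)
          for (int i = 1; i < n - 1; ++i) {
              t = 0.5 * in[i] + 0.25 * (in[i - 1] + in[i + 1]);
              out[i] = t;
              s += t * t;
          }
          norm = s;
      }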

  17. CLOMP v1.5

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gyllenhaal, J.

    CLOMP is the C version of the Livermore OpenMP benchmark developed to measure OpenMP overheads and other performance impacts due to threading. For simplicity, it does not use MPI by default but it is expected to be run on the resources a threaded MPI task would use (e.g., a portion of a shared memory compute node). Compiling with -DWITH_MPI allows packing one or more nodes with CLOMP tasks and having CLOMP report OpenMP performance for the slowest MPI task. On current systems, the strong scaling performance results for 4, 8, or 16 threads are of the most interest. Suggested weak scaling inputs are provided for evaluating future systems. Since MPI is often used to place at least one MPI task per coherence or NUMA domain, it is recommended to focus OpenMP runtime measurements on a subset of node hardware where it is most possible to have low OpenMP overheads (e.g., within one coherence domain or NUMA domain).
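
    The core measurement such an overhead benchmark makes can be sketched as timing many launches of a near-empty parallel region (illustrative; CLOMP itself is considerably more elaborate).

      #include <omp.h>
      #include <cstdio>

      int main() {
          const int reps = 100000;
          int sink = 0;
          const double t0 = omp_get_wtime();
          for (int r = 0; r < reps; ++r) {
              #pragma omp parallel
              {
                  if (omp_get_thread_num() == 0) ++sink;  // trivial, race-free
              }
          }
          const double per_region = (omp_get_wtime() - t0) / reps;
          std::printf("avg cost per parallel region: %.3g s (sink=%d)\n",
                      per_region, sink);
          return 0;
      }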

  18. Final Report from The University of Texas at Austin for DEGAS: Dynamic Global Address Space programming environments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Erez, Mattan; Yelick, Katherine; Sarkar, Vivek

    The Dynamic, Exascale Global Address Space programming environment (DEGAS) project will develop the next generation of programming models and runtime systems to meet the challenges of Exascale computing. Our approach is to provide an efficient and scalable programming model that can be adapted to application needs through the use of dynamic runtime features and domain-specific languages for computational kernels. We address the following technical challenges: Programmability: Rich set of programming constructs based on a Hierarchical Partitioned Global Address Space (HPGAS) model, demonstrated in UPC++. Scalability: Hierarchical locality control, lightweight communication (extended GASNet), and efficient synchronization mechanisms (Phasers). Performance Portability: Just-in-time specialization (SEJITS) for generating hardware-specific code and scheduling libraries for domain-specific adaptive runtimes (Habanero). Energy Efficiency: Communication-optimal code generation to optimize energy efficiency by reducing data movement. Resilience: Containment Domains for flexible, domain-specific resilience, using state capture mechanisms and lightweight, asynchronous recovery mechanisms. Interoperability: Runtime and language interoperability with MPI and OpenMP to encourage broad adoption.

  19. The MOLDY short-range molecular dynamics package

    NASA Astrophysics Data System (ADS)

    Ackland, G. J.; D'Mellow, K.; Daraszewicz, S. L.; Hepburn, D. J.; Uhrin, M.; Stratford, K.

    2011-12-01

    We describe a parallelised version of the MOLDY molecular dynamics program. This Fortran code is aimed at systems which may be described by short-range potentials and specifically those which may be addressed with the embedded atom method. This includes a wide range of transition metals and alloys. MOLDY provides a range of options in terms of the molecular dynamics ensemble used and the boundary conditions which may be applied. A number of standard potentials are provided, and the modular structure of the code allows new potentials to be added easily. The code is parallelised using OpenMP and can therefore be run on shared memory systems, including modern multicore processors. Particular attention is paid to the updates required in the main force loop, where synchronisation is often required in OpenMP implementations of molecular dynamics. We examine the performance of the parallel code in detail and give some examples of applications to realistic problems, including the dynamic compression of copper and carbon migration in an iron-carbon alloy. Program summary Program title: MOLDY Catalogue identifier: AEJU_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEJU_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU General Public License version 2 No. of lines in distributed program, including test data, etc.: 382 881 No. of bytes in distributed program, including test data, etc.: 6 705 242 Distribution format: tar.gz Programming language: Fortran 95/OpenMP Computer: Any Operating system: Any Has the code been vectorised or parallelized?: Yes. OpenMP is required for parallel execution RAM: 100 MB or more Classification: 7.7 Nature of problem: MOLDY addresses the problem of many atoms (of order 10^6) interacting via a classical interatomic potential on a timescale of microseconds. It is designed for problems where statistics must be gathered over a number of equivalent runs, such as measuring thermodynamic properties, diffusion, radiation damage, fracture, twinning deformation, nucleation and growth of phase transitions, sputtering etc. In the vast majority of materials, the interactions are non-pairwise, and the code must be able to deal with many-body forces. Solution method: Molecular dynamics involves integrating Newton's equations of motion. MOLDY uses Verlet (for good energy conservation) or predictor-corrector (for accurate trajectories) algorithms. It is parallelised using OpenMP. It also includes a static minimisation routine to find the lowest energy structure. Boundary conditions for surfaces, clusters, grain boundaries, thermostat (Nose), barostat (Parrinello-Rahman), and externally applied strain are provided. The initial configuration can be either a repeated unit cell or have all atoms given explicitly. Initial velocities are generated internally, but it is also possible to specify the velocity of a particular atom. A wide range of interatomic force models are implemented, including embedded atom, Morse or Lennard-Jones. Thus the program is especially well suited to calculations of metals. Restrictions: The code is designed for short-ranged potentials, and there is no Ewald sum. Thus for long range interactions where all particles interact with all others, the order-N scaling will fail. Different interatomic potential forms require recompilation of the code. Additional comments: There is a set of associated open-source analysis software for postprocessing and visualisation.
This includes local crystal structure recognition and identification of topological defects. Running time: A set of test modules for running time are provided. The code scales as order N. The parallelisation shows near-linear scaling with number of processors in a shared memory environment. A typical run of a few tens of nanometers for a few nanoseconds will run on a timescale of days on a multiprocessor desktop.

  20. Clomp

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gyllenhaal, J.; Bronevetsky, G.

    2007-05-25

    CLOMP is the C version of the Livermore OpenMP benchmark developed to measure OpenMP overheads and other performance impacts due to threading (like NUMA memory layouts, memory contention, cache effects, etc.) in order to influence future system design. Current best-in-class implementations of OpenMP have overheads at least ten times larger than is required by many of our applications for effective use of OpenMP. This benchmark shows the significant negative performance impact of these relatively large overheads and of other thread effects. The CLOMP benchmark is highly configurable to allow a variety of problem sizes and threading effects to be studied, and it carefully checks its results to catch many common threading errors. This benchmark is expected to be included as part of the Sequoia Benchmark suite for the Sequoia procurement.

  1. Parallelization of elliptic solver for solving 1D Boussinesq model

    NASA Astrophysics Data System (ADS)

    Tarwidi, D.; Adytia, D.

    2018-03-01

    In this paper, a parallel implementation of an elliptic solver for the 1D Boussinesq model is presented. The numerical solution of the Boussinesq model is obtained by implementing a staggered grid scheme for the continuity, momentum, and elliptic equations of the model. The tridiagonal system emerging from the numerical scheme of the elliptic equation is solved by the cyclic reduction algorithm. The parallel implementation of cyclic reduction is executed on multicore processors with shared memory architectures using OpenMP. To measure the performance of the parallel program, the number of grid points is varied from 2^8 to 2^14. Two numerical test cases, i.e. propagation of a solitary wave and of a standing wave, are used to evaluate the parallel program. The numerical results are verified against the analytical solutions for the solitary and standing waves. The best speedups for the solitary and standing wave test cases are about 2.07 with 2^14 grid points and 1.86 with 2^13 grid points, respectively, executed using 8 threads. Moreover, the best efficiency of the parallel program is 76.2% and 73.5% for the solitary and standing wave test cases, respectively.
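
    One forward-reduction level of the cyclic reduction algorithm parallelizes naturally with OpenMP, because every row updated at a given stride depends only on rows left untouched at that level. A sketch follows (assumed sub-, main, and super-diagonals a, b, c and right-hand side d; not the authors' code).

      #include <vector>

      void cr_level(std::vector<double>& a, std::vector<double>& b,
                    std::vector<double>& c, std::vector<double>& d, int stride) {
          const int n = static_cast<int>(b.size());
          // Rows 2*stride-1, 4*stride-1, ... read only rows at +/- stride,
          // which this level never writes, so the iterations are independent.
          #pragma omp parallel for
          for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
              const bool has_up = (i + stride < n);
              const double al = -a[i] / b[i - stride];
              const double be = has_up ? -c[i] / b[i + stride] : 0.0;
              d[i] += al * d[i - stride] + (has_up ? be * d[i + stride] : 0.0);
              b[i] += al * c[i - stride] + (has_up ? be * a[i + stride] : 0.0);
              a[i]  = al * a[i - stride];
              c[i]  = has_up ? be * c[i + stride] : 0.0;
          }
      }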

  2. A multi-threaded version of MCFM

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Campbell, John M.; Ellis, R. Keith; Giele, Walter T.

    We report on our findings modifying MCFM using OpenMP to implement multi-threading. By using OpenMP, the modified MCFM will execute on any processor, automatically adjusting to the number of available threads. We then modified the integration routine VEGAS to distribute the event evaluation over the threads, while combining all events at the end of every iteration to optimize the numerical integration. Furthermore, we took special care so that the results of the Monte Carlo integration were independent of the number of threads used, to facilitate the validation of the OpenMP version of MCFM.
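
    One common way to make a Monte Carlo integral bitwise-independent of the thread count, sketched below, is to store each event's weight at a fixed index and combine the results in a fixed serial order. This is only an illustration of the idea, not MCFM's actual code, and it assumes the weight is a pure function of the event index (e.g., via a counter-based random number generator).

      #include <numeric>
      #include <vector>

      double integrate(int nevents, double (*weight)(int)) {
          std::vector<double> w(nevents);
          #pragma omp parallel for schedule(static)
          for (int e = 0; e < nevents; ++e)
              w[e] = weight(e);          // event e always lands in slot e
          // Serial, fixed-order sum: identical for any number of threads.
          return std::accumulate(w.begin(), w.end(), 0.0);
      }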

  3. Cellular automata-based modelling and simulation of biofilm structure on multi-core computers.

    PubMed

    Skoneczny, Szymon

    2015-01-01

    The article presents a mathematical model of biofilm growth for aerobic biodegradation of a toxic carbonaceous substrate. Modelling of biofilm growth has fundamental significance in numerous processes of biotechnology and mathematical modelling of bioreactors. The process following double-substrate kinetics with substrate inhibition proceeding in a biofilm has not been modelled so far by means of cellular automata. Each process in the model proposed, i.e. diffusion of substrates, uptake of substrates, growth and decay of microorganisms and biofilm detachment, is simulated in a discrete manner. It was shown that for flat biofilm of constant thickness, the results of the presented model agree with those of a continuous model. The primary outcome of the study was to propose a mathematical model of biofilm growth; however a considerable amount of focus was also placed on the development of efficient algorithms for its solution. Two parallel algorithms were created, differing in the way computations are distributed. Computer programs were created using OpenMP Application Programming Interface for C++ programming language. Simulations of biofilm growth were performed on three high-performance computers. Speed-up coefficients of computer programs were compared. Both algorithms enabled a significant reduction of computation time. It is important, inter alia, in modelling and simulation of bioreactor dynamics.

  4. A Locality-Based Threading Algorithm for the Configuration-Interaction Method

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shan, Hongzhang; Williams, Samuel; Johnson, Calvin

    The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality, which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threads on the 64-core Intel Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on the Knights Landing processor and 3× on the dual-socket Ivy Bridge node.

  5. A Locality-Based Threading Algorithm for the Configuration-Interaction Method

    DOE PAGES

    Shan, Hongzhang; Williams, Samuel; Johnson, Calvin; ...

    2017-07-03

    The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality, which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threads on the 64-core Intel Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on the Knights Landing processor and 3× on the dual-socket Ivy Bridge node.

  6. Exact diagonalization of quantum lattice models on coprocessors

    NASA Astrophysics Data System (ADS)

    Siro, T.; Harju, A.

    2016-10-01

    We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.

  7. Implementing Shared Memory Parallelism in MCBEND

    NASA Astrophysics Data System (ADS)

    Bird, Adam; Long, David; Dobson, Geoff

    2017-09-01

    MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheeler's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To more effectively utilise parallel hardware, OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND and assesses the performance of the parallel method implemented in MCBEND.

  8. Use Computer-Aided Tools to Parallelize Large CFD Applications

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Yan, J.

    2000-01-01

    Porting applications to high performance parallel computers is always a challenging task. It is time consuming and costly. With rapid progress in hardware architectures and the increasing complexity of real applications in recent years, the problem has become even more severe. Today, scalability and high performance mostly involve handwritten parallel programs using message-passing libraries (e.g. MPI). However, this process is very difficult and often error-prone. The recent reemergence of shared memory parallel (SMP) architectures, such as the cache coherent Non-Uniform Memory Access (ccNUMA) architecture used in the SGI Origin 2000, shows good prospects for scaling beyond hundreds of processors. Programming on an SMP is simplified by working in a globally accessible address space. The user can supply compiler directives, such as OpenMP, to parallelize the code. As an industry standard for portable implementation of parallel programs for SMPs, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared memory parallelism. It promises an incremental path for parallel conversion of existing software, as well as scalability and performance for a complete rewrite or an entirely new development. Perhaps the main disadvantage of programming with directives is that inserted directives may not necessarily enhance performance. In the worst cases, they can create erroneous results. While vendors have provided tools to perform error-checking and profiling, automation in directive insertion is very limited and often fails on large programs, primarily due to the lack of a thorough enough data dependence analysis. To overcome this deficiency, we have developed a toolkit, CAPO, to automatically insert OpenMP directives in Fortran programs and apply certain degrees of optimization. CAPO is aimed at taking advantage of the detailed interprocedural dependence analysis provided by CAPTools, developed by the University of Greenwich, to reduce potential errors made by users. Earlier tests on the NAS Benchmarks and ARC3D demonstrated good success of this tool. In this study, we have applied CAPO to parallelize three large applications in the area of computational fluid dynamics (CFD): OVERFLOW, TLNS3D and INS3D. These codes are widely used for solving the Navier-Stokes equations with complicated boundary conditions and turbulence models in multiple zones. Each comprises from 50K to 100K lines of FORTRAN 77. As an example, CAPO took 77 hours to complete the data dependence analysis of OVERFLOW on a workstation (SGI, 175 MHz, R10K processor). A fair amount of effort was spent on correcting false dependencies due to lack of necessary knowledge during the analysis. Even so, CAPO provides an easy way for the user to interact with the parallelization process. The OpenMP version was generated within a day after the analysis was completed. Due to the sequential algorithms involved, code sections in TLNS3D and INS3D needed to be restructured by hand to produce more efficient parallel codes. An included figure shows preliminary test results of the generated OVERFLOW with several test cases in a single zone. The MPI data points for the small test case were taken from a hand-coded MPI version. As we can see, CAPO's version achieved an 18-fold speedup on 32 nodes of the SGI O2K. For the small test case, it outperformed the MPI version. These results are very encouraging, but further work is needed.
For example, although CAPO attempts to place directives on the outermost parallel loops in an interprocedural framework, it does not insert directives based on the best manual strategy. In particular, it lacks support for parallelization at the multi-zone level. Future work will emphasize the development of methodology to work at the multi-zone level and with a hybrid approach. Development of tools to perform more complicated code transformations is also needed.

  9. Code Parallelization with CAPO: A User Manual

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)

    2001-01-01

    A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software - in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmarks to real-world applications is presented. This demonstrates the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experiences with this toolkit.

  10. A Wideband Fast Multipole Method for the two-dimensional complex Helmholtz equation

    NASA Astrophysics Data System (ADS)

    Cho, Min Hyung; Cai, Wei

    2010-12-01

    A Wideband Fast Multipole Method (FMM) for the 2D Helmholtz equation is presented. It can evaluate the interactions between N particles governed by the fundamental solution of 2D complex Helmholtz equation in a fast manner for a wide range of complex wave number k, which was not easy with the original FMM due to the instability of the diagonalized conversion operator. This paper includes the description of theoretical backgrounds, the FMM algorithm, software structures, and some test runs. Program summaryProgram title: 2D-WFMM Catalogue identifier: AEHI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHI_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 4636 No. of bytes in distributed program, including test data, etc.: 82 582 Distribution format: tar.gz Programming language: C Computer: Any Operating system: Any operating system with gcc version 4.2 or newer Has the code been vectorized or parallelized?: Multi-core processors with shared memory RAM: Depending on the number of particles N and the wave number k Classification: 4.8, 4.12 External routines: OpenMP ( http://openmp.org/wp/) Nature of problem: Evaluate interaction between N particles governed by the fundamental solution of 2D Helmholtz equation with complex k. Solution method: Multilevel Fast Multipole Algorithm in a hierarchical quad-tree structure with cutoff level which combines low frequency method and high frequency method. Running time: Depending on the number of particles N, wave number k, and number of cores in CPU. CPU time increases as N log N.

  11. Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Haoqiang; anMey, Dieter; Hatay, Ferhat F.

    2003-01-01

    With the advent of parallel hardware and software technologies, users are faced with the challenge of choosing a programming paradigm best suited for the underlying computer architecture. With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level. Which programming paradigm is best depends on the nature of the given problem, the hardware architecture, and the available software. In this study we compare different programming paradigms for the parallelization of a selected benchmark application on a cluster of SMP nodes. We compare the timings of different implementations of the same CFD benchmark application employing the same numerical algorithm on a cluster of Sun Fire SMP nodes. The rest of the paper is structured as follows: In section 2 we briefly discuss the programming models under consideration. We describe our compute platform in section 3. The different implementations of our benchmark code are described in section 4 and the performance results are presented in section 5. We conclude our study in section 6.

  12. NDL-v2.0: A new version of the numerical differentiation library for parallel architectures

    NASA Astrophysics Data System (ADS)

    Hadjidoukas, P. E.; Angelikopoulos, P.; Voglis, C.; Papageorgiou, D. G.; Lagaris, I. E.

    2014-07-01

    We present a new version of the numerical differentiation library (NDL) used for the numerical estimation of first and second order partial derivatives of a function by finite differencing. In this version we have restructured the serial implementation of the code so as to achieve optimal task-based parallelization. The pure shared-memory parallelization of the library has been based on the lightweight OpenMP tasking model allowing for the full extraction of the available parallelism and efficient scheduling of multiple concurrent library calls. On multicore clusters, parallelism is exploited by means of TORC, an MPI-based multi-threaded tasking library. The new MPI implementation of NDL provides optimal performance in terms of function calls and, furthermore, supports asynchronous execution of multiple library calls within legacy MPI programs. In addition, a Python interface has been implemented for all cases, exporting the functionality of our library to sequential Python codes. Catalog identifier: AEDG_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEDG_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 63036 No. of bytes in distributed program, including test data, etc.: 801872 Distribution format: tar.gz Programming language: ANSI Fortran-77, ANSI C, Python. Computer: Distributed systems (clusters), shared memory systems. Operating system: Linux, Unix. Has the code been vectorized or parallelized?: Yes. RAM: The library uses O(N) internal storage, N being the dimension of the problem. It can use up to O(N2) internal storage for Hessian calculations, if a task throttling factor has not been set by the user. Classification: 4.9, 4.14, 6.5. Catalog identifier of previous version: AEDG_v1_0 Journal reference of previous version: Comput. Phys. Comm. 180(2009)1404 Does the new version supersede the previous version?: Yes Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, and sensitivity analysis. For a large number of scientific and engineering applications, the underlying functions correspond to simulation codes for which analytical estimation of derivatives is difficult or almost impossible. A parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems. Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries. Reasons for new version: The updated version was motivated by our endeavors to extend a parallel Bayesian uncertainty quantification framework [1], by incorporating higher order derivative information as in most state-of-the-art stochastic simulation methods such as Stochastic Newton MCMC [2] and Riemannian Manifold Hamiltonian MC [3]. The function evaluations are simulations with significant time-to-solution, which also varies with the input parameters such as in [1, 4]. The runtime of the N-body-type of problem changes considerably with the introduction of a longer cut-off between the bodies. 
In the first version of the library, the OpenMP-parallel subroutines spawn a new team of threads and distribute the function evaluations with a PARALLEL DO directive. This limits the functionality of the library as multiple concurrent calls require nested parallelism support from the OpenMP environment. Therefore, either their function evaluations will be serialized or processor oversubscription is likely to occur due to the increased number of OpenMP threads. In addition, the Hessian calculations include two explicit parallel regions that compute first the diagonal and then the off-diagonal elements of the array. Due to the barrier between the two regions, the parallelism of the calculations is not fully exploited. These issues have been addressed in the new version by first restructuring the serial code and then running the function evaluations in parallel using OpenMP tasks. Although the MPI-parallel implementation of the first version is capable of fully exploiting the task parallelism of the PNDL routines, it does not utilize the caching mechanism of the serial code and, therefore, performs some redundant function evaluations in the Hessian and Jacobian calculations. This can lead to: (a) higher execution times if the number of available processors is lower than the total number of tasks, and (b) significant energy consumption due to wasted processor cycles. Overcoming these drawbacks, which become critical as the time of a single function evaluation increases, was the primary goal of this new version. Due to the code restructure, the MPI-parallel implementation (and the OpenMP-parallel in accordance) avoids redundant calls, providing optimal performance in terms of the number of function evaluations. Another limitation of the library was that the library subroutines were collective and synchronous calls. In the new version, each MPI process can issue any number of subroutines for asynchronous execution. We introduce two library calls that provide global and local task synchronizations, similarly to the BARRIER and TASKWAIT directives of OpenMP. The new MPI-implementation is based on TORC, a new tasking library for multicore clusters [5-7]. TORC improves the portability of the software, as it relies exclusively on the POSIX-Threads and MPI programming interfaces. It allows MPI processes to utilize multiple worker threads, offering a hybrid programming and execution environment similar to MPI+OpenMP, in a completely transparent way. Finally, to further improve the usability of our software, a Python interface has been implemented on top of both the OpenMP and MPI versions of the library. This allows sequential Python codes to exploit shared and distributed memory systems. Summary of revisions: The revised code improves the performance of both parallel (OpenMP and MPI) implementations. The functionality and the user-interface of the MPI-parallel version have been extended to support the asynchronous execution of multiple PNDL calls, issued by one or multiple MPI processes. A new underlying tasking library increases portability and allows MPI processes to have multiple worker threads. For both implementations, an interface to the Python programming language has been added. Restrictions: The library uses only double precision arithmetic. The MPI implementation assumes the homogeneity of the execution environment provided by the operating system. 
Specifically, the processes of a single MPI application must have identical address space and a user function resides at the same virtual address. In addition, address space layout randomization should not be used for the application. Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed. Running time: Running time depends on the function's complexity. The test run took 23 ms for the serial distribution, 25 ms for the OpenMP with 2 threads, 53 ms and 1.01 s for the MPI parallel distribution using 2 threads and 2 processes respectively and yield-time for idle workers equal to 10 ms. References: [1] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Bayesian uncertainty quantification and propagation in molecular dynamics simulations: a high performance computing framework, J. Chem. Phys 137 (14). [2] H.P. Flath, L.C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations, SIAM J. Sci. Comput. 33 (1) (2011) 407-432. [3] M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73 (2) (2011) 123-214. [4] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Data driven, predictive molecular dynamics for nanoscale flow simulations under uncertainty, J. Phys. Chem. B 117 (47) (2013) 14808-14816. [5] P.E. Hadjidoukas, E. Lappas, V.V. Dimakopoulos, A runtime library for platform-independent task parallelism, in: PDP, IEEE, 2012, pp. 229-236. [6] C. Voglis, P.E. Hadjidoukas, D.G. Papageorgiou, I. Lagaris, A parallel hybrid optimization algorithm for fitting interatomic potentials, Appl. Soft Comput. 13 (12) (2013) 4481-4492. [7] P.E. Hadjidoukas, C. Voglis, V.V. Dimakopoulos, I. Lagaris, D.G. Papageorgiou, Supporting adaptive and irregular parallelism for non-linear numerical optimization, Appl. Math. Comput. 231 (2014) 544-559.
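
    The restructuring described above, replacing nested parallel regions with OpenMP tasks, can be sketched as follows (illustrative, not PNDL's actual code): function evaluations become tasks, so multiple concurrent library calls share one thread pool instead of nesting parallel regions, and taskwait gives the local synchronization the text mentions.

      #include <omp.h>

      void eval_points(double (*f)(const double*), const double* pts,
                       double* out, int npts, int dim) {
          #pragma omp parallel
          #pragma omp single
          {
              for (int i = 0; i < npts; ++i) {
                  #pragma omp task firstprivate(i)
                  out[i] = f(&pts[i * dim]);   // one evaluation per task
              }
              #pragma omp taskwait   // local synchronization point
          }
      }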

  13. Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2015-04-01

    PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement, either processors are divided among the newly generated tasks (number of new tasks < number of processors) or tasks are divided among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load imbalance at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195

  14. OpenMP parallelization of a gridded SWAT (SWATG)

    NASA Astrophysics Data System (ADS)

    Zhang, Ying; Hou, Jinliang; Cao, Yongpan; Gu, Juan; Huang, Chunlin

    2017-12-01

    Large-scale, long-term, high-spatial-resolution simulation is a common challenge in environmental modeling. A gridded Hydrologic Response Unit (HRU)-based Soil and Water Assessment Tool (SWATG), which integrates a grid modeling scheme with different spatial representations, faces the same challenge: long run times limit applications of very high resolution, large-scale watershed modeling. The OpenMP (Open Multi-Processing) parallel application programming interface is integrated into SWATG (the result is called SWATGP) to accelerate grid modeling at the HRU level. This parallel implementation takes better advantage of the computational power of a shared-memory computer system. We conducted two experiments at multiple temporal and spatial scales of hydrological modeling using SWATG and SWATGP on a high-end server. At 500-m resolution, SWATGP was found to be up to nine times faster than SWATG in modeling a roughly 2000 km2 watershed using one CPU with 15 threads. The study results demonstrate that parallel models save considerable time relative to traditional sequential simulation runs. Parallel computation of environmental models is beneficial for model applications, especially at large spatial and temporal scales and at high resolutions. The proposed SWATGP model is thus a promising tool for large-scale and high-resolution water resources research and management, in addition to offering data fusion and model coupling ability.
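
    The acceleration exploits the independence of HRU computations within a single time step. A minimal OpenMP sketch of this loop-level pattern, with a hypothetical per-HRU kernel standing in for the SWATG physics:

        #include <math.h>

        #define N_HRU 100000
        static double runoff[N_HRU];

        /* hypothetical per-HRU water-balance kernel (placeholder arithmetic) */
        static double simulate_hru(int hru, int day) {
            return sin(0.001 * hru) + 0.01 * day;
        }

        void step_day(int day) {
            /* HRUs are independent within one daily step, so the loop is
               embarrassingly parallel; dynamic scheduling absorbs uneven cost */
            #pragma omp parallel for schedule(dynamic, 64)
            for (int h = 0; h < N_HRU; h++)
                runoff[h] = simulate_hru(h, day);
        }

        int main(void) { step_day(1); return 0; }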

  15. SToRM: A Model for 2D environmental hydraulics

    USGS Publications Warehouse

    Simões, Francisco J. M.

    2017-01-01

    A two-dimensional (depth-averaged) finite volume Godunov-type shallow water model developed for flow over complex topography is presented. The model, SToRM, is based on an unstructured cell-centered finite volume formulation and on nonlinear strong stability preserving Runge-Kutta time stepping schemes. The numerical discretization is founded on the classical and well-established shallow water equations in hyperbolic conservative form, but the convective fluxes are calculated using auto-switching Riemann and diffusive numerical fluxes. Computational efficiency is achieved through a parallel implementation based on the OpenMP standard and the Fortran programming language. SToRM’s implementation within a graphical user interface is discussed. Field application of SToRM is illustrated by utilizing it to estimate peak flow discharges in a flooding event of the St. Vrain Creek in Colorado, U.S.A., in 2013, which reached 850 m3/s (~30,000 ft3/s) at the location of this study.

  16. Parallel transformation of K-SVD solar image denoising algorithm

    NASA Astrophysics Data System (ADS)

    Liang, Youwen; Tian, Yu; Li, Mei

    2017-02-01

    The images obtained by observing the sun through a large telescope always suffer from noise due to low SNR. The K-SVD denoising algorithm can effectively remove Gaussian white noise, but training dictionaries for sparse representations is a time-consuming task, due to the large size of the data involved and the complexity of the training algorithms. In this paper, OpenMP parallel programming is used to transform the serial algorithm into a parallel version, following a data-parallelism model. The biggest change is that multiple dictionary atoms, rather than one, are updated simultaneously. The denoising effect and acceleration performance were tested after completion of the parallel algorithm: the speedup of the program is 13.563 when using 16 cores. This parallel version can fully utilize multi-core CPU hardware resources, greatly reduces running time, and is easy to port to multi-core platforms.
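
    A minimal OpenMP sketch of the change described above - updating multiple dictionary atoms simultaneously instead of one at a time - with a placeholder atom update (illustrative, not the paper's code):

        #define N_ATOMS 256
        #define DIM 64
        static double dict[N_ATOMS][DIM];

        /* placeholder for the single-atom update (in K-SVD, a rank-1
           re-estimation of the atom from the residual of the signals
           that currently use it) */
        static void update_atom(int k) {
            for (int i = 0; i < DIM; i++)
                dict[k][i] *= 0.999;
        }

        void dictionary_update(void) {
            /* serial K-SVD updates atoms one by one; the parallel version
               treats updates as independent so that multiple atoms are
               updated simultaneously across threads */
            #pragma omp parallel for
            for (int k = 0; k < N_ATOMS; k++)
                update_atom(k);
        }

        int main(void) { dictionary_update(); return 0; }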

  17. OpenMP GNU and Intel Fortran programs for solving the time-dependent Gross-Pitaevskii equation

    NASA Astrophysics Data System (ADS)

    Young-S., Luis E.; Muruganandam, Paulsamy; Adhikari, Sadhan K.; Lončar, Vladimir; Vudragović, Dušan; Balaž, Antun

    2017-11-01

    We present the Open Multi-Processing (OpenMP) version of Fortran 90 programs for solving the Gross-Pitaevskii (GP) equation for a Bose-Einstein condensate in one, two, and three spatial dimensions, optimized for use with GNU and Intel compilers. We use the split-step Crank-Nicolson algorithm for imaginary- and real-time propagation, which enables efficient calculation of stationary and non-stationary solutions, respectively. The present OpenMP programs are designed for computers with multi-core processors and optimized for compiling with both the commercially-licensed Intel Fortran compiler and the popular free open-source GNU Fortran compiler. The programs are easy to use and are supplemented with helpful comments for the users. All input parameters are listed at the beginning of each program. Different output files provide physical quantities such as energy, chemical potential, root-mean-square sizes, densities, etc. We also present speedup test results for the new versions of the programs. Program files doi: http://dx.doi.org/10.17632/y8zk3jgn84.2 Licensing provisions: Apache License 2.0 Programming language: OpenMP GNU and Intel Fortran 90. Computer: Any multi-core personal computer or workstation with the appropriate OpenMP-capable Fortran compiler installed. Number of processors used: All available CPU cores on the executing computer. Journal reference of previous version: Comput. Phys. Commun. 180 (2009) 1888; ibid. 204 (2016) 209. Does the new version supersede the previous version?: Not completely. It does supersede previous Fortran programs from both references above, but not the OpenMP C programs from Comput. Phys. Commun. 204 (2016) 209. Nature of problem: The present Open Multi-Processing (OpenMP) Fortran programs, optimized for use with the commercially-licensed Intel Fortran and free open-source GNU Fortran compilers, solve the time-dependent nonlinear partial differential (GP) equation for a trapped Bose-Einstein condensate in one (1d), two (2d), and three (3d) spatial dimensions for six different trap symmetries: axially and radially symmetric traps in 3d, circularly symmetric traps in 2d, fully isotropic (spherically symmetric) and fully anisotropic traps in 2d and 3d, as well as 1d traps, where no spatial symmetry is considered. Solution method: We employ the split-step Crank-Nicolson algorithm to discretize the time-dependent GP equation in space and time. The discretized equation is then solved by imaginary- or real-time propagation, employing adequately small space and time steps, to yield the solution of stationary and non-stationary problems, respectively. Reasons for the new version: Previously published Fortran programs [1,2] have now become popular tools [3] for solving the GP equation. These programs have been translated to the C programming language [4] and later extended to the more complex scenario of dipolar atoms [5]. Now virtually all computers have multi-core processors and some have motherboards with more than one physical central processing unit (CPU), which may increase the number of available CPU cores on a single computer to several tens. The C programs have been adapted to be very fast on such multi-core modern computers using general-purpose graphics processing units (GPGPU) with Nvidia CUDA and on computer clusters using the Message Passing Interface (MPI) [6]. Nevertheless, previously developed Fortran programs are also commonly used for scientific computation and most of them use a single CPU core at a time in modern multi-core laptops, desktops, and workstations. 
Unless the Fortran programs are made capable of making efficient use of the available CPU cores, the solution of even a realistic dynamical 1d problem, not to mention the more complicated 2d and 3d problems, could be time consuming using the Fortran programs. Previously, we published auto-parallel Fortran programs [2] suitable for the Intel (but not GNU) compiler for solving the GP equation. Hence, the need for a full OpenMP version of the Fortran programs to reduce the execution time cannot be overemphasized. To address this issue, we provide here such OpenMP Fortran programs, optimized for both Intel and GNU Fortran compilers and capable of using all available CPU cores, which can significantly reduce the execution time. Summary of revisions: Previous Fortran programs [1] for solving the time-dependent GP equation in 1d, 2d, and 3d with different trap symmetries have been parallelized using the OpenMP interface to reduce the execution time on multi-core processors. There are six different trap symmetries considered, resulting in six programs for imaginary-time propagation and six for real-time propagation, totaling 12 programs included in the BEC-GP-OMP-FOR software package. All input data (number of atoms, scattering length, harmonic oscillator trap length, trap anisotropy, etc.) are conveniently placed at the beginning of each program, as before [2]. The present programs introduce a new input parameter, designated Number_of_Threads, which defines the number of CPU cores of the processor to be used in the calculation; a minimal sketch of this convention appears after this record. If one sets the value 0 for this parameter, all available CPU cores will be used. For the most efficient calculation it is advisable to leave one CPU core unused for the system's background jobs. For example, on a machine with 20 CPU cores, such as the one we used for testing, it is advisable to use up to 19 CPU cores. However, the total number of used CPU cores can be divided among more than one job. For instance, one can run three simulations simultaneously using 10, 4, and 5 CPU cores, respectively, thus totaling 19 used CPU cores on a 20-core computer. The Fortran source programs are located in the directory src, and can be compiled by the make command using the makefile in the root directory BEC-GP-OMP-FOR of the software package. Examples of produced output files can be found in the directory output, although some large density files are omitted to save space. The programs calculate the values of the actually used dimensionless nonlinearities from the physical input parameters, where the input parameters correspond to the same nonlinearity values as in the previously published programs [1], so that the output files of the old and new programs can be directly compared. The output files are conveniently named such that their contents can be easily identified, following the naming convention introduced in Ref. [2]. For example, a file named <code>-out.txt, where <code> is the name of the individual program, represents the general output file containing input data, time and space steps, nonlinearity, energy and chemical potential, and was named fort.7 in the old Fortran version of the programs [1]. A file named <code>-den.txt is the output file with the condensate density, which had the names fort.3 and fort.4 in the old Fortran version [1] for the imaginary- and real-time propagation programs, respectively. 
Other possible density outputs, such as the initial density, are commented out in the programs to have a simpler set of output files, but users can uncomment and re-enable them, if needed. In addition, there are output files for reduced (integrated) 1d and 2d densities for different programs. In the real-time programs there is also an output file reporting the dynamics of evolution of root-mean-square sizes after a perturbation is introduced. The supplied real-time programs solve the stationary GP equation, and then calculate the dynamics. As the imaginary-time programs are more accurate than the real-time programs for the solution of a stationary problem, one can first solve the stationary problem using the imaginary-time programs, adapt the real-time programs to read the pre-calculated wave function and then study the dynamics. In that case the parameter NSTP in the real-time programs should be set to zero and the space mesh and nonlinearity parameters should be identical in both programs. The reader is advised to consult our previous publication where a complete description of the output files is given [2]. A readme.txt file, included in the root directory, explains the procedure to compile and run the programs. We tested our programs on a workstation with two 10-core Intel Xeon E5-2650 v3 CPUs. The parameters used for testing are given in sample input files, provided in the corresponding directory together with the programs. In Table 1 we present wall-clock execution times for runs on 1, 6, and 19 CPU cores for programs compiled using Intel and GNU Fortran compilers. The corresponding columns "Intel speedup" and "GNU speedup" give the ratio of wall-clock execution times of runs on 1 and 19 CPU cores, and denote the actual measured speedup for 19 CPU cores. In all cases and for all numbers of CPU cores, although the GNU Fortran compiler gives excellent results, the Intel Fortran compiler turns out to be slightly faster. Note that during these tests we always ran only a single simulation on a workstation at a time, to avoid any possible interference issues. Therefore, the obtained wall-clock times are more reliable than the ones that could be measured with two or more jobs running simultaneously. We also studied the speedup of the programs as a function of the number of CPU cores used. The performance of the Intel and GNU Fortran compilers is illustrated in Fig. 1, where we plot the speedup and actual wall-clock times as functions of the number of CPU cores for 2d and 3d programs. We see that the speedup increases monotonically with the number of CPU cores in all cases and has large values (between 10 and 14 for 3d programs) for the maximal number of cores. This fully justifies the development of OpenMP programs, which enable much faster and more efficient solving of the GP equation. However, a slow saturation in the speedup with the further increase in the number of CPU cores is observed in all cases, as expected. The speedup tends to increase for programs in higher dimensions, as they become more complex and have to process more data. This is why the speedups of the supplied 2d and 3d programs are larger than those of 1d programs. Also, for a single program the speedup increases with the size of the spatial grid, i.e., with the number of spatial discretization points, since this increases the amount of calculations performed by the program. To demonstrate this, we tested the supplied real2d-th program and varied the number of spatial discretization points NX=NY from 20 to 1000. 
The measured speedup obtained when running this program on 19 CPU cores as a function of the number of discretization points is shown in Fig. 2. The speedup first increases rapidly with the number of discretization points and eventually saturates. Additional comments: Example inputs provided with the programs take less than 30 minutes to run on a workstation with two Intel Xeon E5-2650 v3 processors (2 QPI links, 10 CPU cores, 25 MB cache, 2.3 GHz).
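
As a sketch of the Number_of_Threads convention described in this record - written in C with the OpenMP runtime API for uniformity with the other examples in this document, although the package itself is Fortran 90 - the value 0 can be mapped to "use every available core":

    #include <omp.h>
    #include <stdio.h>

    /* 0 means "use all available cores", mirroring the convention above */
    static void set_threads(int number_of_threads) {
        if (number_of_threads == 0)
            number_of_threads = omp_get_num_procs();
        omp_set_num_threads(number_of_threads);
    }

    int main(void) {
        set_threads(0);
        #pragma omp parallel
        #pragma omp single
        printf("running with %d threads\n", omp_get_num_threads());
        return 0;
    }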

  18. OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms

    DOE PAGES

    Meng, Zhaoyi; Koniges, Alice; He, Yun Helen; ...

    2016-09-21

    In this paper, we investigate the OpenMP parallelization and optimization of two novel data classification algorithms. The new algorithms are based on graph and PDE solution techniques and provide significant accuracy and performance advantages over traditional data classification algorithms in serial mode. The methods leverage the Nystrom extension to calculate eigenvalues/eigenvectors of the graph Laplacian, and this is a self-contained module that can be used in conjunction with other graph-Laplacian-based methods such as spectral clustering. We use performance tools to collect the hotspots and memory access patterns of the serial codes and use OpenMP as the parallelization language to parallelize the most time-consuming parts. Where possible, we also use library routines. We then optimize the OpenMP implementations and detail the performance on traditional supercomputer nodes (in our case a Cray XC30), and test the optimization steps on emerging testbed systems based on Intel’s Knights Corner and Landing processors. We show both performance improvement and strong scaling behavior. Finally, a large number of optimization techniques and analyses are necessary before the algorithm reaches almost ideal scaling.

  19. A Non-Equilibrium Sediment Transport Model for Coastal Inlets and Navigation Channels

    DTIC Science & Technology

    2011-01-01

    exchange of water, sediment, and nutrients between estuaries and the ocean. Because of the multiple interacting forces (waves, wind, tide, river...in parallel using OpenMP. The CMS takes advantage of the Surface-water Modeling System (SMS) interface for grid generation and model setup, as well...as for plotting and post-processing (Zundel, 2000). The circulation model in the CMS (called CMS-Flow) computes the unsteady water level and

  20. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    NASA Astrophysics Data System (ADS)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit and offers the most efficient power consumption, while the multi-core CPU obtains the most effective computing performance. The computational speed of the Xeon Phi MIC processor approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it provides computational acceleration without significant changes to the source code.
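
    One shared-memory allocation technique relevant to this comparison is first-touch page placement on NUMA multi-core CPUs: a page is physically allocated on the node of the thread that first writes it, so arrays should be initialized with the same OpenMP schedule as the compute loops. A minimal sketch (illustrative, not the authors' code):

        #include <stdlib.h>

        int main(void) {
            size_t n = 10000000;
            double *rho = malloc(n * sizeof *rho);  /* pages not yet placed */

            /* first touch: each thread writes the pages it will later use,
               so they are allocated on that thread's NUMA node */
            #pragma omp parallel for schedule(static)
            for (size_t i = 0; i < n; i++)
                rho[i] = 0.0;

            /* subsequent loops with the same schedule access local pages */
            #pragma omp parallel for schedule(static)
            for (size_t i = 0; i < n; i++)
                rho[i] += 1.0;  /* placeholder for an SPH density update */

            free(rho);
            return 0;
        }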

  1. Real-time implementations of image segmentation algorithms on shared memory multicore architecture: a survey (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Akil, Mohamed

    2017-05-01

    Real-time processing is getting more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks in image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool and a very data-intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper focuses on a survey of approaches for the parallel implementation of sequential watershed algorithms on multicore general-purpose CPUs: homogeneous multicore processors with shared memory. To achieve an efficient parallel implementation, it is necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we compare various parallelizations of sequential watershed algorithms on shared memory multicore architectures. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models, comparing OpenMP (an application programming interface for multi-processing) with Pthreads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.

  2. GPU accelerated particle visualization with Splotch

    NASA Astrophysics Data System (ADS)

    Rivi, M.; Gheller, C.; Dykes, T.; Krokos, M.; Dolag, K.

    2014-07-01

    Splotch is a rendering algorithm for exploration and visual discovery in particle-based datasets coming from astronomical observations or numerical simulations. The strengths of the approach are production of high quality imagery and support for very large-scale datasets through an effective mix of the OpenMP and MPI parallel programming paradigms. This article reports our experiences in re-designing Splotch to exploit emerging HPC architectures, nowadays increasingly populated with GPUs. A performance model is introduced to guide our re-factoring of Splotch. A number of parallelization issues are discussed, in particular relating to race conditions and workload balancing, toward achieving optimal performance. Our implementation was accomplished by using the CUDA programming paradigm. Our strategy is founded on novel schemes achieving optimized data organization and classification of particles. We deploy a reference cosmological simulation to present performance results on acceleration gains and scalability. We finally outline our vision for future work developments including possibilities for further optimizations and exploitation of hybrid systems and emerging accelerators.
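
    The race conditions arise because many particles can contribute to the same image pixel. A minimal OpenMP sketch of the hazard and the atomic-update remedy (an illustration of the general pattern, not the Splotch source; the CUDA version would rely on analogous device atomics):

        #include <stdlib.h>

        #define W 1024
        #define H 768
        #define N 1000000
        static float image[W * H];
        static int px[N], py[N];
        static float val[N];

        void splat(void) {
            #pragma omp parallel for
            for (int i = 0; i < N; i++) {
                int idx = py[i] * W + px[i];
                /* several particles may map to the same pixel, so the
                   update must be atomic (or use per-thread images that
                   are reduced afterwards) */
                #pragma omp atomic
                image[idx] += val[i];
            }
        }

        int main(void) {
            for (int i = 0; i < N; i++) {
                px[i] = rand() % W;
                py[i] = rand() % H;
                val[i] = 1.0f;
            }
            splat();
            return 0;
        }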

  3. Optimized multiple quantum MAS lineshape simulations in solid state NMR

    NASA Astrophysics Data System (ADS)

    Brouwer, William J.; Davis, Michael C.; Mueller, Karl T.

    2009-10-01

    The majority of nuclei available for study in solid state Nuclear Magnetic Resonance have half-integer spin I>1/2, with corresponding electric quadrupole moment. As such, they may couple with a surrounding electric field gradient. This effect introduces anisotropic line broadening to spectra, arising from distinct chemical species within polycrystalline solids. In Multiple Quantum Magic Angle Spinning (MQMAS) experiments, a second frequency dimension is created, devoid of quadrupolar anisotropy. As a result, the center of gravity of peaks in the high resolution dimension is a function of isotropic second order quadrupole and chemical shift alone. However, for complex materials, these parameters take on a stochastic nature due in turn to structural and chemical disorder. Lineshapes may still overlap in the isotropic dimension, complicating the task of assignment and interpretation. A distributed computational approach is presented here which permits simulation of the two-dimensional MQMAS spectrum, generated by random variates from model distributions of isotropic chemical and quadrupole shifts. Owing to the non-convex nature of the residual sum of squares (RSS) function between experimental and simulated spectra, simulated annealing is used to optimize the simulation parameters. In this manner, local chemical environments for disordered materials may be characterized, and via a re-sampling approach, error estimates for parameters produced. Program summary: Program title: mqmasOPT Catalogue identifier: AEEC_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEC_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 3650 No. of bytes in distributed program, including test data, etc.: 73 853 Distribution format: tar.gz Programming language: C, OCTAVE Computer: UNIX/Linux Operating system: UNIX/Linux Has the code been vectorised or parallelized?: Yes RAM: Example: (1597 powder angles) × (200 Samples) × (81 F2 frequency pts) × (31 F1 frequency points) = 3.5M, SMP AMD Opteron Classification: 2.3 External routines: OCTAVE (http://www.gnu.org/software/octave/), GNU Scientific Library (http://www.gnu.org/software/gsl/), OpenMP (http://openmp.org/wp/) Nature of problem: The optimal simulation and modeling of multiple quantum magic angle spinning NMR spectra, for general systems, especially those with mild to significant disorder. The approach outlined and implemented in C and OCTAVE also produces model parameter error estimates. Solution method: A model for each distinct chemical site is first proposed, for the individual contribution of crystallite orientations to the spectrum. This model is averaged over all powder angles [1], as well as the (stochastic) parameters; isotropic chemical shift and quadrupole coupling constant. The latter is accomplished via sampling from a bi-variate Gaussian distribution, using the Box-Muller algorithm to transform Sobol (quasi) random numbers [2]. A simulated annealing optimization is performed, and finally the non-linear jackknife [3] is applied in developing model parameter error estimates. Additional comments: The distribution contains a script, mqmasOpt.m, which runs in the OCTAVE language workspace. Running time: Example: (1597 powder angles) × (200 Samples) × (81 F2 frequency pts) × (31 F1 frequency points) = 58.35 seconds, SMP AMD Opteron. 
    References: [1] S.K. Zaremba, Annali di Matematica Pura ed Applicata 73 (1966) 293. [2] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods, SIAM, 1992. [3] T. Fox, D. Hinkley, K. Larntz, Technometrics 22 (1980) 29.

  4. Rambrain - a library for virtually extending physical memory

    NASA Astrophysics Data System (ADS)

    Imgrund, Maximilian; Arth, Alexander

    2017-08-01

    We introduce Rambrain, a user space library that manages memory consumption of your code. Using Rambrain you can overcommit memory over the size of physical memory present in the system. Rambrain takes care of temporarily swapping out data to disk and can handle multiples of the physical memory size present. Rambrain is thread-safe, OpenMP and MPI compatible and supports Asynchronous IO. The library was designed to require minimal changes to existing programs and to be easy to use.

  5. Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

    NASA Technical Reports Server (NTRS)

    Lawson, Gary; Poteat, Michael; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

    2016-01-01

    In this work, several mini-apps have been created to enhance the performance of a real-world application, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with the Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23X was measured for MPI+SMPI, but only 10X was measured for MPI+OpenMP.

  6. Data preprocessing for determining outer/inner parallelization in the nested loop problem using OpenMP

    NASA Astrophysics Data System (ADS)

    Handhika, T.; Bustamam, A.; Ernastuti; Kerami, D.

    2017-07-01

    Multi-thread programming using OpenMP on a shared-memory architecture with hyperthreading technology allows a resource to be accessed by multiple processors simultaneously, with each processor executing more than one thread over a given period of time. However, the achievable speedup depends on the processor's ability to execute only a limited number of threads at once, which is a particular problem for sequential algorithms containing a nested loop in which the number of outer loop iterations exceeds the maximum number of threads a processor can execute. Previously known thread distribution techniques can only be applied by high-level programmers. This paper derives a parallelization procedure for low-level programmers dealing with 2-level nested loop problems in which the maximum number of threads that can be executed by a processor is smaller than the number of outer loop iterations. Data preprocessing related to the numbers of outer and inner loop iterations, the computational time required to execute each iteration, and the maximum number of threads that can be executed by a processor is used as a strategy to determine which parallel region will produce the optimal speedup.
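
    A minimal C sketch of the outer/inner decision for a 2-level nested loop - an illustration of the general idea, not the paper's exact procedure: parallelize the outer loop when its trip count covers the available threads, otherwise parallelize the inner loop:

        #include <omp.h>

        #define NO 1000  /* outer iterations */
        #define NI 200   /* inner iterations */
        static double a[NO][NI];

        void compute(void) {
            if (NO >= omp_get_max_threads()) {
                /* outer trip count covers all threads: parallelize outside */
                #pragma omp parallel for
                for (int i = 0; i < NO; i++)
                    for (int j = 0; j < NI; j++)
                        a[i][j] = i + j;
            } else {
                /* otherwise parallelize the inner loop on each outer pass */
                for (int i = 0; i < NO; i++) {
                    #pragma omp parallel for
                    for (int j = 0; j < NI; j++)
                        a[i][j] = i + j;
                }
            }
        }

        int main(void) { compute(); return 0; }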

  7. Extension of the AMBER molecular dynamics software to Intel's Many Integrated Core (MIC) architecture

    NASA Astrophysics Data System (ADS)

    Needham, Perri J.; Bhuiyan, Ashraf; Walker, Ross C.

    2016-04-01

    We present an implementation of explicit solvent particle mesh Ewald (PME) classical molecular dynamics (MD) within the PMEMD molecular dynamics engine, which forms part of the AMBER v14 MD software package and makes use of Intel Xeon Phi coprocessors by offloading portions of the PME direct summation and neighbor list build to the coprocessor. We refer to this implementation as pmemd MIC offload and in this paper present the technical details of the algorithm, including basic models for MPI and OpenMP configuration, and analyze the resultant performance. The algorithm provides the best performance improvement for large systems (>400,000 atoms), achieving a ∼35% performance improvement for satellite tobacco mosaic virus (1,067,095 atoms) when 2 Intel E5-2697 v2 processors (2×12 cores, 30M cache, 2.7 GHz) are coupled to an Intel Xeon Phi coprocessor (Model 7120P-1.238/1.333 GHz, 61 cores). The implementation utilizes a two-fold decomposition strategy: spatial decomposition using an MPI library and thread-based decomposition using OpenMP. We also present compiler optimization settings that improve performance on Intel Xeon processors while retaining simulation accuracy.

  8. Development of full wave code for modeling RF fields in hot non-uniform plasmas

    NASA Astrophysics Data System (ADS)

    Zhao, Liangji; Svidzinski, Vladimir; Spencer, Andrew; Kim, Jin-Soo

    2016-10-01

    FAR-TECH, Inc. is developing a full wave RF modeling code to model RF fields in fusion devices and in general plasma applications. As an important component of the code, an adaptive meshless technique is introduced to solve the wave equations, which allows resolving plasma resonances efficiently and adapting to the complexity of antenna geometry and device boundaries. The computational points are generated using either a point elimination method or a force balancing method based on the monitor function, which is calculated by solving the cold plasma dispersion equation locally. Another part of the code is the conductivity kernel calculation, used for modeling the nonlocal hot plasma dielectric response. The conductivity kernel is calculated on a coarse grid of test points and then interpolated linearly onto the computational points. All components of the code are parallelized using MPI and OpenMP libraries to optimize execution speed and memory usage. The algorithm and the results of our numerical approach to solving 2-D wave equations in a tokamak geometry will be presented. Work is supported by the U.S. DOE SBIR program.

  9. Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Trędak, Przemysław, E-mail: przemyslaw.tredak@fuw.edu.pl; Rudnicki, Witold R.; Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5a, 02-106 Warsaw

    The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronization issues that arise in computations of multi-body potentials. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems and compared to a highly optimized CPU implementation (both single core and OpenMP) available in the LAMMPS package. These experiments show up to a 6x improvement in forces computation time using a single processor of the NVIDIA Tesla K80 compared to a high-end 16-core Intel Xeon processor.

  10. Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications

    NASA Technical Reports Server (NTRS)

    O'Keefe, Matthew (Editor); Kerr, Christopher L. (Editor)

    1998-01-01

    This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).

  11. Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang; VanderWijngaart, Rob F.

    2003-01-01

    We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in benchmarks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.

  12. Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

    NASA Astrophysics Data System (ADS)

    Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

    2017-11-01

    The current trend in processor manufacturing focuses on multi-core architectures rather than on increasing clock speed for performance improvement. Graphics processors have become commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, and big data have created huge demand for data processing, and such throughput-intensive applications inherently contain data-level parallelism, which is well suited to SIMD-based GPU architectures. This paper reviews the architectural aspects of multi/many-core processors and graphics processors. Different case studies compare the performance of throughput computing applications using shared-memory programming in OpenMP and CUDA-based GPU programming.

  13. GPU accelerated implementation of NCI calculations using promolecular density.

    PubMed

    Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric

    2017-05-30

    The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The performance of three code versions is examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which drastically reduces the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.

  14. Roofline model toolkit: A practical tool for architectural and program analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lo, Yu Jung; Williams, Samuel; Van Straalen, Brian

    We present preliminary results of the Roofline Toolkit for multicore, many-core, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented microbenchmarks implemented with the Message Passing Interface (MPI) and OpenMP, used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measure sustained PCIe throughput with four GPU memory management mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.
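
    The Roofline model that these measurements feed bounds attainable performance by the lesser of peak compute throughput and the product of arithmetic intensity and memory bandwidth; in LaTeX notation:

        P_{\mathrm{attainable}} = \min\!\left( P_{\mathrm{peak}},\; I \cdot B_{\mathrm{mem}} \right),
        \qquad I = \frac{\text{floating-point operations}}{\text{bytes moved to/from memory}}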

  15. Performance Analysis of a Hybrid Overset Multi-Block Application on Multiple Architectures

    NASA Technical Reports Server (NTRS)

    Djomehri, M. Jahed; Biswas, Rupak

    2003-01-01

    This paper presents a detailed performance analysis of a multi-block overset grid computational fluid dynamics application on multiple state-of-the-art computer architectures. The application is implemented using a hybrid MPI+OpenMP programming paradigm that exploits both coarse and fine-grain parallelism; the former via MPI message passing and the latter via OpenMP directives. The hybrid model also extends the applicability of multi-block programs to large clusters of SMP nodes by overcoming the restriction that the number of processors be less than the number of grid blocks. A key kernel of the application, namely the LU-SGS linear solver, had to be modified to enhance the performance of the hybrid approach on the target machines. Investigations were conducted on cacheless Cray SX6 vector processors, cache-based IBM Power3 and Power4 architectures, and single system image SGI Origin3000 platforms. Overall results for complex vortex dynamics simulations demonstrate that the SX6 achieves the highest performance and outperforms the RISC-based architectures; however, the best scaling performance was achieved on the Power3.

  16. A heterogeneous computing accelerated SCE-UA global optimization method using OpenMP, OpenCL, CUDA, and OpenACC.

    PubMed

    Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Liang, Ke; Hong, Yang

    2017-10-01

    The shuffled complex evolution optimization developed at the University of Arizona (SCE-UA) has been successfully applied in various kinds of scientific and engineering optimization applications, such as hydrological model parameter calibration, for many years. The algorithm possesses good global optimality, convergence stability and robustness. However, benchmark and real-world applications reveal the poor computational efficiency of the SCE-UA. This research aims at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA is implemented on an Intel Xeon multi-core CPU (by using OpenMP and OpenCL) and an NVIDIA Tesla many-core GPU (by using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA were tested on the Griewank benchmark function. Comparison results indicate that the parallel SCE-UA significantly improves computational efficiency compared to the original serial version. The OpenCL implementation obtains the best overall acceleration results; however, it also has the most complex source code. The parallel SCE-UA has bright prospects to be applied in real-world applications.

  17. Variational data assimilation system "INM RAS - Black Sea"

    NASA Astrophysics Data System (ADS)

    Parmuzin, Eugene; Agoshkov, Valery; Assovskiy, Maksim; Giniatulin, Sergey; Zakharova, Natalia; Kuimov, Grigory; Fomin, Vladimir

    2013-04-01

    Development of Informational-Computational Systems (ICS) for data assimilation procedures is a multidisciplinary problem. To study and solve such problems one needs to apply modern results from different disciplines and recent developments in: mathematical modeling; the theory of adjoint equations and optimal control; inverse problems; numerical methods theory; numerical algebra; and scientific computing. The problems discussed above are studied at the Institute of Numerical Mathematics of the Russian Academy of Sciences (INM RAS) in ICS for personal computers (PC). Special problems and questions arise while effective ICS versions for PC are being developed. These problems and questions can be solved by applying modern methods of numerical mathematics and by solving the "parallelism problem" using OpenMP technology and special linear algebra packages. In this work the results on the development of the PC-based ICS "INM RAS - Black Sea" are presented. The following problems and questions are discussed: practical problems that can be studied with the ICS; parallelism problems and their solutions obtained by applying OpenMP technology and the linear algebra packages used in ICS "INM - Black Sea"; and the interface of the ICS. The results of testing ICS "INM RAS - Black Sea" are presented, and the efficiency of the technologies and methods applied is discussed. The work was supported by RFBR, grants No. 13-01-00753, 13-05-00715, and by the Ministry of Education and Science of the Russian Federation, project 8291, project 11.519.11.1005. References: [1] V.I. Agoshkov, M.V. Assovskii, S.A. Lebedev, Numerical simulation of Black Sea hydrothermodynamics taking into account tide-forming forces. Russ. J. Numer. Anal. Math. Modelling (2012) 27, No.1, 5-31. [2] E.I. Parmuzin, V.I. Agoshkov, Numerical solution of the variational assimilation problem for sea surface temperature in the model of the Black Sea dynamics. Russ. J. Numer. Anal. Math. Modelling (2012) 27, No.1, 69-94. [3] V.B. Zalesny, N.A. Diansky, V.V. Fomin, S.N. Moshonkin, S.G. Demyshev, Numerical model of the circulation of the Black Sea and the Sea of Azov. Russ. J. Numer. Anal. Math. Modelling (2012) 27, No.1, 95-111. [4] V.I. Agoshkov, S.V. Giniatulin, G.V. Kuimov, OpenMP technology and linear algebra packages in the variational data assimilation systems. Abstracts of the 1st China-Russia Conference on Numerical Algebra with Applications in Radiative Hydrodynamics, Beijing, China, October 16-18, 2012. [5] N.B. Zakharova, V.I. Agoshkov, E.I. Parmuzin, The new method of ARGO buoys system observation data interpolation. Russ. J. Numer. Anal. Math. Modelling (2013) 28, No.1.

  18. Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU

    NASA Astrophysics Data System (ADS)

    Trędak, Przemysław; Rudnicki, Witold R.; Majewski, Jacek A.

    2016-09-01

    The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronization issues that arise in computations of multi-body potentials. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems and compared to a highly optimized CPU implementation (both single core and OpenMP) available in the LAMMPS package. These experiments show up to a 6x improvement in forces computation time using a single processor of the NVIDIA Tesla K80 compared to a high-end 16-core Intel Xeon processor.

  19. ATDM LANL FleCSI: Topology and Execution Framework

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bergen, Benjamin Karl

    FleCSI is a compile-time configurable C++ framework designed to support multi-physics application development. As such, FleCSI attempts to provide a very general set of infrastructure design patterns that can be specialized and extended to suit the needs of a broad variety of solver and data requirements. This means that FleCSI is potentially useful to many different ECP projects. Current support includes multidimensional mesh topology, mesh geometry, and mesh adjacency information, n-dimensional hashed-tree data structures, graph partitioning interfaces, and dependency closures (to identify data dependencies between distributed-memory address spaces). FleCSI introduces a functional programming model with control, execution, and data abstractions that are consistent with state-of-the-art task-based runtimes such as Legion and Charm++. The model also provides support for fine-grained, data-parallel execution with backend support for runtimes such as OpenMP and C++17. The FleCSI abstraction layer provides the developer with insulation from the underlying runtimes, while allowing support for multiple runtime systems, including conventional models like asynchronous MPI. The intent is to give developers a concrete set of user-friendly programming tools that can be used now, while allowing flexibility in choosing runtime implementations and optimizations that can be applied to architectures and runtimes that arise in the future. This project is essential to the ECP Ristra Next-Generation Code project, part of ASC ATDM, because it provides a hierarchically parallel programming model that is consistent with the design of modern system architectures, but which allows for the straightforward expression of algorithmic parallelism in a portably performant manner.

  20. Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model

    NASA Astrophysics Data System (ADS)

    Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.

    2016-12-01

    The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled with an Hg transport module to investigate mercury pollution. In this study, we present our work on porting the GNAQPMS model to the Intel Xeon Phi processor Knights Landing (KNL) to accelerate the model. KNL is the second-generation product adopting the Many Integrated Core (MIC) architecture. Compared with the first-generation Knights Corner (KNC), KNL has new hardware features and can be used as a standalone processor as well as a coprocessor alongside other CPUs. Using the VTune profiling tool, the high-overhead modules in the GNAQPMS model were identified, including the CBMZ gas chemistry, the advection and convection module, and the wet deposition module. These high-overhead modules were accelerated by optimizing the code and using new features of KNL. The following optimization measures were taken: 1) changing the pure MPI parallel mode to a hybrid parallel mode with MPI and OpenMP; 2) vectorizing the code to use the 512-bit wide vector computation units; 3) reducing unnecessary memory access and calculation; 4) reducing Thread Local Storage (TLS) for common variables within each OpenMP thread in CBMZ; 5) changing global communication from file writing and reading to MPI functions. After optimization, the performance of GNAQPMS increased greatly on both the CPU and KNL platforms: single-node tests showed that the optimized version achieves a 2.6x speedup on a two-socket CPU platform and a 3.3x speedup on a one-socket KNL platform compared with the baseline code, which means KNL has a 1.29x speedup compared with the two-socket CPU platform.
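
    A minimal sketch of the hybrid mode introduced in measure 1, with MPI ranks across nodes and OpenMP threads within each rank (a generic illustration, not the GNAQPMS source):

        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            int provided, rank;
            /* FUNNELED: only the master thread makes MPI calls */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            #pragma omp parallel
            #pragma omp single
            printf("rank %d uses %d threads\n", rank, omp_get_num_threads());

            MPI_Finalize();
            return 0;
        }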

  1. MIST: An Open Source Environmental Modelling Programming Language Incorporating Easy to Use Data Parallelism.

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2014-05-01

    Model Integration System (MIST) is an open-source environmental modelling programming language that directly incorporates data parallelism. The language is designed to enable straightforward programming structures, such as nested loops and conditional statements, to be directly translated into sequences of whole-array (or more generally whole-data-structure) operations. MIST thus enables the programmer to use well-understood constructs, directly relating to the mathematical structure of the model, without having to explicitly vectorize code or worry about details of parallelization. A range of common modelling operations are supported by dedicated language structures operating on cell neighbourhoods rather than individual cells (e.g., the 3x3 local neighbourhood needed to implement an averaging image filter can be simply accessed from within a simple loop traversing all image pixels). This facility hides details of inter-process communication behind more mathematically relevant descriptions of model dynamics. The MIST automatic vectorization/parallelization process serves both to distribute work among available nodes and to control storage requirements for intermediate expressions, enabling operations on very large domains for which memory availability may be an issue. MIST is designed to facilitate efficient interpreter-based implementations. A prototype open-source interpreter is available, coded in standard FORTRAN 95, with tools to rapidly integrate existing FORTRAN 77 or 95 code libraries. The language is formally specified and thus not limited to a FORTRAN implementation or to an interpreter-based approach. A MIST-to-FORTRAN compiler is under development and volunteers are sought to create an ANSI-C implementation. Parallel processing is currently implemented using OpenMP. However, the parallelization code is fully modularised and could be replaced with implementations using other libraries. GPU implementation is potentially possible.
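
    A minimal OpenMP sketch, in C, of the kind of whole-array neighbourhood operation described above - the 3x3 averaging filter - written as the explicit loops a MIST-style translator could generate (illustrative; MIST itself is implemented in FORTRAN):

        #define NX 512
        #define NY 512
        static float in_img[NY][NX], out_img[NY][NX];

        /* explicit loop form of a whole-array 3x3 averaging filter
           over interior pixels */
        void box3x3(void) {
            #pragma omp parallel for
            for (int y = 1; y < NY - 1; y++)
                for (int x = 1; x < NX - 1; x++) {
                    float s = 0.0f;
                    for (int dy = -1; dy <= 1; dy++)
                        for (int dx = -1; dx <= 1; dx++)
                            s += in_img[y + dy][x + dx];
                    out_img[y][x] = s / 9.0f;
                }
        }

        int main(void) { box3x3(); return 0; }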

  2. Optimization of groundwater artificial recharge systems using a genetic algorithm: a case study in Beijing, China

    NASA Astrophysics Data System (ADS)

    Hao, Qichen; Shao, Jingli; Cui, Yali; Zhang, Qiulan; Huang, Linxian

    2018-05-01

    An optimization approach is used for the operation of groundwater artificial recharge systems in an alluvial fan in Beijing, China. The optimization model incorporates a transient groundwater flow model, which allows for simulation of the groundwater response to artificial recharge. The facilities' operation with regard to recharge rates is formulated as a nonlinear programming problem to maximize the volume of surface water recharged into the aquifers under specific constraints. This optimization problem is solved by the parallel genetic algorithm (PGA) based on OpenMP, which could substantially reduce the computation time. To solve the PGA with constraints, the multiplicative penalty method is applied. In addition, the facilities' locations are implicitly determined on the basis of the results of the recharge-rate optimizations. Two scenarios are optimized and the optimal results indicate that the amount of water recharged into the aquifers will increase without exceeding the upper limits of the groundwater levels. Optimal operation of this artificial recharge system can also contribute to the more effective recovery of the groundwater storage capacity.
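
    In such a PGA the natural OpenMP target is the set of independent fitness evaluations (here, runs of the groundwater flow model, with penalty terms handling constraints). A minimal sketch with a placeholder objective (illustrative, not the study's code):

        #define POP 200
        #define DIM 16
        static double pop[POP][DIM], fitness[POP];

        /* placeholder objective standing in for a groundwater-flow
           simulation plus multiplicative penalty terms for constraints */
        static double objective(const double *v) {
            double s = 0.0;
            for (int i = 0; i < DIM; i++)
                s += v[i] * v[i];
            return s;
        }

        void evaluate_population(void) {
            /* fitness evaluations are independent, so the dominant cost
               of each generation parallelizes with one directive */
            #pragma omp parallel for
            for (int p = 0; p < POP; p++)
                fitness[p] = objective(pop[p]);
        }

        int main(void) { evaluate_population(); return 0; }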

  3. GPU accelerated dynamic functional connectivity analysis for functional MRI data.

    PubMed

    Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu

    2015-07-01

    Recent advances in multi-core processors and graphics-card-based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize the CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. The multicore implementation using OpenMP on an 8-core processor provides up to a 7.7× speed-up, and the GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. The proposed parallel programming solutions show that multi-core processor and CUDA-supported GPU implementations accelerate DFC analyses significantly, making them more practical for multi-subject studies with more dynamic analyses. Copyright © 2015 Elsevier Ltd. All rights reserved.
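
    A minimal OpenMP sketch of the sliding-window idea for a single pair of time courses: each window's correlation is independent, so windows distribute directly over threads (hypothetical sizes; not the paper's code):

        #include <math.h>

        #define T 500  /* time points */
        #define W 30   /* window length */
        static double x[T], y[T], r[T - W + 1];

        /* Pearson correlation of two length-n series */
        static double corr(const double *a, const double *b, int n) {
            double ma = 0, mb = 0, sab = 0, saa = 0, sbb = 0;
            for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
            ma /= n; mb /= n;
            for (int i = 0; i < n; i++) {
                sab += (a[i] - ma) * (b[i] - mb);
                saa += (a[i] - ma) * (a[i] - ma);
                sbb += (b[i] - mb) * (b[i] - mb);
            }
            return sab / sqrt(saa * sbb);
        }

        void dfc(void) {
            /* windows are independent: one sliding window per iteration */
            #pragma omp parallel for
            for (int t = 0; t <= T - W; t++)
                r[t] = corr(&x[t], &y[t], W);
        }

        int main(void) {
            for (int i = 0; i < T; i++) { x[i] = sin(0.1 * i); y[i] = cos(0.1 * i); }
            dfc();
            return 0;
        }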

  4. Parallel implementation of approximate atomistic models of the AMOEBA polarizable model

    NASA Astrophysics Data System (ADS)

    Demerdash, Omar; Head-Gordon, Teresa

    2016-11-01

    In this work we present a replicated-data hybrid OpenMP/MPI implementation of a hierarchical progression of approximate classical polarizable models that yields speedups of up to ∼10 compared to the standard OpenMP implementation of the exact parent AMOEBA polarizable model. In addition, our parallel implementation exhibits reasonable weak and strong scaling. The resulting parallel software will prove useful for those who are interested in how molecular properties converge in the condensed phase with respect to the many-body expansion (MBE); it provides a fruitful test bed for exploring different electrostatic embedding schemes and offers an interesting possibility for future exascale computing paradigms.

  5. Parallelization of GeoClaw code for modeling geophysical flows with adaptive mesh refinement on many-core systems

    USGS Publications Warehouse

    Zhang, S.; Yuen, D.A.; Zhu, A.; Song, S.; George, D.L.

    2011-01-01

    We parallelized the GeoClaw code on a one-level grid using OpenMP in March 2011 to meet the urgent need to simulate near-shore tsunami waves from the 2011 Tohoku event, and achieved over 75% of the potential speed-up on an eight-core Dell Precision T7500 workstation [1]. After submitting that work to SC11 - the International Conference for High Performance Computing, we obtained an unreleased OpenMP version of GeoClaw from David George, who developed the GeoClaw code as part of his Ph.D. thesis. In this paper, we show the complementary characteristics of the two approaches used in parallelizing GeoClaw and the speed-up obtained by combining the advantages of the two individual approaches with adaptive mesh refinement (AMR), demonstrating the capability of running GeoClaw efficiently on many-core systems. We also show a novel simulation of the 2011 Tohoku tsunami waves inundating the Sendai airport and the Fukushima Nuclear Power Plants, over which the finest grid distance of 20 meters is achieved through a 4-level AMR. This simulation yields quite good predictions of the wave heights and travel time of the tsunami waves. © 2011 IEEE.

  6. BioFVM: an efficient, parallelized diffusive transport solver for 3-D biological simulations

    PubMed Central

    Ghaffarizadeh, Ahmadreza; Friedman, Samuel H.; Macklin, Paul

    2016-01-01

    Motivation: Computational models of multicellular systems require solving systems of PDEs for the release, uptake, decay and diffusion of multiple substrates in 3D, particularly when incorporating the impact of drugs, growth substrates and signaling factors on cell receptors and subcellular systems biology. Results: We introduce BioFVM, a diffusive transport solver tailored to biological problems. BioFVM can simulate the release and uptake of many substrates by cell and bulk sources, as well as diffusion and decay in large 3D domains. It has been parallelized with OpenMP, allowing efficient simulations on desktop workstations or single supercomputer nodes. The code is stable even for large time steps, with linear computational cost scaling. Solutions are first-order accurate in time and second-order accurate in space. The code can be run by itself or as part of a larger simulator. Availability and implementation: BioFVM is written in C++ with parallelization in OpenMP. It is maintained and available for download at http://BioFVM.MathCancer.org and http://BioFVM.sf.net under the Apache License (v2.0). Contact: paul.macklin@usc.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26656933

  7. Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card

    NASA Astrophysics Data System (ADS)

    Jiang, Jinpeng; Zhu, Peimin

    2018-05-01

    Full waveform inversion (FWI) is a challenging procedure due to the high computational cost of the modeling, especially in the elastic case. The graphics processing unit (GPU) has become a popular device for high-performance computing (HPC). To reduce the long computation time, we design and implement a GPU-based 2D elastic FWI (EFWI) in the time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of the relatively small global memory on the GPU, a boundary-saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion accelerates the convergence of the misfit function. A multiscale inversion strategy is employed in the workflow to obtain accurate inversion results. In our tests, the GPU-based implementations using a single GPU device achieve a >15 times speedup in forward modeling, and about a 12 times speedup in the gradient calculation, compared with eight-core CPU implementations optimized with OpenMP. The results from the GPU implementations are verified to have sufficient accuracy by comparison with the results obtained from the CPU implementations.

  8. Parallel processing implementation for the coupled transport of photons and electrons using OpenMP

    NASA Astrophysics Data System (ADS)

    Doerner, Edgardo

    2016-05-01

    In this work the use of OpenMP to implement parallel processing of the Monte Carlo (MC) simulation of the coupled transport of photons and electrons is presented. The implementation was carried out using a modified EGSnrc platform that enables the use of the Microsoft Visual Studio 2013 (VS2013) environment, together with the development tools available in Intel Parallel Studio XE 2015 (XE2015). A performance study of this new implementation was carried out on a desktop PC with a multi-core CPU, taking the performance of the original platform as a reference. The results were satisfactory, in terms of both scalability and parallelization efficiency.
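
    The generic shape of such an OpenMP Monte Carlo driver is independent particle histories distributed over threads, each thread with its own random-number stream, and the scores combined by a reduction. A minimal sketch with a placeholder history routine, not EGSnrc code:

      #include <omp.h>
      #include <random>

      // Placeholder for one particle history; stands in for the coupled
      // photon-electron transport of a full MC engine (not EGSnrc code).
      double simulateHistory(std::mt19937_64& rng) {
          std::uniform_real_distribution<double> u(0.0, 1.0);
          return u(rng); // e.g., energy deposited by this history
      }

      // Independent histories distributed over threads; each thread owns an
      // RNG stream so histories stay uncorrelated, and scores are reduced.
      double runHistories(long nHistories, unsigned long seed) {
          double total = 0.0;
          #pragma omp parallel reduction(+ : total)
          {
              std::mt19937_64 rng(seed + (unsigned long)omp_get_thread_num());
              #pragma omp for
              for (long i = 0; i < nHistories; ++i)
                  total += simulateHistory(rng);
          }
          return total;
      }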

  9. Issues Identified During September 2016 IBM OpenMP 4.5 Hackathon

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Richards, David F.

    In September 2016, IBM hosted an OpenMP 4.5 hackathon at the TJ Watson Research Center. Teams from LLNL, ORNL, SNL, LANL, and LBNL attended the event. As with the 2015 hackathon, IBM produced an extremely useful and successful event, with unmatched support from the compiler team, applications staff, and facilities. Approximately 24 IBM staff supported the 4-day hackathon and spent significant time in the 4-6 weeks beforehand preparing the environment and becoming familiar with the applications. This hackathon was also the first event to feature both the LLVM and XL C/C++ and Fortran compilers. This report records many of the issues encountered by the LLNL teams during the hackathon.

  10. Automation of Data Traffic Control on DSM Architecture

    NASA Technical Reports Server (NTRS)

    Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry

    2001-01-01

    The design of distributed shared memory (DSM) computers liberates users from the duty of distributing data across processors and allows for the incremental development of parallel programs using, for example, OpenMP or Java threads. The DSM architecture greatly simplifies the development of parallel programs that have good performance on a few processors. However, achieving good program scalability on DSM computers requires that the user understand the data flow in the application and use various techniques to avoid data traffic congestion. In this paper we discuss a number of such techniques, including data blocking, data placement, data transposition and page size control, and evaluate their efficiency on the NAS (NASA Advanced Supercomputing) Parallel Benchmarks. We also present a tool which automates the detection of constructs causing data congestion in Fortran array-oriented codes and advises the user on code transformations for improving data traffic in the application.
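
    Data blocking and transposition are simple to express with OpenMP. As an illustrative sketch, not code from the paper, a blocked transpose touches the matrix in cache-sized tiles instead of long strided sweeps; the tile size B is an assumption to be tuned per machine:

      #include <omp.h>
      #include <vector>

      // Blocked (tiled) transpose: touching the matrix in cache-sized tiles
      // avoids the long strided sweeps that generate remote-memory traffic
      // on a DSM machine. The tile size B is an assumption to be tuned.
      const int B = 64;

      void transposeBlocked(const std::vector<double>& a,
                            std::vector<double>& t, int n) {
          #pragma omp parallel for collapse(2)
          for (int ii = 0; ii < n; ii += B)
              for (int jj = 0; jj < n; jj += B)
                  for (int i = ii; i < ii + B && i < n; ++i)
                      for (int j = jj; j < jj + B && j < n; ++j)
                          t[(long)j * n + i] = a[(long)i * n + j];
      }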

  11. Evaluating the networking characteristics of the Cray XC-40 Intel Knights Landing-based Cori supercomputer at NERSC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Doerfler, Douglas; Austin, Brian; Cook, Brandon

    There are many potential issues associated with deploying the Intel Xeon Phi™ (code-named Knights Landing [KNL]) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi™ core is a fraction of that of a Xeon® core. In this paper, we take a look at the trade-offs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g., the trade-off between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL, such as internode optimizations. We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above trade-offs and features using a suite of National Energy Research Scientific Computing Center applications.

  12. A Parallel 2D Numerical Simulation of Tumor Cells Necrosis by Local Hyperthermia

    NASA Astrophysics Data System (ADS)

    Reis, R. F.; Loureiro, F. S.; Lobosco, M.

    2014-03-01

    Hyperthermia has been widely used in cancer treatment to destroy tumors. The main idea of hyperthermia is to heat a specific region, such as a tumor, above a threshold temperature so that the tumor cells are destroyed. This can be accomplished by many heat supply techniques, and the use of magnetic nanoparticles that generate heat when an alternating magnetic field is applied has emerged as a promising technique. In the present paper, the Pennes bioheat transfer equation is adopted to model thermal tumor ablation in the context of magnetic nanoparticles. Numerical simulations are carried out considering different injection sites for the nanoparticles in an attempt to achieve better hyperthermia conditions. An explicit finite difference method is employed to solve the equations. However, a large amount of computation is required for this purpose. Therefore, this work also presents an initial attempt to improve performance using OpenMP, a parallel programming API. Experimental results were quite encouraging: speedups around 35 were obtained on a 64-core machine.
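
    An explicit update of this kind parallelizes directly over the grid. A minimal sketch of one time step of a 2D Pennes-type update; the coefficients, the nanoparticle source term q, and the grid layout are illustrative assumptions, not the authors' formulation:

      #include <omp.h>
      #include <vector>

      // One explicit finite-difference step of a 2D Pennes-type update,
      //   dT/dt = alpha*lap(T) + perf*(Ta - T) + q,
      // where q models heat generated by the injected nanoparticles.
      void step(std::vector<double>& T, const std::vector<double>& q,
                int nx, int ny, double dt, double h,
                double alpha, double perf, double Ta) {
          std::vector<double> Tn(T); // copy keeps boundary values unchanged
          #pragma omp parallel for collapse(2)
          for (int i = 1; i < nx - 1; ++i)
              for (int j = 1; j < ny - 1; ++j) {
                  int k = i * ny + j;
                  double lap = (T[k - ny] + T[k + ny] + T[k - 1] + T[k + 1]
                                - 4.0 * T[k]) / (h * h);
                  Tn[k] = T[k] + dt * (alpha * lap + perf * (Ta - T[k]) + q[k]);
              }
          T.swap(Tn);
      }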

  13. Efficient Thread Labeling for Monitoring Programs with Nested Parallelism

    NASA Astrophysics Data System (ADS)

    Ha, Ok-Kyoon; Kim, Sun-Sook; Jun, Yong-Kee

    It is difficult and cumbersome to detect data races that occur during an execution of parallel programs. Any on-the-fly race detection technique using Lamport's happened-before relation needs a thread labeling scheme for generating unique identifiers which maintain logical concurrency information for the parallel threads. NR labeling is an efficient thread labeling scheme for the fork-join program model with nested parallelism, because its efficiency depends only on the nesting depth for every fork and join operation. This paper presents an improved NR labeling, called e-NR labeling, in which every thread generates its label by inheriting the pointer to its ancestor list from the parent threads or by updating the pointer in a constant amount of time and space. This labeling is more efficient than NR labeling, because its efficiency does not depend on the nesting depth for every fork and join operation. Experiments were performed with OpenMP programs having nesting depths of three or four and maximum parallelism varying from 10,000 to 1,000,000. The results show that e-NR is 5 times faster than NR labeling and 4.3 times faster than OS labeling in the average time for creating and maintaining the thread labels. In the average space required for labeling, it is 3.5 times smaller than NR labeling and 3 times smaller than OS labeling.

  14. Heterogeneous Hardware Parallelism Review of the IN2P3 2016 Computing School

    NASA Astrophysics Data System (ADS)

    Lafage, Vincent

    2017-11-01

    Parallel and hybrid Monte Carlo computation: the Monte Carlo method is the main workhorse for the computation of particle physics observables. This paper provides an overview of various HPC technologies that can be used today: multicore (OpenMP, HPX) and manycore (OpenCL). The rewrite of a twenty-year-old Fortran 77 Monte Carlo code illustrates the various programming paradigms in use beyond the language implementation. The problem of parallel random number generation is also addressed. We give a short report on the one-week school dedicated to these recent approaches, which took place at École Polytechnique in May 2016.

  15. Parallel protein secondary structure prediction based on neural networks.

    PubMed

    Zhong, Wei; Altun, Gulsah; Tian, Xinmin; Harrison, Robert; Tai, Phang C; Pan, Yi

    2004-01-01

    Protein secondary structure prediction has a fundamental influence on today's bioinformatics research. In this work, binary and tertiary classifiers for protein secondary structure prediction are implemented on the Denoeux belief neural network (DBNN) architecture. A hydrophobicity matrix, an orthogonal matrix, BLOSUM62 and PSSM (position-specific scoring matrix) are tested separately as the encoding schemes for the DBNN. The experimental results contribute to the design of new encoding schemes. The new binary classifier for helix versus non-helix (~H) on the DBNN produces a prediction accuracy of 87% when PSSM is used for the input profile. The performance of the DBNN binary classifier is comparable to other leading prediction methods. The good test results for binary classifiers open a new approach for protein structure prediction with neural networks. Because training the neural networks is time consuming, Pthreads and OpenMP are employed to parallelize the DBNN on hyperthreading-enabled Intel architectures. The speedup for 16 Pthreads is 4.9 and the speedup for 16 OpenMP threads is 4 on the 4-processor shared-memory architecture. The speedups of both the OpenMP and Pthreads versions are superior to those reported in other research. With the new parallel training algorithm, thousands of amino acids can be processed in a reasonable amount of time. Our research also shows that Intel's hyperthreading technology is efficient for parallel biological algorithms.

  16. On a model of three-dimensional bursting and its parallel implementation

    NASA Astrophysics Data System (ADS)

    Tabik, S.; Romero, L. F.; Garzón, E. M.; Ramos, J. I.

    2008-04-01

    A mathematical model for the simulation of three-dimensional bursting phenomena and its parallel implementation are presented. The model consists of four nonlinearly coupled partial differential equations that include fast and slow variables, and exhibits bursting in the absence of diffusion. The differential equations have been discretized by means of a linearly-implicit finite difference method, second-order accurate in both space and time, on equally-spaced grids. The resulting system of linear algebraic equations at each time level has been solved by means of the Preconditioned Conjugate Gradient (PCG) method. Three different parallel implementations of the proposed mathematical model have been developed; two of these implementations, i.e., the MPI and the PETSc codes, are based on a message passing paradigm, while the third one, i.e., the OpenMP code, is based on a shared address space paradigm. These three implementations are evaluated on two current high-performance parallel architectures, i.e., a dual-processor cluster and a Shared Distributed Memory (SDM) system. A novel representation of the results that emphasizes the most relevant factors affecting the performance of the parallel implementations is proposed. The comparative analysis of the computational results shows that the MPI and the OpenMP implementations are about twice as efficient as the PETSc code on the SDM system. It is also shown that, for the conditions reported here, the nonlinear dynamics of the three-dimensional bursting phenomena exhibits three stages characterized by asynchronous, synchronous and then asynchronous oscillations, before a quiescent state is reached. It is also shown that the fast system reaches steady state in much less time than the slow variables.

  17. mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OPENMP

    NASA Astrophysics Data System (ADS)

    Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo

    2012-02-01

    We have revised a general purpose parallel molecular dynamics simulation program, mm_par, using object-oriented programming. We parallelized the revised version using a hierarchical scheme in order to utilize more processors for a given system size. The benchmark results are presented here. New version program summary: Program title: mm_par2.0. Catalogue identifier: ADXP_v2_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADXP_v2_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 2 390 858. No. of bytes in distributed program, including test data, etc.: 25 068 310. Distribution format: tar.gz. Programming language: C++. Computer: Any system operated by Linux or Unix. Operating system: Linux. Classification: 7.7. External routines: We provide wrappers for the FFTW [1] and Intel MKL [2] FFT routines; the Numerical Recipes [3] FFT, random number generator, and eigenvalue solver routines; the SPRNG [4] random number generator; the Mersenne Twister [5] random number generator; and a space filling curve routine. Catalogue identifier of previous version: ADXP_v1_0. Journal reference of previous version: Comput. Phys. Comm. 174 (2006) 560. Does the new version supersede the previous version?: Yes. Nature of problem: Structural, thermodynamic, and dynamical properties of fluids and solids from microscopic to mesoscopic scales. Solution method: Molecular dynamics simulation in the NVE, NVT, and NPT ensembles; Langevin dynamics simulation; dissipative particle dynamics simulation. Reasons for new version: First, object-oriented programming has been used, which is known to be open for extension and closed for modification; it is also known to be better for maintenance. Second, version 1.0 was based on the atom decomposition and domain decomposition schemes [6] for parallelization. However, atom decomposition is not popular due to its poor scalability. The domain decomposition scheme scales better, but it is still limited in utilizing a large number of cores on recent petascale computers by the requirement that the domain size be larger than the potential cutoff distance. To go beyond this limitation, a hierarchical parallelization scheme has been adopted in this new version and implemented using MPI [7] and OpenMP [8]. Summary of revisions: (1) Object-oriented programming has been used. (2) A hierarchical parallelization scheme has been adopted. (3) The SPME routine has been fully parallelized with parallel 3D FFT using a volumetric decomposition scheme [9]. K.J.O. thanks Mr. Seung Min Lee for useful discussion on programming and debugging. Running time: Running time depends on the system size and the methods used. For a test system containing a protein (PDB id: 5DHFR) with the CHARMM22 force field [10] and 7023 TIP3P [11] waters in a simulation box of dimensions 62.23 Å × 62.23 Å × 62.23 Å, the benchmark results are given in Fig. 1. Here the potential cutoff distance was set to 12 Å and the switching function was applied from 10 Å for the force calculation in real space. For the SPME [12] calculation, K1, K2, and K3 were set to 64 and the interpolation order was set to 4. The fast Fourier transforms used the Intel MKL library. All bonds including hydrogen atoms were constrained using the SHAKE/RATTLE algorithms [13,14]. The code was compiled using Intel compiler version 11.1 and mvapich2 version 1.5. Fig. 2 shows performance gains from using the CUDA-enabled version [15] of mm_par for the 5DHFR simulation in water on an Intel Core2Quad 2.83 GHz and a GeForce GTX 580. Even though mm_par2.0 is not yet ported to GPUs, these performance data are useful for estimating mm_par2.0 performance on a GPU. (Fig. 1 caption: Timing results for 1000 MD steps; 1, 2, 4, and 8 denote the number of OpenMP threads. Fig. 2 caption: Timing results for 1000 MD steps from double precision simulation on CPU, single precision simulation on GPU, and double precision simulation on GPU.)

  18. Application of a hybrid MPI/OpenMP approach for parallel groundwater model calibration using multi-core computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tang, Guoping; D'Azevedo, Ed F; Zhang, Fan

    2010-01-01

    Calibration of groundwater models involves hundreds to thousands of forward solutions, each of which may solve many transient coupled nonlinear partial differential equations, resulting in a computationally intensive problem. We describe a hybrid MPI/OpenMP approach to exploit two levels of parallelism in software and hardware to reduce calibration time on multi-core computers. HydroGeoChem 5.0 (HGC5) is parallelized using OpenMP for direct solutions of a reactive transport model application and a field-scale coupled flow and transport model application. In the reactive transport model, a single parallelizable loop is identified, using GPROF, to account for over 97% of the total computational time. Adding a few lines of OpenMP compiler directives to the loop yields a speedup of about 10 on a 16-core compute node. For the field-scale model, parallelizable loops in 14 of the 174 HGC5 subroutines that require 99% of the execution time are identified. As these loops are parallelized incrementally, the scalability is found to be limited by a loop where Cray PAT detects a cache miss rate of over 90%. With this loop rewritten, a speedup similar to that of the first application is achieved. The OpenMP-parallelized code can be run efficiently on multiple workstations in a network or multiple compute nodes on a cluster as slaves using parallel PEST to speed up model calibration. To run calibration on clusters as a single task, the Levenberg-Marquardt algorithm is added to HGC5 with the Jacobian calculation and lambda search parallelized using MPI. With this hybrid approach, 100-200 compute cores are used to reduce the calibration time from weeks to a few hours for these two applications. This approach is applicable to most existing groundwater model codes for many applications.
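
    The pattern described here, profiling to find one dominant loop and then adding a few directive lines, can be sketched as follows; speciesUpdate is a hypothetical stand-in for the per-node reactive-transport chemistry, not HGC5 code:

      #include <omp.h>
      #include <cmath>
      #include <vector>

      // Hypothetical stand-in for the per-node reactive-transport chemistry;
      // in the profiled application one such loop took ~97% of the run time.
      void speciesUpdate(std::vector<double>& c, int node) {
          c[node] = std::exp(-c[node]); // placeholder kinetics
      }

      void forwardSolve(std::vector<double>& conc) {
          // The "few lines of OpenMP compiler directives" on the dominant loop:
          #pragma omp parallel for schedule(static)
          for (int node = 0; node < (int)conc.size(); ++node)
              speciesUpdate(conc, node);
      }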

  19. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    NASA Astrophysics Data System (ADS)

    DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

    2018-03-01

    With recent developments in parallel supercomputing architecture, manycore, multicore, and GPU processors are now commonplace, resulting in more levels of parallelism, deeper memory hierarchies, and greater programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  20. Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS

    NASA Astrophysics Data System (ADS)

    Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.

    2018-03-01

    A simulation of breaking waves using the Navier-Stokes equations via the moving particle semi-implicit (MPS) method over a closed domain is given. The results show that parallel computing on a multicore architecture using the OpenMP platform can reduce the computational time to almost half of the serial time. Here, a comparison of two computer architectures (AMD and Intel) is performed. The Intel architecture is shown to be better than the AMD architecture in CPU time; however, in efficiency, the AMD architecture is slightly higher than the Intel. For the simulation with 1512 particles, the CPU times using Intel and AMD are 12662.47 and 28282.30, respectively. Moreover, for a similar number of particles, AMD obtains an efficiency of 50.09% while Intel reaches up to 49.42%.
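
    Speed-up and efficiency as reported here are related by efficiency = speedup / threads. A minimal sketch of measuring both with omp_get_wtime around a placeholder kernel; the kernel and problem size are illustrative, not the MPS solver:

      #include <omp.h>
      #include <cstdio>

      // Placeholder kernel standing in for one MPS-style sweep.
      double run(long n) {
          double s = 0.0;
          #pragma omp parallel for reduction(+ : s)
          for (long i = 1; i <= n; ++i) s += 1.0 / (double)i;
          return s;
      }

      int main() {
          const long n = 200000000L;
          const int p = omp_get_max_threads();

          omp_set_num_threads(1);                 // serial reference
          double t0 = omp_get_wtime();
          double r1 = run(n);
          double ts = omp_get_wtime() - t0;

          omp_set_num_threads(p);                 // parallel run
          t0 = omp_get_wtime();
          double rp = run(n);
          double tp = omp_get_wtime() - t0;

          double speedup = ts / tp;               // efficiency = speedup / p
          printf("check %.3f %.3f  speedup %.2f  efficiency %.1f%%\n",
                 r1, rp, speedup, 100.0 * speedup / p);
          return 0;
      }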

  1. Parallelizing a peanut butter sandwich

    NASA Astrophysics Data System (ADS)

    Quenette, S. M.

    2005-12-01

    This poster aims to demonstrate, in a novel way, why contemporary computational code development is seemingly hard for a geodynamics modeler (i.e. a non-computer-scientist). For example, to utilise contemporary computer hardware, parallelisation is required. But why do we choose the explicit approach (MPI) over an implicit (OpenMP) one? How does this relate to typical geodynamics codes? And do we face the same style of problems in everyday life? We aim to demonstrate that a little complexity, forethought and effort is worthwhile.

  2. Massively parallel sparse matrix function calculations with NTPoly

    NASA Astrophysics Data System (ADS)

    Dawson, William; Nakajima, Takahito

    2018-04-01

    We present NTPoly, a massively parallel library for computing the functions of sparse, symmetric matrices. The theory of matrix functions is a well-developed framework with a wide range of applications, including differential equations, graph theory, and electronic structure calculations. One particularly important application area is diagonalization-free methods in quantum chemistry. When the input and output of the matrix function are sparse, methods based on polynomial expansions can be used to compute matrix functions in linear time. We present a library based on these methods that can compute a variety of matrix functions. Distributed-memory parallelization is based on a communication-avoiding sparse matrix multiplication algorithm, and OpenMP task parallelization is utilized to implement hybrid parallelization. We describe NTPoly's interface and show how it can be integrated with programs written in many different programming languages. We demonstrate the merits of NTPoly by performing large-scale calculations on the K computer.

  3. A High Order, Locally-Adaptive Method for the Navier-Stokes Equations

    NASA Astrophysics Data System (ADS)

    Chan, Daniel

    1998-11-01

    I have extended the FOSLS method of Cai, Manteuffel and McCormick (1997) and implemented it within the framework of a spectral element formulation using the Legendre polynomial basis functions. The FOSLS method solves the Navier-Stokes equations as a system of coupled first-order equations and provides the ellipticity that is needed for fast iterative matrix solvers like multigrid to operate efficiently. Each element is treated as an object and its properties are self-contained. Only C^0 continuity is imposed across element interfaces; this design allows local grid refinement and coarsening without the burden of an elaborate data structure, since only information along element boundaries is needed. With the FORTRAN 90 programming environment, I maintain high computational efficiency by employing a hybrid parallel processing model: OpenMP directives provide loop-level parallelism executed on a shared-memory SMP, while the MPI protocol allows the distribution of elements to a cluster of SMPs connected via a commodity network. This talk will provide timing results and a comparison with a second-order finite difference method.

  4. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

    DOE PAGES

    Carter Edwards, H.; Trott, Christian R.; Sunderland, Daniel

    2014-07-22

    The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos' abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. Furthermore, the Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.

  5. Multi-GPU parallel algorithm design and analysis for improved inversion of probability tomography with gravity gradiometry data

    NASA Astrophysics Data System (ADS)

    Hou, Zhenlong; Huang, Danian

    2017-09-01

    In this paper, we first study the inversion of probability tomography (IPT) with gravity gradiometry data. The spatial resolution of the results is improved by multi-tensor joint inversion, a depth weighting matrix and other methods. To address the problems posed by big data in exploration, we present a parallel algorithm and its performance analysis, combining Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) on the basis of Graphics Processing Unit (GPU) acceleration. In tests on a synthetic model and on real data from Vinton Dome, we obtain improved results, and the improved inversion algorithm is shown to be effective and feasible. The performance of the parallel algorithm we designed is better than that of other CUDA-based implementations, with a maximum speedup of more than 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are applied to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is demonstrated to be able to process larger-scale data, and the new analysis method is practical.

  6. GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5

    NASA Astrophysics Data System (ADS)

    Clay, M. P.; Buaria, D.; Yeung, P. K.; Gotoh, T.

    2018-07-01

    This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 8192³ grid points for the scalar field computed with 8192 XK7 nodes.
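
    The core OpenMP 4.5 idiom credited here is a TARGET region made asynchronous with NOWAIT so that host work proceeds while the device computes. A minimal sketch with a placeholder kernel; the saxpy loop stands in for the CCD computations and all names are illustrative:

      #include <omp.h>

      // Asynchronous device offload: NOWAIT turns the TARGET region into a
      // deferred task so host work can overlap it; TASKWAIT joins at the end.
      void overlap(float* x, float* y, int n, float a) {
          #pragma omp target teams distribute parallel for \
              map(to: x[0:n]) map(tofrom: y[0:n]) nowait
          for (int i = 0; i < n; ++i)
              y[i] = a * x[i] + y[i];

          // ... host-side computation/communication can proceed here ...

          #pragma omp taskwait // wait for the deferred target task
      }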

  7. [Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].

    PubMed

    Furuta, Takuya; Sato, Tatsuhiko

    2015-01-01

    Time-consuming Monte Carlo dose calculations have become feasible owing to the development of computer technology. However, this recent development is due to the emergence of multi-core high-performance computers, so parallel computing has become a key to achieving good software performance. The Monte Carlo simulation code PHITS contains two parallel computing functions: distributed-memory parallelization using the Message Passing Interface (MPI) protocol and shared-memory parallelization using Open Multi-Processing (OpenMP) directives. Users can choose between the two functions according to their needs. This paper explains the two functions, with their advantages and disadvantages. Some test applications are also provided to show their performance on a typical multi-core high-performance workstation.

  8. Shared Memory Parallelization of an Implicit ADI-type CFD Code

    NASA Technical Reports Server (NTRS)

    Hauser, Th.; Huang, P. G.

    1999-01-01

    A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of 180, Re(sub tau) = 180, has shown good agreement with existing data.

  9. A Modeling Method of Fluttering Leaves Based on Point Cloud

    NASA Astrophysics Data System (ADS)

    Tang, J.; Wang, Y.; Zhao, Y.; Hao, W.; Ning, X.; Lv, K.; Shi, Z.; Zhao, M.

    2017-09-01

    Leaves falling gently or fluttering are a common phenomenon in natural scenes. The authenticity of falling leaves plays an important part in the dynamic modeling of natural scenes, and falling-leaf models have wide applications in animation and virtual reality. We propose a novel modeling method for fluttering leaves based on point clouds in this paper. According to the shape and weight of the leaves and the wind speed, three basic falling trajectories are defined: rotation falling, roll falling and screw-roll falling. At the same time, a parallel algorithm based on OpenMP is implemented to satisfy the real-time needs of practical applications. Experimental results demonstrate that the proposed method is amenable to the incorporation of a variety of desirable effects.

  10. GPU Accelerated Browser for Neuroimaging Genomics.

    PubMed

    Zigon, Bob; Li, Huang; Yao, Xiaohui; Fang, Shiaofen; Hasan, Mohammad Al; Yan, Jingwen; Moore, Jason H; Saykin, Andrew J; Shen, Li

    2018-04-25

    Neuroimaging genomics is an emerging field that provides exciting opportunities to understand the genetic basis of brain structure and function. The unprecedented scale and complexity of the imaging and genomics data, however, have presented critical computational bottlenecks. In this work we present our initial efforts towards building an interactive visual exploratory system for mining big data in neuroimaging genomics. A GPU accelerated browsing tool for neuroimaging genomics is created that implements the ANOVA algorithm for single nucleotide polymorphism (SNP) based analysis and the VEGAS algorithm for gene-based analysis, and executes them at interactive rates. The ANOVA algorithm is 110 times faster than the 4-core OpenMP version, while the VEGAS algorithm is 375 times faster than its 4-core OpenMP counterpart. This approach lays a solid foundation for researchers to address the challenges of mining large-scale imaging genomics datasets via interactive visual exploration.

  11. Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shan, Hongzhang; Williams, Samuel; Jong, Wibe de

    In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism, coupled with reduced memory capacity, demand an altogether different approach. In this paper we explore augmenting two NWChem modules, the triples correction of CCSD(T) and Fock matrix construction, with OpenMP so that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. To proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient to attain high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65x better performance for the triples part of CCSD(T), due in large part to the fact that the limited on-card memory restricts the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6x better performance on Fock matrix construction when compared with the best MPI implementations running multiple processes per card.
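
    The "straightforward application" to deep loop nests usually amounts to collapsing the outer loops of a contraction. A simplified sketch; the contraction shown is a stand-in, far simpler than the actual CCSD(T) triples tensors:

      #include <omp.h>

      // A simplified tensor contraction, t3[i,j,k] += sum_a v[i,a]*t2[a,j,k],
      // threaded by collapsing the three outer loops into one iteration space.
      void contract(const double* v, const double* t2, double* t3,
                    int no, int nv) {
          #pragma omp parallel for collapse(3)
          for (int i = 0; i < no; ++i)
              for (int j = 0; j < no; ++j)
                  for (int k = 0; k < no; ++k) {
                      double acc = 0.0;
                      for (int a = 0; a < nv; ++a)
                          acc += v[i * nv + a] * t2[(a * no + j) * no + k];
                      t3[(i * no + j) * no + k] += acc;
                  }
      }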

  12. A Hybrid MPI/OpenMP Approach for Parallel Groundwater Model Calibration on Multicore Computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tang, Guoping; D'Azevedo, Ed F; Zhang, Fan

    2010-01-01

    Groundwater model calibration is becoming increasingly computationally time intensive. We describe a hybrid MPI/OpenMP approach to exploit two levels of parallelism in software and hardware to reduce calibration time on multicore computers with minimal parallelization effort. First, HydroGeoChem 5.0 (HGC5) is parallelized using OpenMP for a uranium transport model with over a hundred species involving nearly a hundred reactions, and for a field-scale coupled flow and transport model. In the first application, a single parallelizable loop is identified that consumes over 97% of the total computational time. With a few lines of OpenMP compiler directives inserted into the code, the computational time is reduced about ten times on a compute node with 16 cores. The performance is further improved by selectively parallelizing a few more loops. For the field-scale application, parallelizable loops in 15 of the 174 subroutines in HGC5 are identified that take more than 99% of the execution time. By adding the preconditioned conjugate gradient solver and BICGSTAB, and using a coloring scheme to separate the elements, nodes, and boundary sides, the subroutines for finite element assembly, soil property update, and boundary condition application are parallelized, resulting in a speedup of about 10 on a 16-core compute node. The Levenberg-Marquardt (LM) algorithm is added into HGC5 with the Jacobian calculation and lambda search parallelized using MPI. With this hybrid approach, as many compute nodes as there are adjustable parameters (when the forward difference is used for the Jacobian approximation), or twice that number (if the central difference is used), are employed to reduce the calibration time from days and weeks to a few hours for the two applications. This approach can be extended to global optimization schemes and Monte Carlo analyses where thousands of compute nodes can be efficiently utilized.

  13. Fast hydrological model calibration based on the heterogeneous parallel computing accelerated shuffled complex evolution method

    NASA Astrophysics Data System (ADS)

    Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Hong, Yang; Zuo, Depeng; Ren, Minglei; Lei, Tianjie; Liang, Ke

    2018-01-01

    Hydrological model calibration has been a hot issue for decades. The shuffled complex evolution method developed at the University of Arizona (SCE-UA) has been proved to be an effective and robust optimization approach. However, its computational efficiency deteriorates significantly when the amount of hydrometeorological data increases. In recent years, the rise of heterogeneous parallel computing has brought hope for the acceleration of hydrological model calibration. This study proposed a parallel SCE-UA method and applied it to the calibration of a watershed rainfall-runoff model, the Xinanjiang model. The parallel method was implemented on heterogeneous computing systems using OpenMP and CUDA. Performance testing and sensitivity analysis were carried out to verify its correctness and efficiency. Comparison results indicated that heterogeneous parallel computing-accelerated SCE-UA converged much more quickly than the original serial version and possessed satisfactory accuracy and stability for the task of fast hydrological model calibration.

  14. CPU/GPU Computing for an Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform

    NASA Astrophysics Data System (ADS)

    Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin

    2016-06-01

    CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double-precision alternating direction implicit (ADI) solver for the three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software onto a heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern, MPI-OpenMP-CUDA, that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain a speedup of 6.0 for the ADI solver on one Tesla M2050 GPU compared with two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on a heterogeneous platform.

  15. Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

    NASA Technical Reports Server (NTRS)

    Lawson, Gary; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

    2017-01-01

    In many fields, real-world applications for High Performance Computing have already been developed. For these applications to stay up-to-date, new parallel strategies must be explored to yield the best performance; however, restructuring or modifying a real-world application may be daunting depending on the size of the code. In this case, a mini-app may be employed to quickly explore such options without modifying the entire code. In this work, several mini-apps have been created to enhance a real-world application performance, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23 was measured for MPI+SMPI, but only 11 was measured for MPI+OpenMP.
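
    A minimal skeleton of the MPI+OpenMP variant compared above: MPI ranks split the global work while OpenMP threads split each rank's loop. The kernel and work distribution are illustrative assumptions, and the sketch assumes the rank count divides the problem size evenly:

      #include <mpi.h>
      #include <omp.h>
      #include <cstdio>

      int main(int argc, char** argv) {
          int provided, rank, size;
          // FUNNELED: only the master thread of each rank makes MPI calls.
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          const long n = 1L << 24;             // global work (assumed divisible)
          const long chunk = n / size;
          const long lo = rank * chunk, hi = lo + chunk;

          double local = 0.0, global = 0.0;
          #pragma omp parallel for reduction(+ : local) // threads inside a rank
          for (long i = lo; i < hi; ++i)
              local += 1.0 / (double)(i + 1);           // placeholder kernel

          MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0) printf("sum = %f\n", global);
          MPI_Finalize();
          return 0;
      }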

  16. On the Performance of an Algebraic Multigrid Solver on Multicore Clusters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Baker, A H; Schulz, M; Yang, U M

    2010-04-29

    Algebraic multigrid (AMG) solvers have proven to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore cluster architectures, we face new challenges that can significantly harm AMG's performance. We discuss our experiences on such an architecture and present a set of techniques that help users to overcome the associated problems, including thread and process pinning and correct memory associations. We have implemented most of the techniques in a MultiCore SUPport library (MCSup), which helps to map OpenMP applications to multicore machines. We present results using both an MPI-only and a hybrid MPI/OpenMP model.
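
    The thread-pinning and memory-association fixes described here can be reproduced with portable OpenMP controls rather than MCSup itself, whose API we do not assume. A small sketch; typical usage would be OMP_PLACES=cores OMP_PROC_BIND=close ./a.out:

      #include <omp.h>
      #include <cstdio>

      // Portable OpenMP pinning controls, an illustrative stand-in for the
      // pinning and memory-association techniques the report describes.
      int main() {
          #pragma omp parallel proc_bind(spread) // spread threads across places
          {
              // First-touch initialization inside the parallel region would
              // also place memory pages near the threads that use them.
              printf("thread %d runs on place %d\n",
                     omp_get_thread_num(), omp_get_place_num());
          }
          return 0;
      }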

  17. OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

    NASA Astrophysics Data System (ADS)

    Kimura, Keiji; Mase, Masayoshi; Mikami, Hiroki; Miyamoto, Takamichi; Shirako, Jun; Kasahara, Hironori

    OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in a METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restrictions for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR compiler with the OSCAR API can be compiled by ordinary OpenMP compilers, since the OSCAR API is designed as a subset of OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and RP2, a newly developed consumer electronics multicore chip by Renesas, Hitachi and Waseda. The scalability evaluation shows that, on average, the OSCAR compiler with the OSCAR API can achieve a 5.8-fold speedup over sequential execution on the Power5+ workstation with eight cores and a 2.9-fold speedup on RP2 with four cores. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler up to 3.3 times on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in real-time execution mode.

  18. Performance evaluation of canny edge detection on a tiled multicore architecture

    NASA Astrophysics Data System (ADS)

    Brethorst, Andrew Z.; Desai, Nehal; Enright, Douglas P.; Scrofano, Ronald

    2011-01-01

    In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. These strategies include different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP, and it is capable of parallelization across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that, for the same number of threads, the programmer-implemented domain decomposition exhibits higher speed-ups than the compiler-managed loop-level parallelism implemented with OpenMP.
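
    The two strategies are easy to contrast in a sketch. Here the per-pixel kernel is a trivial placeholder for a Canny stage such as gradient computation, not the paper's code:

      #include <omp.h>
      #include <algorithm>
      #include <vector>

      // Trivial per-pixel placeholder for one Canny stage (e.g., gradients).
      void processRows(std::vector<unsigned char>& img, int w, int r0, int r1) {
          for (int r = r0; r < r1; ++r)
              for (int c = 0; c < w; ++c)
                  img[r * w + c] = (unsigned char)(255 - img[r * w + c]);
      }

      // Strategy 1: compiler-managed loop-level parallelism over image rows.
      void loopLevel(std::vector<unsigned char>& img, int w, int h) {
          #pragma omp parallel for
          for (int r = 0; r < h; ++r)
              processRows(img, w, r, r + 1);
      }

      // Strategy 2: programmer-managed domain decomposition into row bands,
      // one contiguous subimage per thread (the faster variant in this study).
      void domainDecomp(std::vector<unsigned char>& img, int w, int h) {
          #pragma omp parallel
          {
              int t = omp_get_thread_num(), nt = omp_get_num_threads();
              int band = (h + nt - 1) / nt;
              int r0 = t * band, r1 = std::min(h, r0 + band);
              if (r0 < r1) processRows(img, w, r0, r1);
          }
      }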

  19. Ice-sheet modelling accelerated by graphics cards

    NASA Astrophysics Data System (ADS)

    Brædstrup, Christian Fredborg; Damsgaard, Anders; Egholm, David Lundbek

    2014-11-01

    Studies of glaciers and ice sheets have increased the demand for high performance numerical ice flow models over the past decades. When exploring the highly non-linear dynamics of fast flowing glaciers and ice streams, or when coupling multiple flow processes for ice, water, and sediment, researchers are often forced to use super-computing clusters. As an alternative to conventional high-performance computing hardware, the Graphical Processing Unit (GPU) is capable of massively parallel computing while retaining a compact design and low cost. In this study, we present a strategy for accelerating a higher-order ice flow model using a GPU. By applying the newest GPU hardware, we achieve up to 180× speedup compared to a similar but serial CPU implementation. Our results suggest that GPU acceleration is a competitive option for ice-flow modelling when compared to CPU-optimised algorithms parallelised by the OpenMP or Message Passing Interface (MPI) protocols.

  20. Facilitating arrhythmia simulation: the method of quantitative cellular automata modeling and parallel running

    PubMed Central

    Zhu, Hao; Sun, Yan; Rajagopal, Gunaretnam; Mondry, Adrian; Dhar, Pawan

    2004-01-01

    Background: Many arrhythmias are triggered by abnormal electrical activity at the ionic channel and cell level, and then evolve spatio-temporally within the heart. To understand arrhythmias better and to diagnose them more precisely by their ECG waveforms, a whole-heart model is required to explore the association between the massively parallel activities at the channel/cell level and the integrative electrophysiological phenomena at the organ level. Methods: We have developed a method to build large-scale electrophysiological models by using extended cellular automata, and to run such models on a cluster of shared memory machines. We describe here the method, including the extension of a language-based cellular automaton to implement quantitative computing, the building of a whole-heart model with Visible Human Project data, the parallelization of the model on a cluster of shared memory computers with OpenMP and MPI hybrid programming, and a simulation algorithm that links cellular activity with the ECG. Results: We demonstrate that electrical activities at the channel, cell, and organ levels can be traced and captured conveniently in our extended cellular automaton system. Examples of ECG waveforms simulated with a 2-D slice are given to support the ECG simulation algorithm. A performance evaluation of the 3-D model on a four-node cluster is also given. Conclusions: Quantitative multicellular modeling with extended cellular automata is a highly efficient and widely applicable method for weaving experimental data at different levels into computational models. This process can be used to investigate complex and collective biological activities that can be described neither by their governing differential equations nor by discrete parallel computation alone. Transparent cluster computing is a convenient and effective method for making time-consuming simulations feasible. Arrhythmias, as a typical case, can be effectively simulated with the methods described. PMID:15339335

  1. OpenSWPC: an open-source integrated parallel simulation code for modeling seismic wave propagation in 3D heterogeneous viscoelastic media

    NASA Astrophysics Data System (ADS)

    Maeda, Takuto; Takemura, Shunsuke; Furumura, Takashi

    2017-07-01

    We have developed an open-source software package, Open-source Seismic Wave Propagation Code (OpenSWPC), for parallel numerical simulations of seismic wave propagation in 3D and 2D (P-SV and SH) viscoelastic media based on the finite difference method at local-to-regional scales. This code is equipped with a frequency-independent attenuation model based on the generalized Zener body and an efficient perfectly matched layer for the absorbing boundary condition. A hybrid-style programming model using OpenMP and the Message Passing Interface (MPI) is adopted for efficient parallel computation. OpenSWPC has wide applicability for seismological studies and great portability, allowing excellent performance from PC clusters to supercomputers. Without modifying the code, users can conduct seismic wave propagation simulations using their own velocity structure models and the necessary source representations by specifying them in an input parameter file. The code has various modes for different types of velocity structure model input and different source representations, such as single force, moment tensor and plane-wave incidence, which can easily be selected via the input parameters. Widely used binary data formats, the Network Common Data Form (NetCDF) and the Seismic Analysis Code (SAC), are adopted for the input of the heterogeneous structure model and the outputs of the simulation results, so users can easily handle the input/output datasets. All codes are written in Fortran 2003 and are available with detailed documents in a public repository.

  2. Investigation of obstacle effect to improve conjugate heat transfer in backward facing step channel using fast simulation of incompressible flow

    NASA Astrophysics Data System (ADS)

    Nouri-Borujerdi, Ali; Moazezi, Arash

    2018-01-01

    The current study investigates the conjugate heat transfer characteristics of laminar flow in a backward-facing step channel. All of the channel walls are insulated except the lower thick wall, which is held at a constant temperature. The upper wall includes an insulated obstacle perpendicular to the flow direction. The effect of the obstacle height and location on the fluid flow and heat transfer is numerically explored for Reynolds numbers in the range 10 ≤ Re ≤ 300. The incompressible Navier-Stokes and thermal energy equations are solved simultaneously in the fluid region by an upwind compact finite difference scheme based on flux-difference splitting in conjunction with the artificial compressibility method. In the thick wall, the energy equation reduces to the Laplace equation. A multi-block approach is used to perform parallel computing to reduce the CPU time; each block is modeled separately, sharing boundary conditions with its neighbors. The program was written in Fortran with the OpenMP API. The results showed that the multi-block parallel computing method is a simple, robust scheme with high performance and high-order accuracy. Moreover, the results demonstrated that increasing the Reynolds number and the obstacle height, as well as decreasing the horizontal distance between the obstacle and the step, improves the heat transfer.

  3. Massively parallel and linear-scaling algorithm for second-order Møller-Plesset perturbation theory applied to the study of supramolecular wires

    NASA Astrophysics Data System (ADS)

    Kjærgaard, Thomas; Baudin, Pablo; Bykov, Dmytro; Eriksen, Janus Juul; Ettenhuber, Patrick; Kristensen, Kasper; Larkin, Jeff; Liakh, Dmitry; Pawłowski, Filip; Vose, Aaron; Wang, Yang Min; Jørgensen, Poul

    2017-03-01

    We present a scalable cross-platform hybrid MPI/OpenMP/OpenACC implementation of the Divide-Expand-Consolidate (DEC) formalism with portable performance on heterogeneous HPC architectures. The Divide-Expand-Consolidate formalism is designed to reduce the steep computational scaling of conventional many-body methods employed in electronic structure theory to linear scaling, while providing a simple mechanism for controlling the error introduced by this approximation. Our massively parallel implementation of this general scheme has three levels of parallelism, being a hybrid of the loosely coupled task-based parallelization approach and the conventional MPI+X programming model, where X is either OpenMP or OpenACC. We demonstrate strong and weak scalability of this implementation on heterogeneous HPC systems, namely on the GPU-based Cray XK7 Titan supercomputer at the Oak Ridge National Laboratory. Using the "resolution of the identity second-order Møller-Plesset perturbation theory" (RI-MP2) as the physical model for simulating correlated electron motion, the linear-scaling DEC implementation is applied to 1-aza-adamantane-trione (AAT) supramolecular wires containing up to 40 monomers (2440 atoms, 6800 correlated electrons, 24 440 basis functions and 91 280 auxiliary functions). This represents the largest molecular system treated at the MP2 level of theory, demonstrating an efficient removal of the scaling wall pertinent to conventional quantum many-body methods.

  4. How to Build MCNP 6.2

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bull, Jeffrey S.

    This presentation describes how to build MCNP 6.2. MCNP®* 6.2 can be compiled on Macs, PCs, and most Linux systems. It can also be built for parallel execution using both OpenMP and Message Passing Interface (MPI) methods. MCNP6 requires Fortran, C, and C++ compilers to build the code.

  5. batman: BAsic Transit Model cAlculatioN in Python

    NASA Astrophysics Data System (ADS)

    Kreidberg, Laura

    2015-11-01

    I introduce batman, a Python package for modeling exoplanet transit light curves. The batman package supports calculation of light curves for any radially symmetric stellar limb darkening law, using a new integration algorithm for models that cannot be quickly calculated analytically. The code uses C extension modules to speed up model calculation and is parallelized with OpenMP. For a typical light curve with 100 data points in transit, batman can calculate one million quadratic limb-darkened models in 30 seconds with a single 1.7 GHz Intel Core i5 processor. The same calculation takes seven minutes using the four-parameter nonlinear limb darkening model (computed to 1 ppm accuracy). Maximum truncation error for integrated models is an input parameter that can be set as low as 0.001 ppm, ensuring that the community is prepared for the precise transit light curves we anticipate measuring with upcoming facilities. The batman package is open source and publicly available at https://github.com/lkreidberg/batman .

  6. Performance Study of Monte Carlo Codes on Xeon Phi Coprocessors — Testing MCNP 6.1 and Profiling ARCHER Geometry Module on the FS7ONNi Problem

    NASA Astrophysics Data System (ADS)

    Liu, Tianyu; Wolfe, Noah; Lin, Hui; Zieb, Kris; Ji, Wei; Caracappa, Peter; Carothers, Christopher; Xu, X. George

    2017-09-01

    This paper contains two parts revolving around Monte Carlo transport simulation on Intel Many Integrated Core coprocessors (MIC, also known as Xeon Phi). (1) MCNP 6.1 was recompiled into multithreading (OpenMP) and multiprocessing (MPI) forms, respectively, without modification to the source code. The new codes were tested on a 60-core 5110P MIC. The test case was FS7ONNi, a radiation shielding problem used in MCNP's verification and validation suite. It was observed that both codes became slower on the MIC than on a 6-core X5650 CPU, by a factor of 4 for the MPI code and, abnormally, 20 for the OpenMP code, and both exhibited limited strong-scaling capability. (2) We have recently added a Constructive Solid Geometry (CSG) module to our ARCHER code to provide better support for geometry modelling in radiation shielding simulation. The functions of this module are frequently called in the particle random walk process. To identify the performance bottleneck we developed a CSG proxy application and profiled the code using the geometry data from FS7ONNi. The profiling data showed that the code was primarily memory latency bound on the MIC. This study suggests that, despite the low initial porting effort, Monte Carlo codes do not naturally lend themselves to the MIC platform, just as they do not to GPUs, and that the memory latency problem needs to be addressed in order to achieve a decent performance gain.

  7. cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing.

    PubMed

    Takeuchi, Toshiki; Yamada, Atsuo; Aoki, Takashi; Nishimura, Kunihiro

    2016-01-01

    Next-generation sequencing can determine DNA bases, and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format or its compressed binary version (BAM). SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and executes quickly. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. Given the accumulation of next-generation sequencing data, a simple parallelization program that can support cloud and PC cluster environments is required. We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., Windows, Linux, Mac OS X) with Clojure. Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.

  8. Model-independent partial wave analysis using a massively-parallel fitting framework

    NASA Astrophysics Data System (ADS)

    Sun, L.; Aoude, R.; dos Reis, A. C.; Sokoloff, M.

    2017-10-01

    The functionality of GooFit, a GPU-friendly framework for doing maximum-likelihood fits, has been extended to extract model-independent S-wave amplitudes in three-body decays such as D⁺ → h⁺h⁺h⁻. A full amplitude analysis is done where the magnitudes and phases of the S-wave amplitudes are anchored at a finite number of m²(h⁺h⁻) control points, and a cubic spline is used to interpolate between these points. The amplitudes for P-wave and D-wave intermediate states are modeled as spin-dependent Breit-Wigner resonances. GooFit uses the Thrust library, with a CUDA backend for NVIDIA GPUs and an OpenMP backend for threads on conventional CPUs. Performance on a variety of platforms is compared. Executing on systems with GPUs is typically a few hundred times faster than executing the same algorithm on a single CPU.
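
    The control-point construction is easy to picture in code. The sketch below interpolates complex S-wave amplitudes anchored at a few m²(h⁺h⁻) knots; GooFit interpolates with a cubic spline, while plain linear interpolation is used here for brevity, and the knot values are invented for illustration.

      // Model-independent S-wave: amplitudes anchored at control points in
      // m2 = m^2(h+h-), interpolated in between (linear here, spline in GooFit).
      #include <complex>
      #include <cstdio>
      #include <vector>

      std::complex<double> swave(double m2,
                                 const std::vector<double>& knots,
                                 const std::vector<std::complex<double>>& amp) {
          if (m2 <= knots.front()) return amp.front();
          if (m2 >= knots.back())  return amp.back();
          std::size_t i = 1;
          while (knots[i] < m2) ++i;                   // find the bracketing knots
          double t = (m2 - knots[i - 1]) / (knots[i] - knots[i - 1]);
          return (1.0 - t) * amp[i - 1] + t * amp[i];  // interpolate the amplitude
      }

      int main() {
          // Invented control points and amplitudes, purely for illustration.
          std::vector<double> knots = {0.3, 1.0, 2.0, 3.0};
          std::vector<std::complex<double>> amp =
              {{1.0, 0.0}, {0.8, 0.4}, {0.2, 0.7}, {0.1, 0.1}};
          std::complex<double> a = swave(1.5, knots, amp);
          std::printf("amp(1.5) = %.3f + %.3fi\n", a.real(), a.imag());
          return 0;
      }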

  9. MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Taft, James R.

    1999-01-01

    Recent developments at the NASA Ames Research Center's NAS Division have demonstrated that the new generation of NUMA based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding the processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine and coarse grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.

  10. Hybrid MPI/OpenMP Implementation of the ORAC Molecular Dynamics Program for Generalized Ensemble and Fast Switching Alchemical Simulations.

    PubMed

    Procacci, Piero

    2016-06-27

    We present a new release (6.0β) of the ORAC program [Marsili et al., J. Comput. Chem. 2010, 31, 1106-1116] with hybrid OpenMP/MPI (Open Multiprocessing/Message Passing Interface) multilevel parallelism tailored for generalized ensemble (GE) and fast switching double annihilation (FS-DAM) nonequilibrium technology, aimed at evaluating the binding free energy in drug-receptor systems on high performance computing platforms. The production of the GE or FS-DAM trajectories is handled using a weak scaling parallel approach on the MPI level only, while a strong scaling force decomposition scheme is implemented for intranode computations with shared memory access at the OpenMP level. The efficiency, simplicity, and inherent parallel nature of the ORAC implementation of the FS-DAM algorithm make the code a potentially effective tool for second generation, high throughput virtual screening in drug discovery and design. The code, along with documentation, testing, and ancillary tools, is distributed under the provisions of the General Public License and can be freely downloaded at www.chim.unifi.it/orac .
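
    At the OpenMP level, a force-decomposition step of the kind the abstract mentions can be as simple as a pair loop split across threads with a reduction. A minimal generic sketch follows; the pair potential is a placeholder, not ORAC's force field.

      // Intranode strong-scaling pattern: the N(N-1)/2 pair interactions are
      // decomposed over threads; the reduction avoids races on the accumulator.
      #include <omp.h>
      #include <cmath>
      #include <cstdio>
      #include <vector>

      int main() {
          const int n = 1000;
          std::vector<double> x(n);
          for (int i = 0; i < n; ++i) x[i] = 0.1 * (i + 1);  // toy coordinates

          double energy = 0.0;
          #pragma omp parallel for reduction(+:energy) schedule(dynamic, 32)
          for (int i = 0; i < n - 1; ++i)
              for (int j = i + 1; j < n; ++j) {
                  double r = std::fabs(x[i] - x[j]);
                  energy += 1.0 / (r * r);                   // placeholder potential
              }
          std::printf("E = %f\n", energy);
          return 0;
      }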

  11. Performance Analysis of and Tool Support for Transactional Memory on BG/Q

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schindewolf, M

    2011-12-08

    Martin Schindewolf worked during his internship at the Lawrence Livermore National Laboratory (LLNL) under the guidance of Martin Schulz in the Computer Science Group of the Center for Applied Scientific Computing. We studied the performance of the TM subsystem of BG/Q and researched the possibilities for tool support for TM. To study the performance, we ran CLOMP-TM, a benchmark designed to quantify the overhead of OpenMP and compare different synchronization primitives. To advance CLOMP-TM, we added Message Passing Interface (MPI) routines for a hybrid parallelization. This makes it possible to run multiple MPI tasks, each running OpenMP, on one node. With these enhancements, a beneficial MPI task to OpenMP thread ratio is determined. Further, the synchronization primitives are ranked as a function of the application characteristics. To demonstrate the usefulness of these results, we investigated a real Monte Carlo simulation called the Monte Carlo Benchmark (MCB). Applying the lessons learned yields the best task to thread ratio, and we were also able to tune the synchronization by transactifying the MCB. In addition, we developed tools that capture the performance of the TM run time system and present it to the application's developer. The performance of the TM run time system relies on the built-in statistics. These tools use the Blue Gene Performance Monitoring (BGPM) interface to correlate the statistics from the TM run time system with performance counter values. This combination provides detailed insight into the run time behavior of the application and makes it possible to track down the cause of degraded performance. Furthermore, one tool has been implemented that separates the performance counters into three categories: Successful Speculation, Unsuccessful Speculation and No Speculation. All of the tools are crafted around IBM's xlc compiler for C and C++ and have been run and tested on a Q32 early access system.

  12. Hot Chips and Hot Interconnects for High End Computing Systems

    NASA Technical Reports Server (NTRS)

    Saini, Subhash

    2005-01-01

    I will discuss several processors: 1. the Cray proprietary processor used in the Cray X1; 2. the IBM Power 3 and Power 4 used in IBM SP3 and SP4 systems; 3. the Intel Itanium and Xeon, used in SGI Altix systems and clusters, respectively; 4. the IBM System-on-a-Chip used in the IBM BlueGene/L; 5. the HP Alpha EV68 processor used in the DOE ASCI Q cluster; 6. the SPARC64 V processor, used in the Fujitsu PRIMEPOWER HPC2500; 7. an NEC proprietary processor used in the NEC SX-6/7; 8. the Power 4+ processor used in the Hitachi SR11000; 9. the NEC proprietary processor used in the Earth Simulator. The IBM POWER5 and Red Storm computing systems will also be discussed. The architectures of these processors will first be presented, followed by the interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on the latest NAS Parallel Benchmark results (MPI, OpenMP/HPF and hybrid MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing (quantum computing, DNA computing, cellular engineering, and neural networks).

  13. BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics.

    PubMed

    Ayres, Daniel L; Darling, Aaron; Zwickl, Derrick J; Beerli, Peter; Holder, Mark T; Lewis, Paul O; Huelsenbeck, John P; Ronquist, Fredrik; Swofford, David L; Cummings, Michael P; Rambaut, Andrew; Suchard, Marc A

    2012-01-01

    Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software.

  14. Reconstruction for time-domain in vivo EPR 3D multigradient oximetric imaging--a parallel processing perspective.

    PubMed

    Dharmaraj, Christopher D; Thadikonda, Kishan; Fletcher, Anthony R; Doan, Phuc N; Devasahayam, Nallathamby; Matsumoto, Shingo; Johnson, Calvin A; Cook, John A; Mitchell, James B; Subramanian, Sankaran; Krishna, Murali C

    2009-01-01

    Three-dimensional oximetric Electron Paramagnetic Resonance Imaging using the Single Point Imaging modality generates unpaired spin density and oxygen images that can readily distinguish between normal and tumor tissues in small animals. With fast imaging it is also possible to track the changes in tissue oxygenation in response to the oxygen content of the breathing air. However, this involves dealing with gigabytes of data for each 3D oximetric imaging experiment, involving digital band-pass filtering and background noise subtraction followed by 3D Fourier reconstruction. This process is rather slow on a conventional uniprocessor system. This paper presents a parallelization framework using OpenMP runtime support and parallel MATLAB to execute such computationally intensive programs. The Intel compiler is used to develop a parallel C++ code based on OpenMP. The code is executed on four dual-core AMD Opteron shared memory processors to reduce the computational burden of the filtration task significantly. The results show that the parallel code for filtration achieved a speedup factor of 46.66 over the equivalent serial MATLAB code. In addition, a parallel MATLAB code has been developed to perform the 3D Fourier reconstruction. Speedup factors of 4.57 and 4.25 were achieved during the reconstruction process and oximetry computation, for a data set with 23 x 23 x 23 gradient steps. The execution time has been computed for both the serial and parallel implementations using different dimensions of the data and is presented for comparison. The reported system has been designed to be easily accessible even from low-cost personal computers through a local intranet (NIHnet). The experimental results demonstrate that parallel computing provides a source of high computational power for obtaining biophysical parameters from 3D EPR oximetric imaging, almost in real time.
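
    The filtration stage parallelizes naturally because each trace is processed independently. A hedged C++/OpenMP sketch of that outer-loop structure follows, with a toy smoother standing in for the real digital band-pass filter.

      // Embarrassingly parallel preprocessing: one independent trace per
      // outer-loop iteration, so the loop distributes cleanly over threads.
      #include <omp.h>
      #include <cstdio>
      #include <vector>

      // Toy stand-in for the digital band-pass filter (a 3-point smoother).
      static void bandpass(std::vector<double>& t) {
          for (std::size_t k = 1; k + 1 < t.size(); ++k)
              t[k] = 0.25 * t[k - 1] + 0.5 * t[k] + 0.25 * t[k + 1];
      }

      int main() {
          std::vector<std::vector<double>> traces(
              512, std::vector<double>(4096, 1.0));
          #pragma omp parallel for schedule(dynamic)
          for (std::size_t i = 0; i < traces.size(); ++i)
              bandpass(traces[i]);              // independent work per trace
          std::printf("filtered %zu traces\n", traces.size());
          return 0;
      }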

  16. Performance Analysis of GFDL's GCM Line-By-Line Radiative Transfer Model on GPU and MIC Architectures

    NASA Astrophysics Data System (ADS)

    Menzel, R.; Paynter, D.; Jones, A. L.

    2017-12-01

    Due to their relatively low computational cost, radiative transfer models in global climate models (GCMs) run on traditional CPU architectures generally consist of shortwave and longwave parameterizations over a small number of wavelength bands. With the rise of newer GPU and MIC architectures, however, the performance of high resolution line-by-line radiative transfer models may soon approach that of the physical parameterizations currently employed in GCMs. Here we present an analysis of the current performance of a new line-by-line radiative transfer model under development at GFDL. Although originally designed to exploit GPU architectures through the use of CUDA, the radiative transfer model has recently been extended to include OpenMP in an effort to also effectively target MIC architectures such as Intel's Xeon Phi. Using input data provided by the upcoming Radiative Forcing Model Intercomparison Project (RFMIP, part of CMIP6), we compare model results and performance data for various model configurations and spectral resolutions run on both GPU and Intel Knights Landing architectures to analogous runs of the standard Oxford Reference Forward Model on traditional CPUs.

  16. Acceleration of Semiempirical QM/MM Methods through Message Passing Interface (MPI), Hybrid MPI/Open Multiprocessing, and Self-Consistent Field Accelerator Implementations.

    PubMed

    Ojeda-May, Pedro; Nam, Kwangho

    2017-08-08

    The strategy and implementation of scalable and efficient semiempirical (SE) QM/MM methods in CHARMM are described. The serial version of the code was first profiled to identify routines that required parallelization. Afterward, the code was parallelized and accelerated with three approaches. The first approach was the parallelization of the entire QM/MM routines, including the Fock matrix diagonalization routines, using the CHARMM Message Passing Interface (MPI) machinery. In the second approach, two different self-consistent field (SCF) energy convergence accelerators were implemented using density and Fock matrices as targets for their extrapolations in the SCF procedure. In the third approach, the entire QM/MM and MM energy routines were accelerated by implementing the hybrid MPI/Open Multiprocessing (OpenMP) model, in which both task- and loop-level parallelization strategies were adopted to balance loads between different OpenMP threads. The present implementation was tested on two solvated enzyme systems (including <100 QM atoms) and a symmetric SN2 reaction in water. The MPI version exceeded existing SE QM methods in CHARMM, which include the SCC-DFTB and SQUANTUM methods, by at least 4-fold. The use of SCF convergence accelerators further accelerated the code by ∼12-35% depending on the size of the QM region and the number of CPU cores used. Although the MPI version displayed good scalability, the performance diminished for large numbers of MPI processes due to the overhead associated with MPI communication between nodes. This issue was partially overcome by the hybrid MPI/OpenMP approach, which displayed better scalability for larger numbers of CPU cores (up to 64 CPUs in the tested systems).

  17. Massively parallel and linear-scaling algorithm for second-order Møller–Plesset perturbation theory applied to the study of supramolecular wires

    DOE PAGES

    Kjaergaard, Thomas; Baudin, Pablo; Bykov, Dmytro; ...

    2016-11-16

    Here, we present a scalable cross-platform hybrid MPI/OpenMP/OpenACC implementation of the Divide–Expand–Consolidate (DEC) formalism with portable performance on heterogeneous HPC architectures. The Divide–Expand–Consolidate formalism is designed to reduce the steep computational scaling of conventional many-body methods employed in electronic structure theory to linear scaling, while providing a simple mechanism for controlling the error introduced by this approximation. Our massively parallel implementation of this general scheme has three levels of parallelism, being a hybrid of the loosely coupled task-based parallelization approach and the conventional MPI +X programming model, where X is either OpenMP or OpenACC. We demonstrate strong and weak scalability of this implementation on heterogeneous HPC systems, namely on the GPU-based Cray XK7 Titan supercomputer at the Oak Ridge National Laboratory. Using the “resolution of the identity second-order Møller–Plesset perturbation theory” (RI-MP2) as the physical model for simulating correlated electron motion, the linear-scaling DEC implementation is applied to 1-aza-adamantane-trione (AAT) supramolecular wires containing up to 40 monomers (2440 atoms, 6800 correlated electrons, 24 440 basis functions and 91 280 auxiliary functions). This represents the largest molecular system treated at the MP2 level of theory, demonstrating an efficient removal of the scaling wall pertinent to conventional quantum many-body methods.

  18. MOIL-opt: Energy-Conserving Molecular Dynamics on a GPU/CPU system

    PubMed Central

    Ruymgaart, A. Peter; Cardenas, Alfredo E.; Elber, Ron

    2011-01-01

    We report an optimized version of the molecular dynamics program MOIL that runs on a shared memory system with OpenMP and exploits the power of a Graphics Processing Unit (GPU). The model is a heterogeneous computing system on a single node, with several cores sharing the same memory and a GPU. This is a typical laboratory tool, which provides excellent performance at minimal cost. Besides performance, emphasis is placed on the accuracy and stability of the algorithm, probed by energy conservation for explicit-solvent atomically-detailed models. Especially for long simulations, energy conservation is critical due to the phenomenon known as "energy drift", in which energy errors accumulate linearly as a function of simulation time. To achieve long-time dynamics with acceptable accuracy the drift must be particularly small. We identify several means of controlling long-time numerical accuracy while maintaining excellent speedup. To maintain a high level of energy conservation, SHAKE and the Ewald reciprocal summation are run in double precision. Double precision summation of real-space non-bonded interactions further improves energy conservation. In our best option, the energy drift, using a 1 fs time step while constraining the distances of all bonds, is undetectable in a 10 ns simulation of solvated DHFR (dihydrofolate reductase). Faster options, shaking only bonds with hydrogen atoms, are also very well behaved and have drifts of less than 1 kcal/mol per nanosecond for the same system. CPU/GPU implementations require changes in programming models. We consider the use of a list of neighbors and quadratic versus linear interpolation in lookup tables of different sizes. Quadratic interpolation with a smaller number of grid points is faster than linear lookup tables (with finer representation) without loss of accuracy. Atomic neighbor lists were found most efficient. Typical speedups are about a factor of 10 compared to a single-core single-precision code. PMID:22328867
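
    The table-lookup trade-off the abstract describes is easy to demonstrate. Below is a hedged sketch of quadratic (3-point Lagrange) interpolation in a uniformly spaced table, the option reported to beat a finer linear table at equal accuracy; the tabulated function and spacing are arbitrary choices for illustration.

      // Quadratic interpolation in a uniform lookup table: f is tabulated at
      // spacing h starting at x0; a 3-point Lagrange polynomial through the
      // nearest neighbours evaluates f(x).
      #include <cmath>
      #include <cstdio>
      #include <vector>

      double lookup_quadratic(const std::vector<double>& tab,
                              double x0, double h, double x) {
          std::size_t i = static_cast<std::size_t>((x - x0) / h + 0.5);
          if (i == 0) i = 1;                          // clamp to interior points
          if (i + 1 >= tab.size()) i = tab.size() - 2;
          double t = (x - (x0 + i * h)) / h;          // offset in units of h
          return 0.5 * t * (t - 1.0) * tab[i - 1]     // Lagrange weights for
               + (1.0 - t * t)       * tab[i]         // nodes t = -1, 0, +1
               + 0.5 * t * (t + 1.0) * tab[i + 1];
      }

      int main() {
          const double x0 = 0.0, h = 0.1;
          std::vector<double> tab(101);
          for (std::size_t i = 0; i < tab.size(); ++i)
              tab[i] = std::exp(-(x0 + i * h));       // tabulate exp(-x)
          std::printf("%.6f vs %.6f\n",
                      lookup_quadratic(tab, x0, h, 2.345), std::exp(-2.345));
          return 0;
      }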

  19. Kernel optimization for short-range molecular dynamics

    NASA Astrophysics Data System (ADS)

    Hu, Changjun; Wang, Xianmeng; Li, Jianjiang; He, Xinfu; Li, Shigang; Feng, Yangde; Yang, Shaofeng; Bai, He

    2017-02-01

    To optimize short-range force computations in Molecular Dynamics (MD) simulations, multi-threading and SIMD optimizations are presented in this paper. With respect to multi-threading optimization, a Partition-and-Separate-Calculation (PSC) method is designed to avoid the write conflicts caused by using Newton's third law. Serial bottlenecks are eliminated with no additional memory usage. The method is implemented using the OpenMP model. Furthermore, the PSC method is employed on Intel Xeon Phi coprocessors in both native and offload models. We also evaluate the performance of the PSC method under different thread affinities on the MIC architecture. For SIMD execution, we analyze the performance impact of the "if-clause" of the cutoff radius check in the PSC method. The experimental results show that our PSC method is more efficient than some traditional methods. In double precision, our 256-bit SIMD implementation is about 3 times faster than the scalar version.
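
    For contrast with the paper's PSC scheme, the common alternative it improves upon can be sketched as follows: per-thread force buffers let each thread apply Newton's third law without write conflicts, at the cost of the extra memory that PSC avoids through geometric partitioning. The pair force below is a placeholder.

      // Buffered force accumulation: each thread updates a private buffer, so
      // both f[i] and f[j] can be written per pair (action and reaction)
      // without races; the buffers are merged afterwards.
      #include <omp.h>
      #include <algorithm>
      #include <cstdio>
      #include <vector>

      void forces(const std::vector<double>& x, std::vector<double>& f) {
          const int n = static_cast<int>(x.size());
          std::fill(f.begin(), f.end(), 0.0);
          #pragma omp parallel
          {
              std::vector<double> fl(n, 0.0);        // thread-private buffer
              #pragma omp for schedule(dynamic, 16)
              for (int i = 0; i < n - 1; ++i)
                  for (int j = i + 1; j < n; ++j) {
                      double fij = x[i] - x[j];      // placeholder pair force
                      fl[i] += fij;                  // action ...
                      fl[j] -= fij;                  // ... and reaction
                  }
              #pragma omp critical                   // merge buffers serially
              for (int i = 0; i < n; ++i) f[i] += fl[i];
          }
      }

      int main() {
          std::vector<double> x(100), f(100);
          for (int i = 0; i < 100; ++i) x[i] = 0.01 * i * i;
          forces(x, f);
          std::printf("f[0] = %f\n", f[0]);
          return 0;
      }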

  1. Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liao, C; Quinlan, D J; Willcock, J J

    2008-12-12

    Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has focused only on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.

  2. GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

    NASA Astrophysics Data System (ADS)

    Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

    2017-08-01

    The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), a multi-scale chemical transport model used for air quality forecasting and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 v4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to a hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance, and the same optimisations also work well for the Intel Xeon Broadwell processor, specifically the E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51× faster on KNL and 2.77× faster on the CPU. Moreover, the optimised version ran at 26% lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5% more efficient in power consumption than the CPU platform. The optimisations also enabled much greater parallel scalability on both the CPU cluster and the KNL cluster, scaled to 40 CPU nodes and 30 KNL nodes with parallel efficiencies of 70.4% and 42.2%, respectively.
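
    As a rough illustration of optimisation (2), engaging wide vector units usually comes down to making inner loops unit-stride and branch-free so the compiler can vectorize them. A hedged C++ sketch with an OpenMP simd directive follows; the kernel is illustrative, not GNAQPMS code.

      // Unit-stride, branch-free inner loop that the compiler can map onto
      // 512-bit vector units (compile with OpenMP SIMD support enabled).
      #include <cstdio>
      #include <vector>

      void rate_update(std::vector<double>& conc,
                       const std::vector<double>& rate, double dt) {
          const std::size_t n = conc.size();
          #pragma omp simd
          for (std::size_t i = 0; i < n; ++i)
              conc[i] *= 1.0 - dt * rate[i];   // simple vectorizable body
      }

      int main() {
          std::vector<double> c(1024, 1.0), r(1024, 0.5);
          rate_update(c, r, 1.0e-3);
          std::printf("c[0] = %f\n", c[0]);
          return 0;
      }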

  3. PAREMD: A parallel program for the evaluation of momentum space properties of atoms and molecules

    NASA Astrophysics Data System (ADS)

    Meena, Deep Raj; Gadre, Shridhar R.; Balanarayan, P.

    2018-03-01

    The present work describes a code for evaluating the electron momentum density (EMD), its moments, and the associated Shannon information entropy for a multi-electron molecular system. The code works specifically with electronic wave functions obtained from traditional electronic structure packages such as GAMESS and GAUSSIAN. For the momentum space orbitals, the general expression for Gaussian basis sets in position space is analytically Fourier transformed to momentum space Gaussian basis functions. The molecular orbital coefficients of the wave function are taken as input from the output file of the electronic structure calculation. The analytic expressions for the EMD are evaluated over a fine grid, and the accuracy of the code is verified by a normalization check and a numerical kinetic energy evaluation, which is compared with the analytic kinetic energy given by the electronic structure package. Apart from the electron momentum density, the electron density in position space has also been integrated into this package. The program is written in C++ and is executed through a shell script. It is also tuned for multicore machines with shared memory through OpenMP. The program has been tested for a variety of molecules and correlated methods such as CISD, second-order Møller-Plesset (MP2) theory and density functional methods. For correlated methods, the PAREMD program uses natural spin orbitals as input. The program has been benchmarked for a variety of Gaussian basis sets for different molecules, showing a linear speedup on a parallel architecture.
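
    The analytic step underlying the momentum-space basis functions is the standard Gaussian integral; in one dimension, up to the convention-dependent normalization,

      \int_{-\infty}^{\infty} e^{-\alpha x^{2}}\, e^{-ipx}\, dx
        \;=\; \sqrt{\frac{\pi}{\alpha}}\; e^{-p^{2}/(4\alpha)} ,

    so a Gaussian primitive transforms into another Gaussian in momentum space, and the polynomial prefactors of higher angular momentum functions carry over as polynomial prefactors in p.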

  4. Compiled MPI: Cost-Effective Exascale Applications Development

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bronevetsky, G; Quinlan, D; Lumsdaine, A

    2012-04-10

    The complexity of petascale and exascale machines makes it increasingly difficult to develop applications that can take advantage of them. Future systems are expected to feature billion-way parallelism, complex heterogeneous compute nodes and poor availability of memory (Peter Kogge, 2008). This new challenge for application development is motivating a significant amount of research and development on new programming models and runtime systems designed to simplify large-scale application development. Unfortunately, DoE has a significant multi-decadal investment in a large family of mission-critical scientific applications. Scaling these applications to exascale machines will require a significant investment that will dwarf the costs of hardware procurement. A key reason for the difficulty in transitioning today's applications to exascale hardware is their reliance on explicit programming techniques, such as the Message Passing Interface (MPI) programming model, to enable parallelism. MPI provides a portable and high performance message-passing system that enables scalable performance on a wide variety of platforms. However, it also forces developers to lock the details of parallelization together with application logic, making it very difficult to adapt the application to significant changes in the underlying system. Further, MPI's explicit interface makes it difficult to separate the application's synchronization and communication structure, reducing the amount of support that can be provided by compiler and run-time tools. This is in contrast to the recent research on more implicit parallel programming models, such as Chapel, OpenMP and OpenCL, which promise to provide significantly more flexibility at the cost of reimplementing significant portions of the application. We are developing CoMPI, a novel compiler-driven approach to enable existing MPI applications to scale to exascale systems with minimal modifications that can be made incrementally over the application's lifetime. It includes: (1) a new set of source code annotations, inserted either manually or automatically, that clarify the application's use of MPI to the compiler infrastructure, enabling greater accuracy where needed; (2) a compiler transformation framework that leverages these annotations to transform the original MPI source code to improve its performance and scalability; (3) novel MPI runtime implementation techniques that provide a rich set of functionality extensions to be used by applications that have been transformed by our compiler; and (4) a novel compiler analysis that leverages simple user annotations to automatically extract the application's communication structure and synthesize most complex code annotations.

  5. PREMER: a Tool to Infer Biological Networks.

    PubMed

    Villaverde, Alejandro F; Becker, Kolja; Banga, Julio R

    2017-10-04

    Inferring the structure of unknown cellular networks is a main challenge in computational biology. Data-driven approaches based on information theory can determine the existence of interactions among network nodes automatically. However, the elucidation of certain features - such as distinguishing between direct and indirect interactions or determining the direction of a causal link - requires estimating information-theoretic quantities in a multidimensional space. This can be a computationally demanding task, which acts as a bottleneck for the application of elaborate algorithms to large-scale network inference problems. The computational cost of such calculations can be alleviated by the use of compiled programs and parallelization. To this end we have developed PREMER (Parallel Reverse Engineering with Mutual information & Entropy Reduction), a software toolbox that can run in parallel and sequential environments. It uses information theoretic criteria to recover network topology and determine the strength and causality of interactions, and allows incorporating prior knowledge, imputing missing data, and correcting outliers. PREMER is a free, open source software tool that does not require any commercial software. Its core algorithms are programmed in FORTRAN 90 and implement OpenMP directives. It has user interfaces in Python and MATLAB/Octave, and runs on Windows, Linux and OSX (https://sites.google.com/site/premertoolbox/).

  6. High-Productivity Computing in Computational Physics Education

    NASA Astrophysics Data System (ADS)

    Tel-Zur, Guy

    2011-03-01

    We describe the development of a new course in Computational Physics at Ben-Gurion University. This elective course for 3rd year undergraduates and MSc students is taught during one semester. Computational Physics is by now well accepted as the Third Pillar of Science. This paper's claim is that modern Computational Physics education should also deal with High-Productivity Computing. The traditional approach to teaching Computational Physics emphasizes ``Correctness'' and then ``Accuracy,'' and we add ``Performance.'' Along with topics in Mathematical Methods and case studies in Physics, the course devotes a significant amount of time to ``Mini-Courses'' on topics such as: High-Throughput Computing - Condor, Parallel Programming - MPI and OpenMP, How to Build a Beowulf, Visualization, and Grid and Cloud Computing. The course intends to teach neither new physics nor new mathematics; it is focused on an integrated approach to solving problems, starting from the physics problem, through the corresponding mathematical solution and the numerical scheme, to writing an efficient computer code and finally analysis and visualization.

  7. Pythran: enabling static optimization of scientific Python programs

    NASA Astrophysics Data System (ADS)

    Guelton, Serge; Brunet, Pierrick; Amini, Mehdi; Merlini, Adrien; Corbillon, Xavier; Raynaud, Alan

    2015-01-01

    Pythran is an open source static compiler that turns modules written in a subset of the Python language into native ones. Assuming that scientific modules do not rely much on the dynamic features of the language, it trades them for powerful, possibly inter-procedural, optimizations. These include detection of pure functions, temporary allocation removal, constant folding, Numpy ufunc fusion and parallelization, explicit thread-level parallelism through OpenMP annotations, false variable polymorphism pruning, and automatic vector instruction generation such as AVX or SSE. In addition to these compilation steps, Pythran provides a C++ runtime library that leverages the C++ STL to provide generic containers, and the Numeric Template Toolbox for Numpy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical idioms such as expression templates. Unlike the Cython approach, Pythran input code remains compatible with the Python interpreter. Output code is generally as efficient as the annotated Cython equivalent, if not more so, but without the loss of backward compatibility.

  8. Revised Extended Grid Library

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Martz, Roger L.

    The Revised Eolus Grid Library (REGL) is a mesh-tracking library that was developed for use with the MCNP6™ computer code so that (radiation) particles can track on an unstructured mesh. The unstructured mesh is a finite element representation of any geometric solid model created with a state-of-the-art CAE/CAD tool. The mesh-tracking library is written using modern Fortran and programming standards; the library is Fortran 2003 compliant. The library was created with a defined application programmer interface (API) so that it could easily integrate with other particle tracking/transport codes. The library does not handle parallel processing via the Message Passing Interface (MPI), but has been used successfully where the host code handles the MPI calls. The library is thread-safe and supports the OpenMP paradigm. As a library, all features are available through the API, and overall a tight coupling between it and the host code is required. Features of the library are summarized in the following list: can accommodate first and second order 4, 5, and 6-sided polyhedra; any combination of element types may appear in a single geometry model; parts may not contain tetrahedra mixed with other element types; pentahedra and hexahedra can be together in the same part; robust handling of overlaps and gaps; tracks element-to-element to produce path length results at the element level; finds element numbers for a given mesh location; finds intersection points on element faces for the particle tracks; produces a data file for post-processing results analysis; reads Abaqus .inp input (ASCII) files to obtain information for the global mesh-model; supports parallel input processing via MPI; and supports parallel particle transport via both MPI and OpenMP.

  9. A hybrid parallel framework for the cellular Potts model simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jiang, Yi; He, Kejing; Dong, Shoubin

    2009-01-01

    The Cellular Potts Model (CPM) has been widely used for biological simulations. However, most current implementations are either sequential or approximated, which cannot be used for large scale complex 3D simulations. In this paper we present a hybrid parallel framework for CPM simulations. The time-consuming PDE solving, cell division, and cell reaction operations are distributed to clusters using the Message Passing Interface (MPI). The Monte Carlo lattice update is parallelized on shared-memory SMP systems using OpenMP. Because the Monte Carlo lattice update is much faster than the PDE solving and SMP systems are more and more common, this hybrid approach achieves good performance and high accuracy at the same time. Based on the parallel Cellular Potts Model, we studied avascular tumor growth using a multiscale model. The application and performance analysis show that the hybrid parallel framework is quite efficient. The hybrid parallel CPM can be used for the large scale simulation (~10^8 sites) of the complex collective behavior of numerous cells (~10^6).

  10. Scaling Up Coordinate Descent Algorithms for Large ℓ1 Regularization Problems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Scherrer, Chad; Halappanavar, Mahantesh; Tewari, Ambuj

    2012-07-03

    We present a generic framework for parallel coordinate descent (CD) algorithms that has as special cases the original sequential algorithms of Cyclic CD and Stochastic CD, as well as the recent parallel Shotgun algorithm of Bradley et al. We introduce two novel parallel algorithms that are also special cases---Thread-Greedy CD and Coloring-Based CD---and give performance measurements for an OpenMP implementation of these.

  11. spammpack, Version 2013-06-18

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2014-01-17

    This library is an implementation of the Sparse Approximate Matrix Multiplication (SpAMM) algorithm. It provides a matrix data type and an approximate matrix product, which exhibits linear-scaling computational complexity for matrices with decay. The product error and the performance of the multiply can be tuned by choosing an appropriate tolerance. The library can be compiled for serial execution, or for parallel execution on shared memory systems with an OpenMP capable compiler.
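
    The pruning idea behind SpAMM can be sketched in a few lines: recurse over quadtree blocks of the matrices and skip any sub-product whose norm bound falls below the tolerance. The toy node type below is invented for illustration and is not the library's data structure; it assumes the result tree is pre-allocated to the same depth as the operands.

      // SpAMM-style recursive multiply with norm-based pruning.
      #include <array>
      #include <cstdio>
      #include <memory>

      struct Node {
          double norm = 0.0;     // Frobenius norm of the block this node holds
          double value = 0.0;    // payload used at leaf level (1x1 block)
          std::array<std::unique_ptr<Node>, 4> kid{};  // 2x2 children, row-major
          bool leaf() const { return !kid[0]; }
      };

      void multiply(const Node* a, const Node* b, Node* c, double tol) {
          // Prune any sub-product whose bound ||A_blk||*||B_blk|| is below tol.
          if (!a || !b || !c || a->norm * b->norm < tol) return;
          if (a->leaf() && b->leaf()) { c->value += a->value * b->value; return; }
          for (int i = 0; i < 2; ++i)          // C_ij += sum_k A_ik * B_kj
              for (int j = 0; j < 2; ++j)
                  for (int k = 0; k < 2; ++k)
                      multiply(a->kid[2 * i + k].get(), b->kid[2 * k + j].get(),
                               c->kid[2 * i + j].get(), tol);
      }

      int main() {
          Node a, b, c;                        // three single-leaf matrices
          a.value = 3.0; a.norm = 3.0;
          b.value = 2.0; b.norm = 2.0;
          multiply(&a, &b, &c, 1e-8);
          std::printf("c = %g\n", c.value);    // prints c = 6
          return 0;
      }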

  12. A multithreaded and GPU-optimized compact finite difference algorithm for turbulent mixing at high Schmidt number using petascale computing

    NASA Astrophysics Data System (ADS)

    Clay, M. P.; Yeung, P. K.; Buaria, D.; Gotoh, T.

    2017-11-01

    Turbulent mixing at high Schmidt number is a multiscale problem which places demanding requirements on direct numerical simulations to resolve fluctuations down to the Batchelor scale. We use a dual-grid, dual-scheme and dual-communicator approach where velocity and scalar fields are computed by separate groups of parallel processes, the latter using a combined compact finite difference (CCD) scheme on a finer grid with a static 3-D domain decomposition free of the communication overhead of memory transposes. A high degree of scalability is achieved for an 8192³ scalar field at Schmidt number 512 in turbulence with a modest inertial range, by overlapping communication with computation whenever possible. On the Cray XE6 partition of Blue Waters, use of a dedicated thread for communication combined with OpenMP locks and nested parallelism reduces CCD timings by 34% compared to an MPI baseline. The code has been further optimized for the 27-petaflops Cray XK7 machine Titan using GPUs as accelerators with the latest OpenMP 4.5 directives, giving a 2.7× speedup compared to CPU-only execution at the largest problem size. Supported by NSF Grant ACI-1036170, the NCSA Blue Waters Project with subaward via UIUC, and a DOE INCITE allocation at ORNL.
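
    The dedicated-communication-thread pattern can be sketched as follows: inside one parallel region, thread 0 drives MPI while the remaining threads compute points that need no halo data, overlapping the two. This is a skeleton of the general pattern only (the paper adds OpenMP locks and nested parallelism); the stubs stand in for the real halo exchange and CCD update.

      // One team: thread 0 communicates, the others compute; the implicit
      // barrier at the end of the parallel region separates the phases.
      #include <mpi.h>
      #include <omp.h>

      void exchange_halos()   { /* MPI_Isend/MPI_Irecv + MPI_Waitall here */ }
      void compute_interior() { /* update points that need no halo data  */ }
      void compute_boundary() { /* update points that needed the halo    */ }

      int main(int argc, char** argv) {
          int provided = 0;  // FUNNELED: only the main thread makes MPI calls
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

          #pragma omp parallel
          {
              if (omp_get_thread_num() == 0)
                  exchange_halos();        // communication thread
              else
                  compute_interior();      // all other threads compute
          }   // implicit barrier: halo data are in place past this point

          compute_boundary();              // finish the halo-dependent points
          MPI_Finalize();
          return 0;
      }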

  13. Comparison of Origin 2000 and Origin 3000 Using NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Turney, Raymond D.

    2001-01-01

    This report describes results of benchmark tests on the Origin 3000 system currently being installed at the NASA Ames National Advanced Supercomputing facility. This machine will ultimately contain 1024 R14K processors. The first part of the system, installed in November 2000 and named mendel, is an Origin 3000 with 128 R12K processors. For comparison purposes, the tests were also run on lomax, an Origin 2000 with R12K processors. The BT, LU, and SP application benchmarks in the NAS Parallel Benchmark Suite and the kernel benchmark FT were chosen to determine system performance and measure the impact of changes on the machine as it evolves. Having been written to measure performance on Computational Fluid Dynamics applications, these benchmarks are assumed appropriate to represent the NAS workload. Since the NAS runs both message passing (MPI) and shared-memory, compiler directive type codes, both MPI and OpenMP versions of the benchmarks were used. The MPI versions used were the latest official release of the NAS Parallel Benchmarks, version 2.3. The OpenMP versions used were PBN3b2, a beta version that is in the process of being released. NPB 2.3 and PBN3b2 are technically different benchmarks, and NPB results are not directly comparable to PBN results.

  14. Neptune: An astrophysical smooth particle hydrodynamics code for massively parallel computer architectures

    NASA Astrophysics Data System (ADS)

    Sandalski, Stou

    Smooth particle hydrodynamics is an efficient method for modeling the dynamics of fluids. It is commonly used to simulate astrophysical processes such as binary mergers. We present a newly developed GPU accelerated smooth particle hydrodynamics code for astrophysical simulations. The code is named neptune after the Roman god of water. It is written in OpenMP parallelized C++ and OpenCL and includes octree based hydrodynamic and gravitational acceleration. The design relies on object-oriented methodologies in order to provide a flexible and modular framework that can be easily extended and modified by the user. Several pre-built scenarios for simulating collisions of polytropes and black-hole accretion are provided. The code is released under the MIT Open Source license and publicly available at http://code.google.com/p/neptune-sph/.

  15. Application of Adjoint Method and Spectral-Element Method to Tomographic Inversion of Regional Seismological Structure Beneath Japanese Islands

    NASA Astrophysics Data System (ADS)

    Tsuboi, S.; Miyoshi, T.; Obayashi, M.; Tono, Y.; Ando, K.

    2014-12-01

    Recent progress in large scale computing using waveform modeling techniques and high performance computing facilities has demonstrated the possibility of performing full-waveform inversion of three dimensional (3D) seismological structure inside the Earth. We apply the adjoint method (Liu and Tromp, 2006) to obtain the 3D structure beneath the Japanese Islands. First we implemented the Spectral-Element Method on the K computer in Kobe, Japan. We optimized SPECFEM3D_GLOBE (Komatitsch and Tromp, 2002) using OpenMP so that the code fits the hybrid architecture of the K computer. We can now use 82,134 nodes of the K computer (657,072 cores) to compute synthetic waveforms with about 1 sec accuracy for a realistic 3D Earth model, with a performance of 1.2 PFLOPS. We use this optimized SPECFEM3D_GLOBE code, take one chunk around the Japanese Islands from the global mesh, and compute synthetic seismograms with an accuracy of about 10 seconds. We use the GAP-P2 mantle tomography model (Obayashi et al., 2009) as the initial 3D model and use as many broadband seismic stations available in this region as possible to perform the inversion. We then use the time windows for body waves and surface waves to compute adjoint sources and calculate adjoint kernels for the seismic structure. We have performed several iterations and obtained an improved 3D structure beneath the Japanese Islands. The result demonstrates that waveform misfits between observed and theoretical seismograms improve as the iteration proceeds. We are now preparing to use much shorter periods in our synthetic waveform computation and to obtain seismic structure for basin scale models, such as the Kanto basin, where there is a dense seismic network and high seismic activity. Acknowledgements: This research was partly supported by the MEXT Strategic Program for Innovative Research. We used F-net seismograms of the National Research Institute for Earth Science and Disaster Prevention.

  16. fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data.

    PubMed

    Hung, Ling-Hong; Samudrala, Ram

    2014-06-15

    fast_protein_cluster is a fast, parallel and memory efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project. fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco. fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom streaming Single Instruction Multiple Data (SIMD) extensions and advanced vector extension intrinsics code accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the M.I.T. license. (http://software.compbio.washington.edu/fast_protein_cluster) © The Author 2014. Published by Oxford University Press.

  17. Study on Development of 1D-2D Coupled Real-time Urban Inundation Prediction model

    NASA Astrophysics Data System (ADS)

    Lee, Seungsoo

    2017-04-01

    In recent years, we have suffered from abnormal weather conditions due to climate change around the world; countermeasures for flood defense are therefore an urgent task. In this research, a study on the development of a 1D-2D coupled real-time urban inundation prediction model using predicted precipitation data based on remote sensing technology is conducted. The one-dimensional (1D) sewerage system analysis model introduced by Lee et al. (2015) is used to simulate inlet and overflow phenomena by interacting with surface flows as well as flows in conduits. A two-dimensional (2D) grid mesh refinement method is applied to depict road networks while keeping the calculation time reasonable. The 2D surface model is coupled with the 1D sewerage analysis model in order to consider bi-directional flow between the two. The OpenMP parallel computing method is also applied to reduce calculation time. The model is evaluated by applying it to the extreme rainfall event of 25 August 2014, which caused severe inundation damage in Busan, Korea. The Oncheoncheon basin is selected as the study basin, and observed radar data are taken as the predicted rainfall data. The model shows acceptable calculation speed and accuracy. It is therefore expected that the model can be used in a real-time urban inundation forecasting system to minimize damage.

  18. Accelerating the Pace of Protein Functional Annotation With Intel Xeon Phi Coprocessors.

    PubMed

    Feinstein, Wei P; Moreno, Juana; Jarrell, Mark; Brylinski, Michal

    2015-06-01

    Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.

  19. What Scientific Applications can Benefit from Hardware Transactional Memory?

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schindewolf, M; Bihari, B; Gyllenhaal, J

    2012-06-04

    Achieving efficient and correct synchronization of multiple threads is a difficult and error-prone task at small scale and, as we march towards extreme scale computing, will be even more challenging when the resulting application is supposed to utilize millions of cores efficiently. Transactional Memory (TM) is a promising technique to ease the burden on the programmer, but has only recently become available on commercial hardware in the new Blue Gene/Q system, and hence its real benefit for realistic applications has not yet been studied. This paper presents the first performance results of TM embedded into OpenMP on a prototype system of BG/Q and characterizes code properties that will likely lead to benefits when augmented with TM primitives. We first study the influence of thread count, environment variables and memory layout on TM performance and identify code properties that will yield performance gains with TM. Second, we evaluate the combination of OpenMP with multiple synchronization primitives on top of MPI to determine suitable task to thread ratios per node. Finally, we condense our findings into a set of best practices. These are applied to a Monte Carlo Benchmark and a Smoothed Particle Hydrodynamics method. In both cases an optimized TM version, executed with 64 threads on one node, outperforms a simple TM implementation. MCB with optimized TM yields a speedup of 27.45 over baseline.

  20. From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

    DOE PAGES

    Blazewicz, Marek; Hinder, Ian; Koppelman, David M.; ...

    2013-01-01

    Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.

  1. Acoustic 3D modeling by the method of integral equations

    NASA Astrophysics Data System (ADS)

    Malovichko, M.; Khokhlov, N.; Yavich, N.; Zhdanov, M.

    2018-02-01

    This paper presents a parallel algorithm for frequency-domain acoustic modeling by the method of integral equations (IE). The algorithm is applied to seismic simulation. The IE method reduces the size of the problem but leads to a dense system matrix. Tolerable memory consumption and numerical complexity were achieved by applying an iterative solver, accompanied by an efficient matrix-vector multiplication operation based on the fast Fourier transform (FFT). We demonstrate that the IE system matrix is better conditioned than that of the finite-difference (FD) method, and discuss its relation to a specially preconditioned FD matrix. We considered several methods of matrix-vector multiplication for the free-space and layered host models. The developed algorithm and computer code were benchmarked against the FD time-domain solution. It was demonstrated that the method can accurately calculate the seismic field for models with sharp material boundaries and a point source and receiver located close to the free surface. We used OpenMP to speed up the matrix-vector multiplication, while MPI was used to speed up the solution of the system of equations and to parallelize across multiple sources. Practical examples and efficiency tests are presented as well.
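
    In an FFT-based matrix-vector product of this kind, the heavy element-wise spectral multiply is an embarrassingly parallel loop, which is the natural place for the OpenMP speedup mentioned above. A minimal sketch, assuming the forward and inverse FFTs come from a separate (possibly threaded) library:

      #include <complex>
      #include <vector>

      using cd = std::complex<double>;

      // Element-wise product in the spectral domain: out_hat = kernel_hat * field_hat.
      // This is the OpenMP-parallel core of a convolution-based IE matvec.
      void spectral_multiply(const std::vector<cd>& kernel_hat,
                             const std::vector<cd>& field_hat,
                             std::vector<cd>& out_hat) {
          const long n = (long)out_hat.size();
          #pragma omp parallel for
          for (long k = 0; k < n; ++k)
              out_hat[k] = kernel_hat[k] * field_hat[k];
      }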

  2. Parallelization strategies for continuum-generalized method of moments on the multi-thread systems

    NASA Astrophysics Data System (ADS)

    Bustamam, A.; Handhika, T.; Ernastuti; Kerami, D.

    2017-07-01

    The Continuum-Generalized Method of Moments (C-GMM) remedies the shortfall of the Generalized Method of Moments (GMM), which is not as efficient as the Maximum Likelihood estimator, by using a continuum set of moment conditions in the GMM framework. However, this computation takes a very long time because the regularization parameter must be optimized. Unfortunately, these calculations are processed sequentially, even though practically all modern computers offer hierarchical memory systems and hyperthreading technology, which allow for parallel computing. This paper aims to speed up the calculation of C-GMM by designing a parallel algorithm for C-GMM on multi-thread systems. First, parallel regions are detected in the original C-GMM algorithm: two of them, the outer loop and the inner loop, contribute significantly to the reduction of computational time. This parallel algorithm is then implemented with the standard shared-memory application programming interface, Open Multi-Processing (OpenMP). The experiment shows that outer-loop parallelization is the best strategy for any number of observations.
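
    The outer-loop versus inner-loop tradeoff reported above comes down to where the parallel region is opened. A minimal sketch with a placeholder kernel (not the actual C-GMM objective): parallelizing the outer loop forks the thread team once and gives each thread large work items.

      // Placeholder for the per-(i,j) work of the C-GMM evaluation (assumption).
      double work_item(int i, int j) { return (double)i * (double)j; }

      double evaluate(int n_outer, int n_inner) {
          double total = 0.0;
          // Outer-loop strategy: one fork, coarse-grained work per thread.
          #pragma omp parallel for reduction(+:total) schedule(static)
          for (int i = 0; i < n_outer; ++i)
              for (int j = 0; j < n_inner; ++j)   // runs serially inside a thread
                  total += work_item(i, j);
          return total;
      }

    Parallelizing the inner loop instead would enter and leave a parallel region n_outer times, which is exactly the overhead the experiments above penalize.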

  3. BLESS 2: accurate, memory-efficient and fast error correction method.

    PubMed

    Heo, Yun; Ramachandran, Anand; Hwu, Wen-Mei; Ma, Jian; Chen, Deming

    2016-08-01

    The most important features of error correction tools for sequencing data are accuracy, memory efficiency and fast runtime. The previous version of BLESS was highly memory-efficient and accurate, but it was too slow to handle reads from large genomes. We have developed a new version of BLESS to improve runtime and accuracy while maintaining a small memory usage. The new version, called BLESS 2, has an error correction algorithm that is more accurate than BLESS, and the algorithm has been parallelized using hybrid MPI and OpenMP programming. BLESS 2 was compared with five top-performing tools, and it was found to be the fastest when it was executed on two computing nodes using MPI, with each node containing twelve cores. Also, BLESS 2 showed at least 11% higher gain while retaining the memory efficiency of the previous version for large genomes. Availability: freely available at https://sourceforge.net/projects/bless-ec. Contact: dchen@illinois.edu. Supplementary data are available at Bioinformatics online.
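
    A minimal skeleton of the hybrid MPI + OpenMP pattern the abstract describes: ranks split the read set across nodes, and threads correct reads within a node. The read count and the correction routine are placeholders, not BLESS 2's actual implementation.

      #include <mpi.h>

      void correct_read(long /*read_id*/) { /* k-mer-based correction stub */ }

      int main(int argc, char** argv) {
          int provided, rank, size;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          const long n_reads = 1000000;                 // assumed workload size
          // Cyclic split across ranks; dynamic chunks across threads in a rank.
          #pragma omp parallel for schedule(dynamic, 1024)
          for (long r = rank; r < n_reads; r += size)
              correct_read(r);

          MPI_Finalize();
          return 0;
      }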

  4. CCC7-119 Reactive Molecular Dynamics Simulations of Hot Spot Growth in Shocked Energetic Materials

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Thompson, Aidan P.

    2015-03-01

    The purpose of this work is to understand how defects control initiation in energetic materials used in stockpile components. Sequoia gives us the core count to run very large-scale simulations of up to 10 million atoms, and using an OpenMP-threaded implementation of the ReaxFF package in LAMMPS, we have been able to get good parallel efficiency running on 16k nodes of Sequoia with 1 hardware thread per core.

  5. OpenMP Performance on the Columbia Supercomputer

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang; Hood, Robert

    2005-01-01

    This presentation discusses the Columbia supercomputer, one of the world's fastest, providing 61 TFLOPs (as of 10/20/04). It was conceived, designed, built, and deployed in just 120 days. Columbia is a 20-node supercomputer built on proven 512-processor nodes and is the largest SGI system in the world, with over 10,000 Intel Itanium 2 processors. It provides the largest node size incorporating commodity parts (512 processors) and the largest shared-memory environment (2048 processors), and with 88% efficiency it tops the scalar systems on the Top500 list.

  6. OpenMP Parallelization and Optimization of Graph-based Machine Learning Algorithms

    DTIC Science & Technology

    2016-05-01

    ... composed of hyperspectral video sequences recording the release of chemical plumes at the Dugway Proving Ground. We use the 329 frames of the ... video. Each frame is a hyperspectral image with dimension 128 × 320 × 129, where 129 is the dimension of the channel of each pixel. The total number of ... Then we use the nested for-loop to calculate the values of W_XY by the formula (1). We then put the corresponding value in an array which ...
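
    The nested loop that fills the pairwise weight matrix W_XY can be parallelized directly with OpenMP. Formula (1) is not reproduced in the excerpt above, so a Gaussian similarity on 129-channel pixel vectors stands in for it below; all names are illustrative.

      #include <cmath>
      #include <vector>

      // Fill the n-by-n weight matrix W from per-pixel channel vectors (n x d).
      // collapse(2) spreads both loop levels over the OpenMP threads.
      void build_weights(const std::vector<std::vector<double>>& pix,
                         std::vector<double>& W, double sigma) {
          const long n = (long)pix.size();
          const long d = (long)pix[0].size();          // e.g., d = 129 channels
          #pragma omp parallel for collapse(2)
          for (long i = 0; i < n; ++i)
              for (long j = 0; j < n; ++j) {
                  double s = 0.0;
                  for (long k = 0; k < d; ++k) {
                      const double t = pix[i][k] - pix[j][k];
                      s += t * t;
                  }
                  W[i * n + j] = std::exp(-s / (2.0 * sigma * sigma));
              }
      }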

  7. Parallel Execution of Functional Mock-up Units in Buildings Modeling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ozmen, Ozgur; Nutaro, James J.; New, Joshua Ryan

    2016-06-30

    A Functional Mock-up Interface (FMI) defines a standardized interface to be used in computer simulations to develop complex cyber-physical systems. FMI implementation by a software modeling tool enables the creation of a simulation model that can be interconnected, or the creation of a software library called a Functional Mock-up Unit (FMU). This report describes an FMU wrapper implementation that imports FMUs into a C++ environment and uses an Euler solver that executes FMUs in parallel using Open Multi-Processing (OpenMP). The purpose of this report is to elucidate the runtime performance of the solver when a multi-component system is imported as a single FMU (for the whole system) or as multiple FMUs (for different groups of components as sub-systems). This performance comparison is conducted using two test cases: (1) a simple, multi-tank problem; and (2) a more realistic use case based on the Modelica Buildings Library. In both test cases, the performance gains are promising when each FMU consists of a large number of states and state events that are wrapped in a single FMU. Load balancing is demonstrated to be a critical factor in speeding up parallel execution of multiple FMUs.
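
    A sketch of the execution pattern the report studies: at every Euler step all FMUs advance independently, so the per-step loop can be threaded, with a dynamic schedule absorbing the load imbalance the report identifies as critical. The Fmu type and its methods are hypothetical stand-ins for an FMI wrapper API.

      #include <vector>

      struct Fmu {
          void set_inputs() { /* read coupling variables from neighbors */ }
          void step(double /*t*/, double /*h*/) { /* Euler-advance internal states */ }
      };

      void simulate(std::vector<Fmu>& fmus, double t_end, double h) {
          for (double t = 0.0; t < t_end; t += h) {
              // FMUs are independent within one step; dynamic scheduling
              // balances FMUs with different state/event counts.
              #pragma omp parallel for schedule(dynamic)
              for (long i = 0; i < (long)fmus.size(); ++i) {
                  fmus[i].set_inputs();
                  fmus[i].step(t, h);
              }
          }
      }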

  8. A GPU-accelerated semi-implicit fractional-step method for numerical solutions of incompressible Navier-Stokes equations

    NASA Astrophysics Data System (ADS)

    Ha, Sanghyun; Park, Junshin; You, Donghyun

    2018-01-01

    Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. The Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution methods used in the semi-implicit fractional-step method take advantage of multiple tridiagonal matrices whose inversion is known as the major bottleneck for acceleration on a typical multi-core machine. A novel implementation of the semi-implicit fractional-step method designed for GPU acceleration of the incompressible Navier-Stokes equations is presented. Aspects of the programming model of the Compute Unified Device Architecture (CUDA), which are critical to the bandwidth-bound nature of the present method, are discussed in detail. A data layout for efficient use of CUDA libraries is proposed for acceleration of tridiagonal matrix inversion and fast Fourier transform. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Performance of the present method using CUDA is assessed by comparing the speed of solving three tridiagonal matrices using ADI with the speed of solving one heptadiagonal matrix using a conjugate gradient method. An overall speedup of 20 times is achieved using a Tesla K40 GPU in comparison with a single-core Xeon E5-2660 v3 CPU in simulations of turbulent boundary-layer flow over a flat plate conducted on over 134 million grids. Enhanced performance of 48 times speedup is reached for the same problem using a Tesla P100 GPU.
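
    The CPU/GPU overlap mentioned above can be expressed with OpenMP sections: one section drives the device while the other reduces statistics from a previously copied field on the host. The GPU call is a stub standing in for CUDA kernel launches, and the names are illustrative.

      void gpu_advance_step() { /* launch CUDA kernels for the next sub-step */ }

      // Host-side statistics over a snapshot copied from the device earlier.
      double cpu_mean(const double* field, long n) {
          double s = 0.0;
          for (long i = 0; i < n; ++i) s += field[i];
          return s / (double)n;
      }

      void overlapped_step(const double* host_snapshot, long n, double& mean) {
          #pragma omp parallel sections num_threads(2)
          {
              #pragma omp section
              gpu_advance_step();                    // device work
              #pragma omp section
              mean = cpu_mean(host_snapshot, n);     // concurrent host work
          }
      }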

  9. A Parallel Vector Machine for the PM Programming Language

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2016-04-01

    PM is a new programming language which aims to make the writing of computational geoscience models on parallel hardware accessible to scientists who are not themselves expert parallel programmers. It is based around the concept of communicating operators: language constructs that enable variables local to a single invocation of a parallelised loop to be viewed as if they were arrays spanning the entire loop domain. This mechanism enables different loop invocations (which may or may not be executing on different processors) to exchange information in a manner that extends the successful Communicating Sequential Processes idiom from single messages to collective communication. Communicating operators avoid the additional synchronisation mechanisms, such as atomic variables, required when programming using the Partitioned Global Address Space (PGAS) paradigm. Using a single loop invocation as the fundamental unit of concurrency enables PM to uniformly represent different levels of parallelism from vector operations through shared memory systems to distributed grids. This paper describes an implementation of PM based on a vectorised virtual machine. On a single processor node, concurrent operations are implemented using masked vector operations. Virtual machine instructions operate on vectors of values and may be unmasked, masked using a Boolean field, or masked using an array of active vector cell locations. Conditional structures (such as if-then-else or while statement implementations) calculate and apply masks to the operations they control. A shift in mask representation from Boolean to location-list occurs when active locations become sufficiently sparse. Parallel loops unfold data structures (or vectors of data structures for nested loops) into vectors of values that may additionally be distributed over multiple computational nodes and then split into micro-threads compatible with the size of the local cache. Inter-node communication is accomplished using standard OpenMP and MPI. Performance analyses of the PM vector machine, demonstrating its scaling properties with respect to domain size and the number of processor nodes, will be presented for a range of hardware configurations. The PM software and language definition are being made available under unrestrictive MIT and Creative Commons Attribution licenses respectively: www.pm-lang.org.
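
    An illustration (plain C++, not PM itself) of the two mask representations described above: a Boolean field tested on every lane, versus a list of active locations used once active cells become sufficiently sparse.

      #include <cstddef>
      #include <vector>

      // Boolean-mask form: every lane tests the mask.
      void add_masked_bool(std::vector<double>& a, const std::vector<double>& b,
                           const std::vector<bool>& mask) {
          for (std::size_t i = 0; i < a.size(); ++i)
              if (mask[i]) a[i] += b[i];
      }

      // Location-list form: only active cells are touched, cheaper when sparse.
      void add_masked_list(std::vector<double>& a, const std::vector<double>& b,
                           const std::vector<std::size_t>& active) {
          for (std::size_t k = 0; k < active.size(); ++k)
              a[active[k]] += b[active[k]];
      }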

  10. Tycho 2: A Proxy Application for Kinetic Transport Sweeps

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Garrett, Charles Kristopher; Warsa, James S.

    2016-09-14

    Tycho 2 is a proxy application that implements discrete ordinates (S_N) kinetic transport sweeps on unstructured, 3D, tetrahedral meshes. It has been designed to be small and require minimal dependencies to make collaboration and experimentation as easy as possible. Tycho 2 has been released as open source software. The software is currently in a beta release with plans for a stable release (version 1.0) before the end of the year. The code is parallelized via MPI across spatial cells and OpenMP across angles. Currently, several parallelization algorithms are implemented.
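
    The decomposition stated above (MPI across cells, OpenMP across angles) reduces, on each rank, to a threaded loop over independent ordinate directions. A minimal sketch with a stubbed sweep kernel; the names are illustrative, not Tycho 2's API.

      // Sweep the rank-local cells for one discrete-ordinate direction (stub).
      void sweep_cells_for_angle(int /*angle*/, int /*n_local_cells*/) {}

      // Each MPI rank owns a subset of cells; angles are threaded with OpenMP.
      void sweep_all_angles(int n_angles, int n_local_cells) {
          #pragma omp parallel for schedule(static)
          for (int a = 0; a < n_angles; ++a)
              sweep_cells_for_angle(a, n_local_cells);
      }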

  11. LFRic: Building a new Unified Model

    NASA Astrophysics Data System (ADS)

    Melvin, Thomas; Mullerworth, Steve; Ford, Rupert; Maynard, Chris; Hobson, Mike

    2017-04-01

    The LFRic project, named for Lewis Fry Richardson, aims to develop a replacement for the Met Office Unified Model in order to meet the challenges which will be presented by the next generation of exascale supercomputers. This project, a collaboration between the Met Office, STFC Daresbury and the University of Manchester, builds on the earlier GungHo project to redesign the dynamical core, in partnership with NERC. The new atmospheric model aims to retain the performance of the current ENDGame dynamical core and associated subgrid physics, while also enabling a far greater scalability and flexibility to accommodate future supercomputer architectures. Design of the model revolves around the principle of a 'separation of concerns', whereby the natural science aspects of the code can be developed without worrying about the underlying architecture, while machine-dependent optimisations can be carried out at a high level. These principles are put into practice through the development of an autogenerated Parallel Systems software layer (known as the PSy layer) using a domain-specific compiler called PSyclone. The prototype model includes a re-write of the dynamical core using a mixed finite element method, in which different function spaces are used to represent the various fields. It is able to run in parallel with MPI and OpenMP and has been tested on over 200,000 cores. In this talk, an overview of both the natural science and computational science implementations of the model will be presented.

  12. spMC: an R-package for 3D lithological reconstructions based on spatial Markov chains

    NASA Astrophysics Data System (ADS)

    Sartore, Luca; Fabbri, Paolo; Gaetan, Carlo

    2016-09-01

    The paper presents the spatial Markov Chains (spMC) R-package and a case study of subsoil simulation/prediction at a plain site in Northeastern Italy. spMC is a fairly complete collection of advanced methods for data inspection; in addition, it implements Markov chain models to estimate experimental transition probabilities of categorical lithological data. Simulation methods based on the most widely known prediction methods (such as indicator kriging and co-kriging) are also implemented in the spMC package, and other more advanced methods, e.g. path methods and Bayesian procedures that exploit maximum entropy, are available for simulations. Since the spMC package was developed for intensive geostatistical computations, part of the code is parallelized via OpenMP constructs. A final analysis of computational efficiency compares the simulation/prediction algorithms using different numbers of CPU cores, considering the example data set of the case study included in the package.

  13. Load Balancing Strategies for Multiphase Flows on Structured Grids

    NASA Astrophysics Data System (ADS)

    Olshefski, Kristopher; Owkes, Mark

    2017-11-01

    The computation time required to perform large simulations of complex systems is currently one of the leading bottlenecks of computational research. Parallelization allows multiple processing cores to perform calculations simultaneously and reduces computational times. However, load imbalances between processors waste computing resources as processors wait for others to complete imbalanced tasks. In multiphase flows, these imbalances arise due to the additional computational effort required at the gas-liquid interface. However, many current load balancing schemes are only designed for unstructured grid applications. The purpose of this research is to develop a load balancing strategy while maintaining the simplicity of a structured grid. Several approaches are investigated including brute force oversubscription, node oversubscription through Message Passing Interface (MPI) commands, and shared memory load balancing using OpenMP. Each of these strategies are tested with a simple one-dimensional model prior to implementation into the three-dimensional NGA code. Current results show load balancing will reduce computational time by at least 30%.

  14. A novel heterogeneous algorithm to simulate multiphase flow in porous media on multicore CPU-GPU systems

    NASA Astrophysics Data System (ADS)

    McClure, J. E.; Prins, J. F.; Miller, C. T.

    2014-07-01

    Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular "color" LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven-velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared-memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide a speedup of 2.6× compared to the multi-core CPU solution and 1.8× compared to the GPU-only solution, owing to the concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.

  15. Thread-level parallelization and optimization of NWChem for the Intel MIC architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shan, Hongzhang; Williams, Samuel; de Jong, Wibe

    In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient for attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65× better performance for the triples part of the CCSD(T), due in large part to the fact that the limited on-card memory restricts the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6× better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.
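
    The Fock-build difficulty alluded to above is that concurrent integral batches scatter updates into overlapping matrix elements. One common remedy, sketched here as an illustration rather than NWChem's actual code, is to make each accumulation atomic:

      // Scatter integral contributions into a shared Fock matrix F.
      // ij[k] is the flattened target index of contribution val[k].
      void scatter_fock(double* F, const int* ij, const double* val, long nints) {
          #pragma omp parallel for schedule(dynamic, 64)
          for (long k = 0; k < nints; ++k) {
              #pragma omp atomic
              F[ij[k]] += val[k];    // safe concurrent accumulation
          }
      }

    Atomics serialize only colliding updates; alternatives such as per-thread Fock replicas trade extra memory for less synchronization.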

  16. Use of general purpose graphics processing units with MODFLOW

    USGS Publications Warehouse

    Hughes, Joseph D.; White, Jeremy T.

    2013-01-01

    To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete, and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the CPU and the GPGPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.

  17. Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

    NASA Astrophysics Data System (ADS)

    Hause, Benjamin; Parker, Scott; Chen, Yang

    2013-10-01

    We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the OpenACC compiler directives and CUDA Fortran. Mixed implementation of both OpenACC and CUDA is demonstrated; CUDA is required for optimizing the particle deposition algorithm. We have implemented the GPU acceleration on a third-generation Core i7 gaming PC with two NVIDIA GTX 680 GPUs. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. We also see enormous speedups (10 or more) on the Titan supercomputer at Oak Ridge with Kepler K20 GPUs. Results show speedups comparable to or better than those of OpenMP models utilizing multiple cores. The use of hybrid OpenACC, CUDA Fortran, and MPI models across many nodes will also be discussed. Optimization strategies will be presented. We will discuss progress on optimizing the comprehensive three-dimensional general geometry GEM code.

  18. Optimization of computations for adjoint field and Jacobian needed in 3D CSEM inversion

    NASA Astrophysics Data System (ADS)

    Dehiya, Rahul; Singh, Arun; Gupta, Pravin K.; Israil, M.

    2017-01-01

    We present the features and results of a newly developed code, based on the Gauss-Newton optimization technique, for solving the three-dimensional Controlled-Source Electromagnetic inverse problem. In this code a special emphasis has been put on representing the operations by block matrices for conjugate gradient iteration. We show how, in the computation of the Jacobian, the matrix formed by differentiation of the system matrix can be made independent of frequency to optimize the operations at the conjugate gradient step. Coarse-level parallel computing, using the OpenMP framework, is used primarily due to its simplicity of implementation and the accessibility of shared-memory multi-core computing machines to almost anyone. We demonstrate how the coarseness of the modeling grid in comparison to the source (computational receiver) spacing can be exploited for efficient computing, without compromising the quality of the inverted model, by reducing the number of adjoint calls. It is also demonstrated that the adjoint field can even be computed on a grid coarser than the modeling grid without affecting the inversion outcome. These observations were reconfirmed using an experiment design in which the deviation of the source from a straight tow line is considered. Finally, a real field data inversion experiment is presented to demonstrate the robustness of the code.

  19. The Development of High-speed Full-function Storm Surge Model and the Case Study of 2013 Typhoon Haiyan

    NASA Astrophysics Data System (ADS)

    Tsai, Y. L.; Wu, T. R.; Lin, C. Y.; Chuang, M. H.; Lin, C. W.

    2016-02-01

    An ideal storm surge operational model should feature: 1. a large computational domain that covers the complete typhoon life cycle; 2. support for both parametric and atmospheric models; 3. the capability of calculating the inundation area for risk assessment; and 4. the inclusion of tides for accurate inundation simulation. A literature review shows that few operational models reach these goals with fast calculation, and most of the models have limited functions. In this paper, the well-developed COMCOT (Cornell Multi-grid Coupled Tsunami Model) is chosen as the kernel to establish a storm surge model which solves the nonlinear shallow water equations on both spherical and Cartesian coordinates directly. The complete evolution of a storm surge, including large-scale propagation and small-scale offshore run-up, can be simulated by a nested-grid scheme. The global tide model TPXO 7.2, established by Oregon State University, is coupled to provide astronomical boundary conditions. The atmospheric model WRF (Weather Research and Forecasting Model) is also coupled to provide meteorological fields. The high-efficiency thin-film method is adopted to evaluate the storm surge inundation. Our in-house model has been optimized with OpenMP (Open Multi-Processing), achieving performance 10 times faster than the original version, which makes it suitable as an early-warning storm surge model. In this study, a thorough simulation of 2013 Typhoon Haiyan is performed. The detailed results will be presented at the 2016 Ocean Sciences Meeting in terms of surge propagation and high-resolution inundation areas.

  20. HPC Profiling with the Sun Studio™ Performance Tools

    NASA Astrophysics Data System (ADS)

    Itzkowitz, Marty; Maruyama, Yukon

    In this paper, we describe how to use the Sun Studio Performance Tools to understand the nature and causes of application performance problems. We first explore CPU and memory performance problems for single-threaded applications, giving some simple examples. Then, we discuss multi-threaded performance issues, such as locking and false-sharing of cache lines, in each case showing how the tools can help. We go on to describe OpenMP applications and the support for them in the performance tools. Then we discuss MPI applications, and the techniques used to profile them. Finally, we present our conclusions.

  1. DOE Office of Scientific and Technical Information (OSTI.GOV)

    D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas

    The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in the ITER edge plasma on the largest US open-science computer, the Cray XK7 Titan, at its maximal heterogeneous capability. Such simulations had not been possible before due to a shortage of more than a factor of 10 in time-to-solution relative to what fits within 5 days of wall-clock time for one physics case. Frontier techniques employed include nested OpenMP parallelism, adaptive parallel I/O, staging I/O, data reduction using dynamic and asynchronous application interactions, and dynamic repartitioning.
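
    Nested OpenMP parallelism, as named above, means opening a parallel region inside another. A minimal, self-contained illustration (unrelated to XGC's actual loop structure):

      #include <cstdio>
      #include <omp.h>

      int main() {
          omp_set_max_active_levels(2);            // enable two levels of teams
          #pragma omp parallel num_threads(4)      // outer team
          {
              const int outer = omp_get_thread_num();
              #pragma omp parallel num_threads(2)  // inner team per outer thread
              {
                  std::printf("outer %d / inner %d\n", outer, omp_get_thread_num());
              }
          }
          return 0;
      }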

  2. Analysis and optimization of gyrokinetic toroidal simulations on homogenous and heterogenous platforms

    DOE PAGES

    Ibrahim, Khaled Z.; Madduri, Kamesh; Williams, Samuel; ...

    2013-07-18

    The Gyrokinetic Toroidal Code (GTC) uses the particle-in-cell method to efficiently simulate plasma microturbulence. This paper presents novel analysis and optimization techniques to enhance the performance of GTC on large-scale machines. We introduce cell access analysis to better manage locality versus synchronization tradeoffs on CPU- and GPU-based architectures. Our optimized hybrid parallel implementation of GTC uses MPI, OpenMP, and NVIDIA CUDA; it achieves up to a 2× speedup over the reference Fortran version on multiple parallel systems and scales efficiently to tens of thousands of cores.

  3. Checkpointing Shared Memory Programs at the Application-level

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bronevetsky, G; Schulz, M; Szwed, P

    2004-09-08

    Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. The system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
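
    A conceptual sketch, not the paper's actual protocol, of barrier-coordinated application-level checkpointing in OpenMP: all threads reach a consistent point, each saves its private state, and a second barrier keeps any thread from racing ahead before the checkpoint completes.

      #include <cstdio>
      #include <omp.h>

      // Placeholder for serializing one thread's private state to disk.
      void save_thread_state(int tid, int phase) {
          std::printf("checkpoint: thread %d after phase %d\n", tid, phase);
      }

      void compute_with_checkpoints(int n_phases) {
          #pragma omp parallel
          {
              const int tid = omp_get_thread_num();
              for (int p = 0; p < n_phases; ++p) {
                  /* ... phase p of the parallel computation ... */
                  #pragma omp barrier          // globally consistent point
                  save_thread_state(tid, p);
                  #pragma omp barrier          // resume only when all are saved
              }
          }
      }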

  4. The VOLSCAT package for electron and positron scattering of molecular targets: A new high throughput approach to cross-section and resonances computation

    NASA Astrophysics Data System (ADS)

    Sanna, N.; Baccarelli, I.; Morelli, G.

    2009-12-01

    VOLSCAT is a computer program which implements the Single Center Expansion (SCE) method to solve the scattering equation for the elastic collision of electrons/positrons off molecular targets. The scattering potential needed is calculated by on-the-fly calls to the external SCELib library for molecular properties, recently ported to GPU computing environments and ClearSpeed platforms, and made available by means of an Application Program Interface (SCELib-API) which is also provided with the VOLSCAT package in a beta version. The result is a high throughput approach to the solution of the complex electron/positron-molecule scattering problem, which allows for intensive calculations both for the number of systems which can be studied and for their size. Accurate partial and total elastic cross sections are produced in output together with the associated eigenphase sums. Indirect scattering processes arising from the formation of temporary negative ions can also be analyzed through the computation of the resonances' parameters. Program summary: Program title: VOLSCAT V1.0 Catalogue identifier: AEEW_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEW_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 4 618 353 No. of bytes in distributed program, including test data, etc.: 120 307 536 Distribution format: tar.gz Programming language: Fortran90 Computer: All SMP platforms based on AIX, Linux and SUNOS operating systems over SPARC, POWER, Intel Itanium2, X86, em64t and Opteron processors Operating system: SUNOS, IBM AIX, Linux RedHat (Enterprise), Linux SuSE (SLES) Has the code been vectorized or parallelized?: Yes. The parallel version in the present release of the code is limited to the OpenMP calculation of the exchange potential. The number of OpenMP threads can then be set in the input script. RAM: For a typical (isolated) biomolecule (e.g. Cytosine or Ribose) a converged calculation would require from 320 MB up to 2.5 GB. Word size: 64 bits Classification: 16.5 External routines: LAPACK (dsyev, dgetri, dgetrf) (http://www.netlib.org/lapack/) Nature of problem: In this set of codes an efficient procedure is implemented to calculate partial cross sections for the scattering between an electron/positron and a molecular target as a function of the collision energies. Solution method: The scattering equations are derived in the framework of the Single Center Expansion (SCE) procedure which allows the reduction of the original three-dimensional problem to a radial (one-dimensional) equation through the expansion of the scattering potential and the system wavefunction in a set of symmetry-adapted (real) spherical harmonics. The local part of the electrostatic interaction between the charged projectile (electron/positron) and the molecular target is provided in input by the SCELib library, which also provides the correlation and polarization corrections for the short-range and long-range part, respectively, of the interaction. A proper Application Programming Interface (API) is used by VOLSCAT to load the energy-independent part of the potential while the non-local exchange contribution is approximated by a local form and calculated on the fly in the VOLSCAT run for each desired collision energy.
The resulting SCE one-dimensional homogeneous scattering equation is rewritten in an integral form by means of the standard Green's function technique resulting in a set of Volterra coupled equations which are solved to give the phase shifts and cross sections for any desired impact energy in terms of the partial components defined by the irreducible representations of the symmetry point group to which the target molecule belongs. The total cross section can then be straightforwardly calculated by summing over all the partial cross sections produced in the output. By the Breit-Wigner analysis of the eigenphase sum produced as a function of the energy one can also get information on the location of possible resonance states arising in the collision process. Restrictions: Depending on the molecular system under study and on the operating conditions the program may or may not fit into available RAM memory. Additional comments: A beta version of SCELib-API is included in the distribution package. Running time: The execution time strongly depends on the molecular target description and on the hardware/OS chosen, it is directly proportional to the (r,θ,φ) grid size and to the number of angular basis functions used.

  5. Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.

    PubMed

    Ruymgaart, A Peter; Elber, Ron

    2012-11-13

    We report Graphics Processing Unit (GPU) and OpenMP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing environment in which a CPU with a few cores is attached to a GPU. We discuss the design of the code in detail and we illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculation of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is a four-fold improvement over the factor of 10 reported in our initial GPU implementation that did not include water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints on all bonds, runs in parallel on multiple OpenMP cores or entirely on the GPU. It is based on a Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of the Particle Mesh Ewald (PME).

  6. Evaluation of the Intel Xeon Phi Co-processor to accelerate the sensitivity map calculation for PET imaging

    NASA Astrophysics Data System (ADS)

    Dey, T.; Rodrigue, P.

    2015-07-01

    We aim to evaluate the Intel Xeon Phi coprocessor for acceleration of 3D Positron Emission Tomography (PET) image reconstruction. We focus on the sensitivity map calculation as one computationally intensive part of PET image reconstruction, since it is a promising candidate for acceleration with the Many Integrated Core (MIC) architecture of the Xeon Phi. The computation of the voxels in the field of view (FoV) can be done in parallel, and the 10³ to 10⁴ samples needed to calculate the detection probability of each voxel can take advantage of vectorization. We use the ray tracing kernels of the Embree project to calculate the hit points of the sample rays with the detector, and in a second step the sum of the radiological path, taking attenuation into account, is determined. The core components are implemented using the Intel single instruction multiple data compiler (ISPC) to enable a portable implementation showing efficient vectorization on both the Xeon Phi and the host platform. On the Xeon Phi, the calculation of the radiological path is also implemented in hardware-specific intrinsic instructions (so-called `intrinsics') to allow manually optimized vectorization. For parallelization, both OpenMP and ISPC tasking (based on pthreads) are evaluated. Our implementation achieved a scalability factor of 0.90 on the Xeon Phi coprocessor (model 5110P) with 60 cores at 1 GHz. Only minor differences were found between parallelization with OpenMP and the ISPC tasking feature. The implementation using intrinsics was found to be about 12% faster than the portable ISPC version. With this version, a speedup of 1.43 was achieved on the Xeon Phi coprocessor compared to the host system (HP SL250s Gen8) equipped with two Xeon (E5-2670) CPUs, with 8 cores at 2.6 to 3.3 GHz each. Using a second Xeon Phi card, the speedup could be further increased to 2.77. No significant differences were found between the results of the different Xeon Phi and host implementations. The examination showed that a reasonable speedup of the sensitivity map calculation can be achieved on the Xeon Phi by either a portable or a hardware-specific implementation.

  7. Axially deformed solution of the Skyrme-Hartree-Fock-Bogoliubov equations using the transformed harmonic oscillator basis (II) HFBTHO v2.00d: A new version of the program

    NASA Astrophysics Data System (ADS)

    Stoitsov, M. V.; Schunck, N.; Kortelainen, M.; Michel, N.; Nam, H.; Olsen, E.; Sarich, J.; Wild, S.

    2013-06-01

    We describe the new version 2.00d of the code HFBTHO that solves the nuclear Skyrme-Hartree-Fock (HF) or Skyrme-Hartree-Fock-Bogoliubov (HFB) problem by using the cylindrical transformed deformed harmonic oscillator basis. In the new version, we have implemented the following features: (i) the modified Broyden method for non-linear problems, (ii) optional breaking of reflection symmetry, (iii) calculation of axial multipole moments, (iv) finite temperature formalism for the HFB method, (v) linear constraint method based on the approximation of the Random Phase Approximation (RPA) matrix for multi-constraint calculations, (vi) blocking of quasi-particles in the Equal Filling Approximation (EFA), (vii) framework for generalized energy density with arbitrary density-dependences, and (viii) shared memory parallelism via OpenMP pragmas. Program summary: Program title: HFBTHO v2.00d Catalog identifier: ADUI_v2_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADUI_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU General Public License version 3 No. of lines in distributed program, including test data, etc.: 167228 No. of bytes in distributed program, including test data, etc.: 2672156 Distribution format: tar.gz Programming language: FORTRAN-95. Computer: Intel Pentium-III, Intel Xeon, AMD-Athlon, AMD-Opteron, Cray XT5, Cray XE6. Operating system: UNIX, LINUX, WindowsXP. RAM: 200 Mwords Word size: 8 bits Classification: 17.22. Does the new version supersede the previous version?: Yes Catalog identifier of previous version: ADUI_v1_0 Journal reference of previous version: Comput. Phys. Comm. 167 (2005) 43 Nature of problem: The solution of self-consistent mean-field equations for weakly-bound paired nuclei requires a correct description of the asymptotic properties of nuclear quasi-particle wave functions. In the present implementation, this is achieved by using the single-particle wave functions of the transformed harmonic oscillator, which allows for an accurate description of deformation effects and pairing correlations in nuclei arbitrarily close to the particle drip lines. Solution method: The program uses the axial Transformed Harmonic Oscillator (THO) single-particle basis to expand quasi-particle wave functions. It iteratively diagonalizes the Hartree-Fock-Bogoliubov Hamiltonian based on generalized Skyrme-like energy densities and zero-range pairing interactions until a self-consistent solution is found. A previous version of the program was presented in: M.V. Stoitsov, J. Dobaczewski, W. Nazarewicz, P. Ring, Comput. Phys. Commun. 167 (2005) 43-63. Reasons for new version: Version 2.00d of HFBTHO provides a number of new options such as the optional breaking of reflection symmetry, the calculation of axial multipole moments, the finite temperature formalism for the HFB method, optimized multi-constraint calculations, the treatment of odd-even and odd-odd nuclei in the blocking approximation, and the framework for generalized energy density with arbitrary density-dependences. It is also the first version of HFBTHO to contain threading capabilities.
Summary of revisions: the modified Broyden method has been implemented; optional breaking of reflection symmetry has been implemented; the calculation of all axial multipole moments up to λ=8 has been implemented; the finite temperature formalism for the HFB method has been implemented; the linear constraint method based on the approximation of the Random Phase Approximation (RPA) matrix for multi-constraint calculations has been implemented; the blocking of quasi-particles in the Equal Filling Approximation (EFA) has been implemented; the framework for generalized energy density functionals with arbitrary density-dependence has been implemented; and shared memory parallelism via OpenMP pragmas has been implemented. Restrictions: Axial and time-reversal symmetries are assumed. Unusual features: The user must have access to the LAPACK subroutines DSYEVD, DSYTRF and DSYTRI, and their dependences, which compute eigenvalues and eigenfunctions of real symmetric matrices, the LAPACK subroutines DGETRI and DGETRF, which invert arbitrary real matrices, and the BLAS routines DCOPY, DSCAL, DGEMM and DGEMV for double-precision linear algebra (or provide another set of subroutines that can perform such tasks). The BLAS and LAPACK subroutines can be obtained from the Netlib Repository at the University of Tennessee, Knoxville: http://netlib2.cs.utk.edu/. Running time: Highly variable, as it depends on the nucleus, size of the basis, requested accuracy, requested configuration, compiler and libraries, and hardware architecture. An order of magnitude would be a few seconds for ground-state configurations in small bases N≈8-12, to a few minutes in very deformed configurations of a heavy nucleus with a large basis N>20.

  8. Code Modernization of VPIC

    NASA Astrophysics Data System (ADS)

    Bird, Robert; Nystrom, David; Albright, Brian

    2017-10-01

    The ability of scientific simulations to effectively deliver performant computation is increasingly being challenged by successive generations of high-performance computing architectures. Code development to support efficient computation on these modern architectures is both expensive and highly complex; if it is approached without due care, it may also not be directly transferable between subsequent hardware generations. Previous works have discussed techniques to support the process of adapting a legacy code for modern hardware generations, but despite the breakthroughs in the areas of mini-app development, performance portability, and cache-oblivious algorithms the problem still remains largely unsolved. In this work we demonstrate how a focus on platform-agnostic modern code development can be applied to Particle-in-Cell (PIC) simulations to facilitate effective scientific delivery. This work builds directly on our previous work optimizing VPIC, in which we replaced intrinsics-based vectorization with compiler-generated auto-vectorization to improve the performance and portability of VPIC. In this work we present the use of a specialized SIMD queue for processing some particle operations, and also preview a GPU-capable OpenMP variant of VPIC. Finally, we include lessons learned. Work performed under the auspices of the U.S. Dept. of Energy by the Los Alamos National Security, LLC, Los Alamos National Laboratory under contract DE-AC52-06NA25396 and supported by the LANL LDRD program.
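
    The replacement of hand-written intrinsics with compiler-generated vectorization, as described above, usually amounts to expressing the hot loop portably and letting a hint request SIMD code generation. A minimal sketch; the field-update body is a stand-in, not VPIC's actual particle push:

      // Portable SIMD hint instead of architecture-specific intrinsics.
      void update_field(float* __restrict e, const float* __restrict j,
                        float dt_over_eps, int n) {
          #pragma omp simd
          for (int i = 0; i < n; ++i)
              e[i] -= dt_over_eps * j[i];   // compiler emits vector instructions
      }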

  9. Global magnetohydrodynamic simulations on multiple GPUs

    NASA Astrophysics Data System (ADS)

    Wong, Un-Hong; Wong, Hon-Cheng; Ma, Yonghui

    2014-01-01

    Global magnetohydrodynamic (MHD) models play the major role in investigating the solar wind-magnetosphere interaction. However, the huge computational requirement of global MHD simulations is also the main problem that needs to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations in a more efficient manner. In this paper, we present a global MHD simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, which is a combination of the leapfrog scheme and the two-step Lax-Wendroff scheme. GPUDirect 2.0 is used in our implementation to drive multiple GPUs. All data transfers and kernel processing are managed with the CUDA 4.0 API instead of MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards. These measurements show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Messer, Bronson; Harris, James A; Parete-Koon, Suzanne T

    We describe recent development work on the core-collapse supernova code CHIMERA. CHIMERA has consumed more than 100 million CPU-hours on Oak Ridge Leadership Computing Facility (OLCF) platforms in the past 3 years, ranking it among the most important applications at the OLCF. Most of the work described has been focused on exploiting the multicore nature of the current platform (Jaguar) via, e.g., multithreading using OpenMP. In addition, we have begun a major effort to marshal the computational power of GPUs with CHIMERA. The impending upgrade of Jaguar to Titan, a 20+ PF machine with an NVIDIA GPU on many nodes, makes this work essential.

  11. Performance Portability Strategies for Grid C++ Expression Templates

    NASA Astrophysics Data System (ADS)

    Boyle, Peter A.; Clark, M. A.; DeTar, Carleton; Lin, Meifeng; Rana, Verinder; Vaquero Avilés-Casco, Alejandro

    2018-03-01

    One of the key requirements for the Lattice QCD Application Development as part of the US Exascale Computing Project is performance portability across multiple architectures. Using the Grid C++ expression template as a starting point, we report on the progress made with regard to the Grid GPU offloading strategies. We present both the successes and the issues encountered in using CUDA, OpenACC and Just-In-Time compilation. Experimentation and performance on GPUs with a SU(3)×SU(3) streaming test will be reported. We will also report on the challenges of using the current OpenMP 4.x for GPU offloading in the same code.
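
    For reference, the OpenMP 4.x offloading style mentioned above maps data to the device and distributes a loop across GPU teams. A minimal sketch of the directive pattern on a streaming axpy kernel; Grid's actual expression-template kernels are considerably more elaborate:

      #include <vector>

      // OpenMP 4.x target offload of y += a*x (requires an offload-capable compiler).
      void axpy_offload(double a, const std::vector<double>& x, std::vector<double>& y) {
          const double* xp = x.data();
          double* yp = y.data();
          const long n = (long)x.size();
          #pragma omp target teams distribute parallel for \
                  map(to: xp[0:n]) map(tofrom: yp[0:n])
          for (long i = 0; i < n; ++i)
              yp[i] += a * xp[i];
      }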

  12. 3D-PDR: Three-dimensional photodissociation region code

    NASA Astrophysics Data System (ADS)

    Bisbas, T. G.; Bell, T. A.; Viti, S.; Yates, J.; Barlow, M. J.

    2018-03-01

    3D-PDR is a three-dimensional photodissociation region code written in Fortran. It uses the Sundials package (written in C) to solve the set of ordinary differential equations and it is the successor of the one-dimensional PDR code UCL_PDR (ascl:1303.004). Using the HEALpix ray-tracing scheme (ascl:1107.018), 3D-PDR solves a three-dimensional escape probability routine and evaluates the attenuation of the far-ultraviolet radiation in the PDR and the propagation of FIR/submm emission lines out of the PDR. The code is parallelized (OpenMP) and can be applied to 1D and 3D problems.

  13. XaNSoNS: GPU-accelerated simulator of diffraction patterns of nanoparticles

    NASA Astrophysics Data System (ADS)

    Neverov, V. S.

    XaNSoNS is open-source software with GPU support, which simulates X-ray and neutron 1D (or 2D) diffraction patterns and pair-distribution functions (PDF) for amorphous or crystalline nanoparticles (up to ∼10⁷ atoms) of heterogeneous structural content. Among the multiple parameters of the structure the user may specify atomic displacements, site occupancies, molecular displacements and molecular rotations. The software uses general equations nonspecific to crystalline structures to calculate the scattering intensity. It supports four major standards of parallel computing: MPI, OpenMP, Nvidia CUDA and OpenCL, enabling it to run on various architectures, from CPU-based HPCs to consumer-level GPUs.
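
    The "general equations nonspecific to crystalline structures" are of the Debye type, I(q) = Σᵢ Σⱼ fᵢ fⱼ sin(q·rᵢⱼ)/(q·rᵢⱼ). A minimal OpenMP version for identical scatterers (all form factors set to 1) is sketched below to illustrate the O(N²) structure, not XaNSoNS's optimized kernels:

      #include <cmath>
      #include <vector>

      // Debye-equation intensity at momentum transfer q for N identical atoms.
      double debye_intensity(const std::vector<double>& x,
                             const std::vector<double>& y,
                             const std::vector<double>& z, double q) {
          const long n = (long)x.size();
          double I = 0.0;
          #pragma omp parallel for reduction(+:I) schedule(dynamic)
          for (long i = 0; i < n; ++i)
              for (long j = 0; j < n; ++j) {
                  const double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
                  const double r = std::sqrt(dx*dx + dy*dy + dz*dz);
                  I += (r > 0.0) ? std::sin(q * r) / (q * r) : 1.0;  // sinc(0) = 1
              }
          return I;
      }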

  14. Research on bathymetry estimation by Worldview-2 based with the semi-analytical model

    NASA Astrophysics Data System (ADS)

    Sheng, L.; Bai, J.; Zhou, G.-W.; Zhao, Y.; Li, Y.-C.

    2015-04-01

    The South Sea Islands of China are far from the mainland; reefs take up more than 95% of the South Sea, and most reefs are scattered over disputed, sensitive areas of interest. Methods for obtaining reef bathymetry accurately are therefore urgently needed. Commonly used methods, including sonar, airborne laser, and remote sensing estimation, are limited by the long distances, large areas, and sensitive locations involved. Remote sensing data provide an effective way to estimate bathymetry over large areas without physical contact, through the relationship between spectral information and water depth. Aimed at the water quality of the South Sea of China, this paper develops a bathymetry estimation method that requires no measured water depth. First, a semi-analytical optimization model of the theoretical interpretation models is studied, with a genetic algorithm used to optimize the model. Meanwhile, an OpenMP parallel computing algorithm is introduced to greatly increase the speed of the semi-analytical optimization model. One island in the South Sea of China is selected as the study area, and measured water depths are used to evaluate the accuracy of the bathymetry estimated from Worldview-2 multispectral images. The results show that the semi-analytical optimization model based on the genetic algorithm performs well in the study area, and the accuracy of the estimated bathymetry in the 0-20 m shallow-water area is acceptable. The semi-analytical optimization model based on a genetic algorithm thus solves the problem of bathymetry estimation without water-depth measurements. Overall, this paper provides a new bathymetry estimation method for sensitive reefs far from the mainland.

  15. A smooth particle hydrodynamics code to model collisions between solid, self-gravitating objects

    NASA Astrophysics Data System (ADS)

    Schäfer, C.; Riecker, S.; Maindl, T. I.; Speith, R.; Scherrer, S.; Kley, W.

    2016-05-01

    Context. Modern graphics processing units (GPUs) lead to a major increase in the performance of the computation of astrophysical simulations. Owing to the different nature of GPU architecture compared to traditional central processing units (CPUs) such as x86 architecture, existing numerical codes cannot be easily migrated to run on GPU. Here, we present a new implementation of the numerical method smooth particle hydrodynamics (SPH) using CUDA and the first astrophysical application of the new code: the collision between Ceres-sized objects. Aims: The new code allows for a tremendous increase in speed of astrophysical simulations with SPH and self-gravity at low costs for new hardware. Methods: We have implemented the SPH equations to model gas, liquids and elastic, and plastic solid bodies and added a fragmentation model for brittle materials. Self-gravity may be optionally included in the simulations and is treated by the use of a Barnes-Hut tree. Results: We find an impressive performance gain using NVIDIA consumer devices compared to our existing OpenMP code. The new code is freely available to the community upon request. If you are interested in our CUDA SPH code miluphCUDA, please write an email to Christoph Schäfer. miluphCUDA is the CUDA port of miluph. miluph is pronounced [maɪlʌv]. We do not support the use of the code for military purposes.

  16. Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

    NASA Astrophysics Data System (ADS)

    Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua

    2014-12-01

    Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite-difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×; meanwhile, the collaborative approach improves performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, these are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
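
    As a rough illustration of the MPI + OpenMP layers of such a tri-level model (the CUDA layer is omitted, and the block and loop sizes below are invented for the sketch rather than taken from HOSTA), ranks can take grid blocks round-robin while threads share the cells of each block:

        #include <mpi.h>
        #include <omp.h>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const int nblocks = 64, ncells = 100000;       // illustrative sizes
            for (int b = rank; b < nblocks; b += size) {   // MPI over grid blocks
                // Threads share one block's cells; a collaborative code would
                // additionally offload a slice of this loop to the GPU via CUDA.
                #pragma omp parallel for
                for (int cell = 0; cell < ncells; ++cell) {
                    // apply the high-order stencil to this cell (omitted)
                }
            }
            // ghost-cell exchange between neighboring blocks would follow (MPI)
            MPI_Finalize();
            return 0;
        }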

  17. Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Xu, Chuanfu, E-mail: xuchuanfu@nudt.edu.cn; Deng, Xiaogang; Zhang, Lilun

    Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite-difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×; meanwhile, the collaborative approach improves performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, these are the largest-scale CPU–GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.

  18. Pareto Joint Inversion of Love and Quasi Rayleigh's waves - synthetic study

    NASA Astrophysics Data System (ADS)

    Bogacz, Adrian; Dalton, David; Danek, Tomasz; Miernik, Katarzyna; Slawinski, Michael A.

    2017-04-01

    In this contribution, a specific application of Pareto joint inversion to a geophysical problem is presented. The Pareto criterion, combined with particle swarm optimization, is used to solve geophysical inverse problems for Love and quasi-Rayleigh waves. The basic theory of the forward-problem calculation for the chosen surface waves is described. To avoid computational problems, some simplifications were made; this allowed faster and more straightforward calculation without loss of generality of the solution. Under the restrictions of the solution scheme, the considered model must have exactly two layers: an elastic isotropic surface layer and an elastic isotropic half-space of infinite thickness. The aim of the inversion is to obtain the elastic parameters and model geometry from dispersion data. Different cases were considered in the calculations, such as different numbers of modes for different wave types and different frequencies. The developed solvers use the OpenMP standard for parallel computing, which helps reduce computation times. The results of the experimental computations are presented and discussed. This research was performed in the context of The Geomechanics Project supported by Husky Energy. This research was also partially supported by the Natural Sciences and Engineering Research Council of Canada, grant 238416-2013, and by the Polish National Science Center under contract No. DEC-2013/11/B/ST10/0472.
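
    The abstract shows no code, so the following is an illustrative C++/OpenMP sketch of one particle-swarm iteration of the kind described, reduced to a single objective for brevity (a Pareto variant would track a non-dominated set); the misfit function and the fixed 0.5 coefficients standing in for random draws are assumptions of this sketch.

        #include <omp.h>
        #include <vector>

        struct Particle { double x[4], v[4], best[4], bestCost; };

        // Placeholder misfit; the real objective compares predicted and observed
        // dispersion curves for each wave type and mode.
        static double misfit(const double* m) {
            double s = 0.0;
            for (int k = 0; k < 4; ++k) s += m[k] * m[k];
            return s;
        }

        int main() {
            std::vector<Particle> swarm(256, Particle{{1,1,1,1},{0,0,0,0},{1,1,1,1},1e30});
            double gbest[4] = {0, 0, 0, 0};
            const double w = 0.7, c1 = 1.5, c2 = 1.5;

            for (int it = 0; it < 100; ++it) {
                // Each particle's velocity update and misfit evaluation is
                // independent, so one OpenMP loop covers the expensive part.
                #pragma omp parallel for
                for (int i = 0; i < static_cast<int>(swarm.size()); ++i) {
                    Particle& p = swarm[i];
                    for (int k = 0; k < 4; ++k) {
                        p.v[k] = w * p.v[k] + c1 * 0.5 * (p.best[k] - p.x[k])
                                            + c2 * 0.5 * (gbest[k] - p.x[k]);
                        p.x[k] += p.v[k];
                    }
                    double c = misfit(p.x);
                    if (c < p.bestCost) {
                        p.bestCost = c;
                        for (int k = 0; k < 4; ++k) p.best[k] = p.x[k];
                    }
                }
                // Global-best update (serial; cheap relative to the evaluations).
                for (const Particle& p : swarm)
                    if (p.bestCost < misfit(gbest))
                        for (int k = 0; k < 4; ++k) gbest[k] = p.best[k];
            }
            return 0;
        }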

  19. Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

    PubMed

    Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

    2010-10-01

    Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel Core (TM) 2 Quad Q6600 CPU and a Geforce 8800GT GPU, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setup (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setup (a), 16.8 in setup (b), and 20.0 in setup (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
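
    A minimal sketch of splitting one simulation step between a GPU and the remaining CPU cores, in the spirit of setup (c): the split fraction here is fixed, whereas the paper's load-prediction scheduler would adjust it every step, and the GPU function is only a stand-in for a CUDA kernel launch.

        #include <omp.h>
        #include <vector>

        static std::vector<double> u;                   // shared state for the sketch

        static void gpu_compute(int first, int last) {  // stand-in for a CUDA launch
            for (int i = first; i < last; ++i) u[i] += 1.0;
        }

        static void cpu_compute(int first, int last) {
            #pragma omp parallel for                    // nested team for the CPU share
            for (int i = first; i < last; ++i) u[i] += 1.0;
        }

        int main() {
            u.assign(1 << 20, 0.0);
            omp_set_max_active_levels(2);               // allow the nested CPU team
            double gpu_fraction = 0.8;                  // load prediction would update this
            int split = static_cast<int>(u.size() * gpu_fraction);

            #pragma omp parallel sections
            {
                #pragma omp section
                gpu_compute(0, split);                  // one thread drives the GPU
                #pragma omp section
                cpu_compute(split, static_cast<int>(u.size()));
            }
            return 0;
        }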

  20. Fast 2D FWI on a multi and many-cores workstation.

    NASA Astrophysics Data System (ADS)

    Thierry, Philippe; Donno, Daniela; Noble, Mark

    2014-05-01

    Following the introduction of x86 co-processors (Xeon Phi) and the performance increase of standard 2-socket workstations using the latest 12-core E5-v2 x86-64 CPUs, we present here an MPI + OpenMP implementation of an acoustic 2D FWI (full waveform inversion) code which simultaneously runs on the CPUs and on the co-processors installed in a workstation. The main advantage of running a 2D FWI on a workstation is to be able to quickly evaluate new features such as more complicated wave equations, new cost functions, finite-difference stencils or boundary conditions. Since the co-processor is made of 61 in-order x86 cores, each of them supporting up to 4 threads, this many-core device can be seen as a shared memory SMP (symmetric multiprocessing) machine with its own IP address. Depending on the vendor, a single workstation can handle several co-processors, making the workstation a personal cluster under the desk. The original Fortran 90 CPU version of the 2D FWI code is just recompiled to get a Xeon Phi x86 binary. This multi- and many-core configuration uses standard compilers and associated MPI as well as math libraries under Linux; therefore, the cost of code development remains constant, while improving computation time. We choose to implement the code with the so-called symmetric mode to fully use the capacity of the workstation, but we also evaluate the scalability of the code in native mode (i.e., running only on the co-processor) thanks to the Linux ssh and NFS capabilities. The usual optimization and SIMD vectorization care is applied to ensure optimal performance, and the application's performance and bottlenecks are analyzed on both platforms. The 2D FWI implementation uses finite-difference time-domain forward modeling and a quasi-Newton (L-BFGS) optimization scheme for the model parameter updates. Parallelization is achieved through standard MPI shot-gather distribution and OpenMP for domain decomposition within the co-processor. Taking advantage of the 16 GB of memory available on the co-processor, we are able to keep wavefields in memory to achieve the gradient computation by cross-correlation of forward and back-propagated wavefields needed by our time-domain FWI scheme, without heavy traffic on the I/O subsystem and PCIe bus. In this presentation we also review some simple methodologies for comparing expected with measured performance, in order to estimate the optimization effort before starting any major modification or rewrite of research codes. The key message is the ease of use and development of this hybrid configuration, which aims not at the absolute peak performance but at the optimum that best balances geophysical and computational development effort.
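
    Schematically (our sketch, not the authors' Fortran code; the grid sizes and the 0.25 stencil factor are placeholders), the parallelization described above distributes shots over MPI ranks while OpenMP threads and SIMD lanes share the 2D finite-difference update:

        #include <mpi.h>
        #include <omp.h>
        #include <vector>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const int nshots = 200, nz = 500, nx = 500, nt = 1000;
            for (int shot = rank; shot < nshots; shot += size) {   // MPI over shots
                std::vector<double> p(nz * nx, 0.0), pold(nz * nx, 0.0);
                for (int it = 0; it < nt; ++it) {                  // time stepping
                    #pragma omp parallel for                       // threads over rows
                    for (int iz = 1; iz < nz - 1; ++iz) {
                        #pragma omp simd                           // vector lanes over columns
                        for (int ix = 1; ix < nx - 1; ++ix) {
                            int i = iz * nx + ix;
                            // Leapfrog update reusing pold as the t-dt field; 0.25
                            // stands in for (v*dt/dx)^2, and the source is omitted.
                            pold[i] = 2.0 * p[i] - pold[i]
                                    + 0.25 * (p[i-1] + p[i+1] + p[i-nx] + p[i+nx] - 4.0 * p[i]);
                        }
                    }
                    p.swap(pold);
                }
            }
            MPI_Finalize();
            return 0;
        }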

  1. KITTEN Lightweight Kernel 0.1 Beta

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pedretti, Kevin; Levenhagen, Michael; Kelly, Suzanne

    2007-12-12

    The Kitten Lightweight Kernel is a simplified OS (operating system) kernel that is intended to manage a compute node's hardware resources. It provides a set of mechanisms to user-level applications for utilizing hardware resources (e.g., allocating memory, creating processes, accessing the network). Kitten is much simpler than general-purpose OS kernels, such as Linux or Windows, but includes all of the essential functionality needed to support HPC (high-performance computing) MPI, PGAS, and OpenMP applications. Kitten provides unique capabilities such as physically contiguous application memory, transparent large page support, and noise-free tick-less operation, which enable HPC applications to obtain greater efficiency and scalability than with general-purpose OS kernels.

  2. Parallel, Distributed Scripting with Python

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Miller, P J

    2002-05-24

    Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI library gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc ... These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and coordinate the work.

  3. Bayer image parallel decoding based on GPU

    NASA Astrophysics Data System (ADS)

    Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

    2012-11-01

    In photoelectric tracking systems, Bayer images are traditionally decoded on the CPU. However, this becomes too slow as images grow large, for example to 2K×2K×16bit. To accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA Graphics Processing Units (GPUs) supporting the CUDA architecture. The decoding procedure can be divided into three parts: a serial part, a task-parallel part, and a data-parallel part comprising inverse quantization, the inverse discrete wavelet transform (IDWT), and image post-processing. To reduce execution time, the task-parallel part is optimized with OpenMP techniques, while the data-parallel part gains efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, coalesced memory access optimization, and texture memory optimization. In particular, the IDWT is significantly sped up by rewriting the 2D (two-dimensional) serial IDWT as a 1D parallel IDWT. In experiments with a 1K×1K×16bit Bayer image, the data-parallel part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed; experimental results show that it achieves a 3 to 5 times speed increase compared to the serial CPU method.
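
    As a hedged illustration of the OpenMP task-parallel stage (the per-channel work below is fabricated; the decoder's actual subtasks are not specified at this level of detail), independent decoding subtasks can be spawned as explicit OpenMP tasks:

        #include <omp.h>
        #include <vector>

        // Placeholder subtask; the real task-parallel stage covers independent
        // decoding steps rather than this fabricated per-channel scaling.
        static void decode_channel(std::vector<unsigned short>& ch) {
            for (auto& v : ch) v = static_cast<unsigned short>(v * 2);
        }

        int main() {
            std::vector<std::vector<unsigned short>> channels(
                4, std::vector<unsigned short>(1 << 20, 1));

            #pragma omp parallel
            #pragma omp single
            {
                for (auto& ch : channels) {
                    auto* pch = &ch;                 // pin the reference for the task
                    #pragma omp task firstprivate(pch)
                    decode_channel(*pch);
                }
                #pragma omp taskwait                 // all subtasks done before proceeding
            }
            return 0;
        }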

  4. PhreeqcRM: A reaction module for transport simulators based on the geochemical model PHREEQC

    USGS Publications Warehouse

    Parkhurst, David L.; Wissmeier, Laurin

    2015-01-01

    PhreeqcRM is a geochemical reaction module designed specifically to perform equilibrium and kinetic reaction calculations for reactive transport simulators that use an operator-splitting approach. The basic function of the reaction module is to take component concentrations from the model cells of the transport simulator, run geochemical reactions, and return updated component concentrations to the transport simulator. If multicomponent diffusion is modeled (e.g., Nernst–Planck equation), then aqueous species concentrations can be used instead of component concentrations. The reaction capabilities are a complete implementation of the reaction capabilities of PHREEQC. In each cell, the reaction module maintains the composition of all of the reactants, which may include minerals, exchangers, surface complexers, gas phases, solid solutions, and user-defined kinetic reactants. PhreeqcRM assigns initial and boundary conditions for model cells based on standard PHREEQC input definitions (files or strings) of chemical compositions of solutions and reactants. Additional PhreeqcRM capabilities include methods to eliminate reaction calculations for inactive parts of a model domain, transfer concentrations and other model properties, and retrieve selected results. The module demonstrates good scalability for parallel processing by using multiprocessing with MPI (message passing interface) on distributed memory systems, and limited scalability using multithreading with OpenMP on shared memory systems. PhreeqcRM is written in C++, but interfaces allow methods to be called from C or Fortran. By using the PhreeqcRM reaction module, an existing multicomponent transport simulator can be extended to simulate a wide range of geochemical reactions. Results of the implementation of PhreeqcRM as the reaction engine for transport simulators PHAST and FEFLOW are shown by using an analytical solution and the reactive transport benchmark of MoMaS.

  5. Monte Carlo based investigation of berry phase for depth resolved characterization of biomedical scattering samples

    NASA Astrophysics Data System (ADS)

    Baba, J. S.; Koju, V.; John, D.

    2015-03-01

    The propagation of light in turbid media is an active area of research with relevance to numerous investigational fields, e.g., biomedical diagnostics and therapeutics. The statistical random-walk nature of photon propagation through turbid media is ideal for computational based modeling and simulation. Ready access to super computing resources provides a means for attaining brute force solutions to stochastic light-matter interactions entailing scattering by facilitating timely propagation of sufficient (>10^7) photons while tracking characteristic parameters based on the incorporated physics of the problem. One such model that works well for isotropic but fails for anisotropic scatter, which is the case for many biomedical sample scattering problems, is the diffusion approximation. In this report, we address this by utilizing Berry phase (BP) evolution as a means for capturing anisotropic scattering characteristics of samples in the preceding depth where the diffusion approximation fails. We extend the polarization sensitive Monte Carlo method of Ramella-Roman, et al., to include the computationally intensive tracking of photon trajectory in addition to polarization state at every scattering event. To speed up the computations, which entail the appropriate rotations of reference frames, the code was parallelized using OpenMP. The results presented reveal that BP is strongly correlated to the photon penetration depth, thus potentiating the possibility of polarimetric depth resolved characterization of highly scattering samples, e.g., biological tissues.
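
    A minimal sketch of the per-photon OpenMP parallelism described above, under strong simplifying assumptions of ours (isotropic scattering in a unit-thickness slab, a placeholder absorption probability, and no polarization or Berry-phase tracking); note the per-thread random number generator, which keeps photon histories independent:

        #include <omp.h>
        #include <cmath>
        #include <cstdio>
        #include <random>

        int main() {
            const long nphotons = 10000000;      // order of the >10^7 quoted above
            long transmitted = 0;

            #pragma omp parallel reduction(+ : transmitted)
            {
                // One generator per thread so scattering histories stay independent.
                std::mt19937_64 rng(1234 + omp_get_thread_num());
                std::uniform_real_distribution<double> uni(0.0, 1.0);

                #pragma omp for
                for (long i = 0; i < nphotons; ++i) {
                    double depth = 0.0;
                    while (depth < 1.0) {                    // slab of unit optical thickness
                        depth += -std::log(1.0 - uni(rng));  // sample a free path
                        if (uni(rng) < 0.1) break;           // absorbed (placeholder albedo)
                    }
                    if (depth >= 1.0) ++transmitted;
                }
            }
            std::printf("transmitted fraction: %f\n",
                        static_cast<double>(transmitted) / nphotons);
            return 0;
        }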

  6. Monte Carlo based investigation of Berry phase for depth resolved characterization of biomedical scattering samples

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Baba, Justin S; John, Dwayne O; Koju, Vijay

    The propagation of light in turbid media is an active area of research with relevance to numerous investigational fields, e.g., biomedical diagnostics and therapeutics. The statistical random-walk nature of photon propagation through turbid media is ideal for computational based modeling and simulation. Ready access to super computing resources provides a means for attaining brute force solutions to stochastic light-matter interactions entailing scattering by facilitating timely propagation of sufficient (>10 million) photons while tracking characteristic parameters based on the incorporated physics of the problem. One such model that works well for isotropic but fails for anisotropic scatter, which is the case for many biomedical sample scattering problems, is the diffusion approximation. In this report, we address this by utilizing Berry phase (BP) evolution as a means for capturing anisotropic scattering characteristics of samples in the preceding depth where the diffusion approximation fails. We extend the polarization sensitive Monte Carlo method of Ramella-Roman, et al., to include the computationally intensive tracking of photon trajectory in addition to polarization state at every scattering event. To speed up the computations, which entail the appropriate rotations of reference frames, the code was parallelized using OpenMP. The results presented reveal that BP is strongly correlated to the photon penetration depth, thus potentiating the possibility of polarimetric depth resolved characterization of highly scattering samples, e.g., biological tissues.

  7. OpenRBC: Redefining the Frontier of Red Blood Cell Simulations at Protein Resolution

    NASA Astrophysics Data System (ADS)

    Tang, Yu-Hang; Lu, Lu; Li, He; Grinberg, Leopold; Sachdeva, Vipin; Evangelinos, Constantinos; Karniadakis, George

    We present a from-scratch development of OpenRBC, a coarse-grained molecular dynamics code, which is capable of performing an unprecedented in silico experiment - simulating an entire mammalian red blood cell lipid bilayer and cytoskeleton modeled by 4 million mesoscopic particles - on a single shared memory node. To achieve this, we invented an adaptive spatial searching algorithm to accelerate the computation of short-range pairwise interactions in an extremely sparse 3D space. The algorithm is based on a Voronoi partitioning of the point cloud of coarse-grained particles, and is continuously updated over the course of the simulation. The algorithm enables the construction of a lattice-free cell list, i.e., the key spatial searching data structure in our code, in O(N) time and space, with cells whose position and shape adapt automatically to the local density and curvature. The code implements NUMA/NUCA-aware OpenMP parallelization and achieves perfect scaling with up to hundreds of hardware threads. The code outperforms a legacy solver by more than 8 times in time-to-solution and more than 20 times in problem size, thus providing a new venue for probing the cytomechanics of red blood cells. This work was supported by the Department of Energy (DOE) Collaboratory on Mathematics for Mesoscopic Modeling of Materials (CM4). YHT acknowledges partial financial support from an IBM Ph.D. Scholarship Award.

  8. GENESIS: a hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulations.

    PubMed

    Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji

    2015-07-01

    GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310-323. doi: 10.1002/wcms.1220.

  9. Fast 3D Surface Extraction

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sewell, Christopher Meyer; Patchett, John M.; Ahrens, James P.

    Ocean scientists searching for isosurfaces and/or thresholds of interest in high resolution 3D datasets required a tedious and time-consuming interactive exploration experience. PISTON research and development activities are enabling ocean scientists to rapidly and interactively explore isosurfaces and thresholds in their large data sets using a simple slider with real time calculation and visualization of these features. Ocean scientists can now visualize more features in less time, helping them gain a better understanding of the high resolution data sets they work with on a daily basis. Isosurface timings (512^3 grid): VTK 7.7 s, Parallel VTK (48-core) 1.3 s, PISTON OpenMP (48-core) 0.2 s, PISTON CUDA (Quadro 6000) 0.1 s.

  10. A high-performance model for shallow-water simulations in distributed and heterogeneous architectures

    NASA Astrophysics Data System (ADS)

    Conde, Daniel; Canelas, Ricardo B.; Ferreira, Rui M. L.

    2017-04-01

    One of the most common challenges in hydrodynamic modelling is the trade-off one must make between highly resolved simulations and the time required for their computation. In the particular case of urban floods, modelers are often forced to simplify the complex geometries of the problem, or to implicitly include some of its hydrodynamic effects, due to the typically very large spatial scales involved and limited computational resources. At CEris - Instituto Superior Técnico, Universidade de Lisboa - the STAV-2D shallow-water model, particularly suited for strong transient flows in complex and dynamic geometries, has been under development in recent years (Canelas et al., 2013 & Conde et al., 2013). The model is based on an explicit, first-order 2DH finite-volume discretization scheme for unstructured triangular meshes, in which a flux-splitting technique is paired with a reviewed Roe-Riemann solver, yielding a model applicable to discontinuous flows over time-evolving geometries. STAV-2D features solid transport in both Eulerian and Lagrangian forms, the former describing the transport of fine natural sediments and the latter large individual debris. The model has been validated with theoretical solutions and laboratory experiments (Canelas et al., 2013 & Conde et al., 2015). This work presents our most recent effort in STAV-2D: the re-design of the code in a modern Object-Oriented parallel framework for heterogeneous computations on CPUs and GPUs. The programming language of choice for this re-design was C++, due to its wide support of established and emerging parallel programming interfaces. The current implementation of STAV-2D provides two different levels of parallel granularity: inter-node and intra-node. Inter-node parallelism is achieved by distributing a simulation across a set of worker nodes, with communication between nodes being explicitly managed through MPI. At this level, the main difficulty is associated with the unstructured nature of the mesh topology; the employed solution, based on space-filling curves, is analyzed and discussed. Intra-node parallelism is achieved through OpenMP for CPUs and CUDA for GPUs, depending on the kind of device on which the process is running. Here the main difficulty is associated with the Object-Oriented approach, where the presence of complex data structures can degrade model performance considerably. STAV-2D now supports fully distributed and heterogeneous simulations where multiple different devices can be used to accelerate computation time. The advantages, shortcomings and specific solutions of the employed unified Object-Oriented approach, where the source code for CPU and GPU shares the same compilation units (with no device-specific branches, as seen in other available models), are discussed and quantified with a thorough scalability and performance analysis. The assembled parallel model is expected to achieve faster-than-real-time simulations at high resolutions (from meters to sub-meter) in large-scale problems (from cities to watersheds), effectively bridging the gap between detailed and timely simulation results. Acknowledgements This research was partially supported by Portuguese and European funds, within programs COMPETE2020 and PORL-FEDER, through project PTDC/ECM-HID/6387/2014 and Doctoral Grant SFRH/BD/97933/2013 granted by the National Foundation for Science and Technology (FCT). References Canelas, R.; Murillo, J. & Ferreira, R.M.L. (2013), Two-dimensional depth-averaged modelling of dam-break flows over mobile beds. Journal of Hydraulic Research, 51(4), 392-407. Conde, D. A. S.; Baptista, M. A. V.; Sousa Oliveira, C. & Ferreira, R. M. L. (2013), A shallow-flow model for the propagation of tsunamis over complex geometries and mobile beds, Nat. Hazards and Earth Syst. Sci., 13, 2533-2542. Conde, D. A. S.; Telhado, M. J.; Viana Baptista, M. A. & Ferreira, R. M. L. (2015), Severity and exposure associated with tsunami actions in urban waterfronts: the case of Lisbon, Portugal. Natural Hazards, Springer, 79, 2125, DOI:10.1007/s11069-015-1951-z

  11. A fast algorithm for forward-modeling of gravitational fields in spherical coordinates with 3D Gauss-Legendre quadrature

    NASA Astrophysics Data System (ADS)

    Zhao, G.; Liu, J.; Chen, B.; Guo, R.; Chen, L.

    2017-12-01

    Forward modeling of gravitational fields at large scale requires considering the curvature of the Earth and evaluating Newton's volume integral in spherical coordinates. To acquire fast and accurate gravitational effects for subsurface structures, the subsurface mass distribution is usually discretized into small spherical prisms (called tesseroids), whose gravity fields are generally calculated numerically. One of the commonly used numerical methods is 3D Gauss-Legendre quadrature (GLQ). However, traditional GLQ integration suffers from low computational efficiency and relatively poor accuracy when the observation surface is close to the source region. We developed a fast and highly accurate 3D GLQ integration based on the equivalence of kernel matrices, adaptive discretization, and parallelization using OpenMP. The kernel-matrix equivalence strategy increases efficiency and reduces memory consumption by calculating and storing identical matrix elements in each kernel matrix just once. The adaptive discretization strategy is used to improve accuracy. Numerical investigations show that the execution time of the proposed method is reduced by two orders of magnitude compared with a traditional method lacking these optimization strategies, and high accuracy is guaranteed no matter how close the computation points are to the source region. In addition, the algorithm dramatically reduces the memory requirement, by a factor of N compared with the traditional method, where N is the number of discretizations of the source region in the longitudinal direction. This makes large-scale gravity forward modeling and inversion with a fine discretization possible.
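
    A bare-bones sketch of the outer OpenMP loop implied above (our illustration: the 3-point Gauss-Legendre nodes and weights are the standard ones, but the kernel, the problem sizes, and the absence of adaptive subdivision are simplifications):

        #include <omp.h>
        #include <vector>
        #include <cmath>

        int main() {
            // 3-point Gauss-Legendre nodes and weights on [-1, 1].
            const double node[3] = {-std::sqrt(3.0 / 5.0), 0.0, std::sqrt(3.0 / 5.0)};
            const double wgt[3]  = {5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0};

            const int nobs = 4096, ntess = 2000;   // illustrative sizes
            std::vector<double> g(nobs, 0.0);

            // Observation points are independent; dynamic scheduling absorbs the
            // varying cost that adaptive subdivision would introduce in a real code.
            #pragma omp parallel for schedule(dynamic, 32)
            for (int p = 0; p < nobs; ++p) {
                double sum = 0.0;
                for (int t = 0; t < ntess; ++t)
                    for (int i = 0; i < 3; ++i)
                        for (int j = 0; j < 3; ++j)
                            for (int k = 0; k < 3; ++k)
                                // Placeholder for the gravitational kernel
                                // evaluated at the GLQ node of tesseroid t.
                                sum += wgt[i] * wgt[j] * wgt[k]
                                     / (1.0 + p + t + node[i] * node[j] * node[k]);
                g[p] = sum;
            }
            return 0;
        }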

  12. Multi-core and GPU accelerated simulation of a radial star target imaged with equivalent t-number circular and Gaussian pupils

    NASA Astrophysics Data System (ADS)

    Greynolds, Alan W.

    2013-09-01

    Results from the GelOE optical engineering software are presented for the through-focus, monochromatic coherent and polychromatic incoherent imaging of a radial "star" target for equivalent t-number circular and Gaussian pupils. The FFT-based simulations are carried out using OpenMP threading on a multi-core desktop computer, with and without the aid of a many-core NVIDIA GPU accessing its cuFFT library. It is found that a custom FFT optimized for the 12-core host has similar performance to a simply implemented 256-core GPU FFT. A more sophisticated version of the latter but tuned to reduce overhead on a 448-core GPU is 20 to 28 times faster than a basic FFT implementation running on one CPU core.

  13. GENESIS: a hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulations

    PubMed Central

    Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji

    2015-01-01

    GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310–323. doi: 10.1002/wcms.1220 PMID:26753008

  14. Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.

    Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neighboring inner loops may exhibit different concurrency patterns (e.g., reduction vs. forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.

  15. High-Performance Reactive Particle Tracking with Adaptive Representation

    NASA Astrophysics Data System (ADS)

    Schmidt, M.; Benson, D. A.; Pankavich, S.

    2017-12-01

    Lagrangian particle tracking algorithms have been shown to be effective tools for modeling chemical reactions in imperfectly-mixed media. One disadvantage of these algorithms is the possible need to employ large numbers of particles in simulations, depending on the concentration covariance structure, and these large particle numbers can lead to long computation times. Two distinct approaches have recently arisen to overcome this. One method employs spatial kernels that are related to a specified, reduced particle number; however, over-wide kernels, dictated by a very low particle number, lead to an excess of reaction calculations and cause a reduction in performance. Another formulation involves hybrid particles that carry multiple species of reactant, wherein each particle is treated as its own well-mixed volume, obviating the need for large numbers of particles for each species but still requiring a fixed number of hybrid particles. Here, we combine these two approaches and demonstrate an improved method for simulating a given system in a computationally efficient manner. Additionally, the independent nature of transport and reaction calculations in this approach allows for significant gains via parallelization in an MPI or OpenMP context. For benchmarking, we choose a CO2 injection simulation with dissolution and precipitation of calcite and dolomite, allowing us to derive the proper treatment of interaction between solid and aqueous phases.
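
    The transport half of such an operator split parallelizes cleanly because each particle's random-walk move is independent; below is a sketch of ours (hypothetical two-species hybrid particles, fixed velocity and dispersion coefficients), not the authors' code. Reactions, which couple nearby particles, would need the kernel- or mixing-based treatment described above.

        #include <omp.h>
        #include <cmath>
        #include <random>
        #include <vector>

        struct Particle { double x; double mass[2]; };  // two species per hybrid particle

        int main() {
            std::vector<Particle> parts(1000000, Particle{0.0, {1.0, 1.0}});
            const double v = 0.1, D = 0.01, dt = 1.0;   // illustrative coefficients

            #pragma omp parallel
            {
                // Per-thread generator keeps the random walks independent.
                std::mt19937_64 rng(42 + omp_get_thread_num());
                std::normal_distribution<double> gauss(0.0, std::sqrt(2.0 * D * dt));

                #pragma omp for
                for (int i = 0; i < static_cast<int>(parts.size()); ++i)
                    parts[i].x += v * dt + gauss(rng);  // advection + diffusion
            }
            return 0;
        }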

  16. Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors

    NASA Astrophysics Data System (ADS)

    Sourouri, Mohammed; Birger Raknes, Espen

    2017-04-01

    In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI) the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE as this will have a major impact on the total computational cost in running RTM or FWI. From a computational point of view, applications implementing EWEs are associated with two major challenges. The first challenge is the amount of memory-bound computations involved, while the second challenge is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite its architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory and also require explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor, based on the Knights Landing (KNL) architecture, promises the computational capabilities of an accelerator but requires the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (the number of cores, tiles and memory varies with the model) organized in tiles that are connected via a 2D mesh-based interconnect. In contrast to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required. However, like most accelerators, KNL sports a memory subsystem consisting of low-level caches and 16GB of high-bandwidth MCDRAM memory. For capacity computing, up to 400GB of conventional DDR4 memory is provided. Such a strict hierarchical memory layout means that data locality is imperative if the true potential of this product is to be harnessed. In this work, we study a series of optimizations specifically targeting KNL for our EWE-based application to reduce the time-to-solution for the following 3D model sizes in grid points: 128^3, 256^3 and 512^3. We compare the results with an optimized version for multi-core CPUs running on a dual-socket Xeon E5 2680v3 system using OpenMP. Our initial naive implementation on the KNL is roughly 20% faster than the multi-core version, but by using only one thread per core and careful memory placement using the memkind library, we could achieve higher speedups. Additionally, by using the MCDRAM as cache for problem sizes that are smaller than 16 GB, further performance improvements were unlocked. Depending on the problem size, our overall results indicate that the KNL based system is approximately 2.2x faster than the 24-core Xeon E5 2680v3 system, with only modest changes to the code.
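
    A hedged sketch of the kind of explicit placement the memkind library enables (the array size is illustrative; hbw_malloc/hbw_free and hbw_check_available are the library's documented high-bandwidth-memory calls, but the surrounding logic is ours, not the authors'):

        #include <hbwmalloc.h>   // memkind's high-bandwidth-memory interface
        #include <cstdio>
        #include <cstdlib>

        int main() {
            const size_t n = 512ul * 512ul * 512ul;    // one 512^3 wavefield
            double* field = nullptr;
            bool in_hbw = false;

            if (hbw_check_available() == 0) {          // 0 means MCDRAM is usable
                field = static_cast<double*>(hbw_malloc(n * sizeof(double)));
                in_hbw = (field != nullptr);
            }
            if (!field)                                // fall back to conventional DDR4
                field = static_cast<double*>(std::malloc(n * sizeof(double)));

            std::printf("allocated %zu doubles (%s)\n", n, in_hbw ? "MCDRAM" : "DDR4");
            // ... stencil updates over `field` would go here ...

            if (in_hbw) hbw_free(field); else std::free(field);
            return 0;
        }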

  17. DasPy – Open Source Multivariate Land Data Assimilation Framework with High Performance Computing

    NASA Astrophysics Data System (ADS)

    Han, Xujun; Li, Xin; Montzka, Carsten; Kollet, Stefan; Vereecken, Harry; Hendricks Franssen, Harrie-Jan

    2015-04-01

    Data assimilation has become a popular method to integrate observations from multiple sources with land surface models to improve predictions of the water and energy cycles of the soil-vegetation-atmosphere continuum. In recent years, several land data assimilation systems have been developed in different research agencies. Because of limited software availability or adaptability, these systems are not easy to apply for multivariate land data assimilation research. Multivariate data assimilation refers to the simultaneous assimilation of observation data for multiple model state variables into a simulation model. Our main motivation was to develop an open source multivariate land data assimilation framework (DasPy), which is implemented in the Python scripting language mixed with C++ and Fortran. This system has been evaluated in several soil moisture, L-band brightness temperature and land surface temperature assimilation studies. The implementation also allows parameter estimation (soil properties and/or leaf area index) on the basis of the joint state and parameter estimation approach. LETKF (Local Ensemble Transform Kalman Filter) is implemented as the main data assimilation algorithm, and uncertainties in the data assimilation can be represented by perturbed atmospheric forcings, perturbed soil and vegetation properties and model initial conditions. The CLM4.5 (Community Land Model) was integrated as the model operator. The CMEM (Community Microwave Emission Modelling Platform), COSMIC (COsmic-ray Soil Moisture Interaction Code) and the two-source formulation were integrated as observation operators for assimilation of L-band passive microwave, cosmic-ray soil moisture probe and land surface temperature measurements, respectively. DasPy is parallelized using the hybrid MPI (Message Passing Interface) and OpenMP (Open Multi-Processing) techniques. All the input and output data flow is organized efficiently using the commonly used NetCDF file format. Online 1D and 2D visualization of data assimilation results is also implemented to facilitate post-simulation analysis. In summary, DasPy is a ready-to-use open source parallel multivariate land data assimilation framework.

  18. The ASC Sequoia Programming Model

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Seager, M

    2008-08-06

    In the late 1980's and early 1990's, Lawrence Livermore National Laboratory was deeply engrossed in determining the next generation programming model for the Integrated Design Codes (IDC) beyond vectorization for the Cray 1s series of computers. The vector model, developed in the mid 1970's first for the CDC 7600 and later extended from stack-based vector operations to memory-to-memory operations for the Cray 1s, lasted approximately 20 years (see Slide 5). The Cray vector era was deemed an extremely long-lived era as it allowed vector codes to be developed over time (the Cray 1s were faster in scalar mode than the CDC 7600) with vector unit utilization increasing incrementally over time. The other attributes of the Cray vector era at LLNL were that we developed, supported and maintained the operating system (LTSS and later NLTSS), communications protocols (LINCS), compilers (Civic Fortran77 and Model), operating system tools (e.g., batch system, job control scripting, loaders, debuggers, editors, graphics utilities, you name it) and math and highly machine-optimized libraries (e.g., SLATEC and STACKLIB). Although LTSS was adopted by Cray for early system generations, they later developed the COS and UNICOS operating systems and environments on their own. In the late 1970s and early 1980s two trends appeared that made the Cray vector programming model (described above, including both the hardware and system software aspects) seem potentially dated and slated for major revision. These trends were the appearance of low-cost CMOS microprocessors and their attendant departmental and mini-computers, and later workstations and personal computers. With the widespread adoption of Unix in the early 1980s, it appeared that LLNL (and the other DOE labs) would be left out of the mainstream of computing without a rapid transition to these 'Killer Micros' and modern OS and tools environments. The other interesting advance in the period is that systems were being developed with multiple 'cores' in them, called Symmetric Multi-Processor or Shared Memory Processor (SMP) systems. The parallel revolution had begun. The Laboratory started a small 'parallel processing project' in 1983 to study the new technology and its application to scientific computing with four people: Tim Axelrod, Pete Eltgroth, Paul Dubois and Mark Seager. Two years later, Eugene Brooks joined the team. This team focused on Unix and 'killer micro' SMPs. Indeed, Eugene Brooks was credited with coining the 'Killer Micro' term. After several generations of SMP platforms (e.g., the Sequent Balance 8000 with 8 33MHz MC32032s, the Alliant FX/8 with 8 MC68020s and FPGA-based vector units, and finally the BBN Butterfly with 128 cores), it became apparent to us that the killer micro revolution would indeed overtake Crays and that we definitely needed a new programming and systems model. The model developed by Mark Seager and Dale Nielsen focused on both the system aspects (Slide 3) and the code development aspects (Slide 4). Although now succinctly captured in two attached slides, at the time there was tremendous ferment in the research community as to which parallel programming model would emerge, dominate and survive. In addition, we wanted a model that would provide portability between platforms of a single generation but also longevity over multiple--and hopefully many--generations. Only after we developed the 'Livermore Model' and worked it out in considerable detail did it become obvious that what we came up with was the right approach.
In a nutshell, the applications programming model of the Livermore Model posited that SMP parallelism would ultimately not scale indefinitely and that one would have to bite the bullet and implement MPI parallelism within the Integrated Design Codes (IDC). We also had a major emphasis on doing everything in a completely standards-based, portable methodology with POSIX/Unix as the target environment. We decided against specialized libraries like STACKLIB for performance, but kept as many general-purpose, portable math libraries as were needed by the codes. Third, we assumed that the SMPs in clusters would evolve in time to become more powerful, feature-rich and, in particular, offer more cores. Thus, we focused on OpenMP and POSIX PThreads for programming SMP parallelism. These code porting efforts were led by Dale Nielsen, A-Division code group leader, and Randy Christensen, B-Division code group leader. Most of the porting effort revolved around removing 'Crayisms' in the codes: artifacts of LTSS/NLTSS, Civic compiler extensions beyond Fortran77, I/O libraries and dealing with new code control languages (we switched to Perl and later to Python). Adding MPI to the codes was initially problematic and error-prone because the programmers used MPI directly and sprinkled the calls throughout the code.

  19. Efficient diagonalization of the sparse matrices produced within the framework of the UK R-matrix molecular codes

    NASA Astrophysics Data System (ADS)

    Galiatsatos, P. G.; Tennyson, J.

    2012-11-01

    The most time-consuming step within the framework of the UK R-matrix molecular codes is the diagonalization of the inner-region Hamiltonian matrix (IRHM). Here we present the method that we follow to speed up this step. We use shared memory machines (SMM), distributed memory machines (DMM), the directive-based OpenMP parallel language, the function-based MPI parallel language, the sparse matrix diagonalizers ARPACK and PARPACK, a variation for real symmetric matrices of the official coordinate sparse matrix format, and finally a parallel sparse matrix-vector product (PSMV). The efficient application of these techniques relies on two important facts: the sparsity of the matrix is large enough (more than 98%), and only a small part of the matrix spectrum is needed for converged results.
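
    The PSMV is the kernel that Lanczos/Arnoldi-style iterative diagonalizers such as ARPACK call repeatedly. As a sketch (ours: plain CSR storage rather than the coordinate-format variant mentioned above), the row loop threads directly:

        #include <omp.h>
        #include <vector>

        // y = A x for a CSR matrix; rows differ in length, hence dynamic scheduling.
        void spmv_csr(int n, const std::vector<int>& rowptr, const std::vector<int>& col,
                      const std::vector<double>& val, const std::vector<double>& x,
                      std::vector<double>& y) {
            #pragma omp parallel for schedule(dynamic, 256)
            for (int r = 0; r < n; ++r) {
                double s = 0.0;
                for (int k = rowptr[r]; k < rowptr[r + 1]; ++k)
                    s += val[k] * x[col[k]];
                y[r] = s;
            }
        }

        int main() {
            // Tiny 3x3 symmetric example stored in full CSR form:
            // [[4,0,1],[0,3,0],[1,0,2]]
            std::vector<int> rowptr = {0, 2, 3, 5};
            std::vector<int> col    = {0, 2, 1, 0, 2};
            std::vector<double> val = {4.0, 1.0, 3.0, 1.0, 2.0};
            std::vector<double> x = {1.0, 1.0, 1.0}, y(3, 0.0);
            spmv_csr(3, rowptr, col, val, x, y);   // y = {5, 3, 3}
            return 0;
        }

    For a symmetric variant that stores only one triangle, each nonzero also contributes a transposed term, which introduces write conflicts that a threaded code must handle (e.g., with per-thread accumulators).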

  20. The Process of Parallelizing the Conjunction Prediction Algorithm of ESA's SSA Conjunction Prediction Service Using GPGPU

    NASA Astrophysics Data System (ADS)

    Fehr, M.; Navarro, V.; Martin, L.; Fletcher, E.

    2013-08-01

    Space Situational Awareness[8] (SSA) is defined as the comprehensive knowledge, understanding and maintained awareness of the population of space objects, the space environment and existing threats and risks. As ESA's SSA Conjunction Prediction Service (CPS) requires the repetitive application of a processing algorithm against a data set of man-made space objects, it is crucial to exploit the highly parallelizable nature of this problem. Currently the CPS system makes use of OpenMP[7] for parallelization purposes using CPU threads, but only a GPU with its hundreds of cores can fully benefit from such high levels of parallelism. This paper presents the adaptation of several core algorithms[5] of the CPS for general-purpose computing on graphics processing units (GPGPU) using NVIDIA's Compute Unified Device Architecture (CUDA).

  1. Parallel fast multipole boundary element method applied to computational homogenization

    NASA Astrophysics Data System (ADS)

    Ptaszny, Jacek

    2018-01-01

    In the present work, a fast multipole boundary element method (FMBEM) and a parallel computer code for 3D elasticity problems are developed and applied to the computational homogenization of a solid containing spherical voids. The system of equations is solved by using the GMRES iterative solver. The boundary of the body is discretized by using quadrilateral serendipity elements with an adaptive numerical integration. Operations related to a single GMRES iteration, performed by traversing the corresponding tree structure upwards and downwards, are parallelized by using the OpenMP standard. The assignment of tasks to threads is based on the assumption that the tree nodes at which the moment transformations are initialized can be partitioned into disjoint sets of equal or approximately equal size and assigned to the threads. The achieved speedup as a function of the number of threads is examined.
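
    An illustrative sketch of that thread assignment (ours, much simplified: a level-by-level upward pass where each thread takes a static chunk of one level's nodes; the atomic accumulation stands in for the race-free disjoint-set partitioning described above):

        #include <omp.h>
        #include <vector>

        struct Node { double moment; int parent; };

        // Upward pass: each level's nodes are split across threads. A real code
        // would partition nodes so that siblings never race on their parent;
        // here an atomic update keeps the toy version correct instead.
        void upward_pass(const std::vector<std::vector<int>>& levels,
                         std::vector<Node>& tree) {
            for (int l = static_cast<int>(levels.size()) - 1; l > 0; --l) {
                const std::vector<int>& lvl = levels[l];
                #pragma omp parallel for schedule(static)
                for (int i = 0; i < static_cast<int>(lvl.size()); ++i) {
                    const Node& n = tree[lvl[i]];
                    #pragma omp atomic
                    tree[n.parent].moment += n.moment;  // M2M translation placeholder
                }
            }
        }

        int main() {
            // Root (index 0) with four leaves carrying unit moments.
            std::vector<Node> tree = {{0, -1}, {1, 0}, {1, 0}, {1, 0}, {1, 0}};
            std::vector<std::vector<int>> levels = {{0}, {1, 2, 3, 4}};
            upward_pass(levels, tree);              // tree[0].moment == 4
            return 0;
        }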

  2. pFUnit 3.0 Tutorial Advanced

    NASA Technical Reports Server (NTRS)

    Clune, Tom

    2014-01-01

    This tutorial will introduce Fortran developers to unit-testing and test-driven development (TDD) using pFUnit. As with other unit-testing frameworks, pFUnit simplifies the process of writing, collecting, and executing tests while providing clear diagnostic messages for failing tests. pFUnit specifically targets the development of scientific-technical software written in Fortran and includes customized features such as assertions for multi-dimensional arrays, distributed (MPI) and thread-based (OpenMP) parallelism, and flexible parameterized tests. These sessions will include numerous examples and hands-on exercises that gradually build in complexity. Attendees are expected to have working knowledge of F90, but familiarity with object-oriented syntax in F2003 and MPI will be of benefit for the more advanced examples. By the end of the tutorial the audience should feel comfortable applying pFUnit within their own development environment.

  3. Development of an Implicit, Charge and Energy Conserving 2D Electromagnetic PIC Code on Advanced Architectures

    NASA Astrophysics Data System (ADS)

    Payne, Joshua; Taitano, William; Knoll, Dana; Liebs, Chris; Murthy, Karthik; Feltman, Nicolas; Wang, Yijie; McCarthy, Colleen; Cieren, Emanuel

    2012-10-01

    In order to solve problems such as ion coalescence and slow MHD shocks fully kinetically, we developed a fully implicit 2D energy- and charge-conserving electromagnetic PIC code, PlasmaApp2D. PlasmaApp2D differs from previous implicit PIC implementations in that it will utilize advanced architectures such as GPUs and shared-memory CPU systems, with problems too large to fit into cache. PlasmaApp2D will be a hybrid CPU-GPU code developed primarily to run on the DARWIN cluster at LANL, utilizing four 12-core AMD Opteron CPUs and two NVIDIA Tesla GPUs per node. MPI will be used for cross-node communication, OpenMP will be used for on-node parallelism, and CUDA will be used for the GPUs. Development progress and initial results will be presented.

  4. Parallel kinetic Monte Carlo simulation framework incorporating accurate models of adsorbate lateral interactions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nielsen, Jens; D’Avezac, Mayeul; Hetherington, James

    2013-12-14

    Ab initio kinetic Monte Carlo (KMC) simulations have been successfully applied for over two decades to elucidate the underlying physico-chemical phenomena on the surfaces of heterogeneous catalysts. These simulations necessitate detailed knowledge of the kinetics of elementary reactions constituting the reaction mechanism, and the energetics of the species participating in the chemistry. The information about the energetics is encoded in the formation energies of gas and surface-bound species, and the lateral interactions between adsorbates on the catalytic surface, which can be modeled at different levels of detail. The majority of previous works accounted for only pairwise-additive first nearest-neighbor interactions. More recently, cluster-expansion Hamiltonians incorporating long-range interactions and many-body terms have been used for detailed estimations of catalytic rate [C. Wu, D. J. Schmidt, C. Wolverton, and W. F. Schneider, J. Catal. 286, 88 (2012)]. In view of the increasing interest in accurate predictions of catalytic performance, there is a need for general-purpose KMC approaches incorporating detailed cluster expansion models for the adlayer energetics. We have addressed this need by building on the previously introduced graph-theoretical KMC framework, and we have developed Zacros, a FORTRAN2003 KMC package for simulating catalytic chemistries. To tackle the high computational cost in the presence of long-range interactions we introduce parallelization with OpenMP. We further benchmark our framework by simulating a KMC analogue of the NO oxidation system established by Schneider and co-workers [J. Catal. 286, 88 (2012)]. We show that taking into account only first nearest-neighbor interactions may lead to large errors in the prediction of the catalytic rate, whereas for accurate estimates thereof, one needs to include long-range terms in the cluster expansion.
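
    To make the cost concrete (a sketch of ours with a fabricated lattice and pair law, not Zacros' cluster-expansion Hamiltonian): with long-range lateral interactions, each energy evaluation sums many pair terms, and OpenMP can spread that sum across threads with a reduction:

        #include <omp.h>
        #include <vector>

        int main() {
            const int n = 2000;                   // occupied adsorbate sites
            std::vector<double> x(n), y(n);
            for (int i = 0; i < n; ++i) { x[i] = i % 50; y[i] = i / 50; }

            double E = 0.0;
            // Pair terms within a cutoff; dynamic scheduling balances the
            // triangular loop. The 1/(1+r^2) law is a placeholder interaction.
            #pragma omp parallel for reduction(+ : E) schedule(dynamic)
            for (int i = 0; i < n; ++i)
                for (int j = i + 1; j < n; ++j) {
                    double dx = x[i] - x[j], dy = y[i] - y[j];
                    double r2 = dx * dx + dy * dy;
                    if (r2 < 25.0) E += 1.0 / (1.0 + r2);
                }
            return 0;
        }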

  5. PS3 CELL Development for Scientific Computation and Research

    NASA Astrophysics Data System (ADS)

    Christiansen, M.; Sevre, E.; Wang, S. M.; Yuen, D. A.; Liu, S.; Lyness, M. D.; Broten, M.

    2007-12-01

    The Cell processor is one of the most powerful processors on the market, and researchers in the earth sciences may find its parallel architecture to be very useful. A Cell processor, with 7 cores, can easily be obtained for experimentation by purchasing a PlayStation 3 (PS3) and installing Linux and the IBM SDK. Each core of the PS3 is capable of 25 GFLOPS, giving a potential limit of 150 GFLOPS when using all 6 SPUs (synergistic processing units) with vectorized algorithms. We have used the Cell's computational power to create a program which takes simulated tsunami datasets, parses them, and returns a colorized height field image using ray casting techniques. As expected, the time required to create an image is inversely proportional to the number of SPUs used. We believe that this trend will continue when multiple PS3s are chained using OpenMP functionality, and we are in the process of researching this. By using the Cell to visualize tsunami data, we have found that its greatest feature is its power. This fact aligns well with the needs of the scientific community, where the limiting factor is time. Any algorithm, such as the heat equation, that can be subdivided into multiple parts can take advantage of the PS3 Cell's ability to split the computations across the 6 SPUs, reducing the required run time to one sixth. Further vectorization of the code can allow for 4 simultaneous floating-point operations by using the SIMD (single instruction, multiple data) capabilities of the SPU, increasing efficiency 24 times.

  6. A fully non-linear multi-species Fokker–Planck–Landau collision operator for simulation of fusion plasma

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hager, Robert, E-mail: rhager@pppl.gov; Yoon, E.S., E-mail: yoone@rpi.edu; Ku, S., E-mail: sku@pppl.gov

    2016-06-15

    Fusion edge plasmas can be far from thermal equilibrium and require the use of a non-linear collision operator for accurate numerical simulations. In this article, the non-linear single-species Fokker–Planck–Landau collision operator developed by Yoon and Chang (2014) [9] is generalized to include multiple particle species. The finite volume discretization used in this work naturally yields exact conservation of mass, momentum, and energy. The implementation of this new non-linear Fokker–Planck–Landau operator in the gyrokinetic particle-in-cell codes XGC1 and XGCa is described and results of a verification study are discussed. Finally, the numerical techniques that make our non-linear collision operator viable on high-performance computing systems are described, including specialized load balancing algorithms and nested OpenMP parallelization. The collision operator's good weak and strong scaling behavior is shown.
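
    The nested OpenMP parallelization mentioned above can be pictured with a small sketch: an outer thread team sweeps the configuration-space vertices while inner teams share the velocity-space work at each vertex. This is a hedged illustration of the general pattern only, with invented names and sizes, not the XGC1/XGCa implementation.

    #include <omp.h>

    #define NV 64   /* assumed velocity-space grid points per direction */

    void collide_all_vertices(double grid[][NV][NV], int n_vertices) {
        omp_set_max_active_levels(2);            /* allow two nested levels */
        #pragma omp parallel for num_threads(4)  /* outer team: mesh vertices */
        for (int v = 0; v < n_vertices; ++v) {
            /* inner team: velocity-space work for this vertex */
            #pragma omp parallel for num_threads(2) collapse(2)
            for (int i = 0; i < NV; ++i)
                for (int j = 0; j < NV; ++j)
                    grid[v][i][j] *= 0.99;  /* stand-in for the Landau kernel */
        }
    }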

  7. A fully non-linear multi-species Fokker–Planck–Landau collision operator for simulation of fusion plasma

    DOE PAGES

    Hager, Robert; Yoon, E. S.; Ku, S.; ...

    2016-04-04

    Fusion edge plasmas can be far from thermal equilibrium and require the use of a non-linear collision operator for accurate numerical simulations. The non-linear single-species Fokker–Planck–Landau collision operator developed by Yoon and Chang (2014) [9] is generalized to include multiple particle species. Moreover, the finite volume discretization used in this work naturally yields exact conservation of mass, momentum, and energy. The implementation of this new non-linear Fokker–Planck–Landau operator in the gyrokinetic particle-in-cell codes XGC1 and XGCa is described and results of a verification study are discussed. Finally, the numerical techniques that make our non-linear collision operator viable on high-performance computing systems are described, including specialized load balancing algorithms and nested OpenMP parallelization. As a result, the collision operator's good weak and strong scaling behavior is shown.

  8. BCYCLIC: A parallel block tridiagonal matrix cyclic solver

    NASA Astrophysics Data System (ADS)

    Hirshman, S. P.; Perumalla, K. S.; Lynch, V. E.; Sanchez, R.

    2010-09-01

    A block tridiagonal matrix is factored with minimal fill-in using a cyclic reduction algorithm that is easily parallelized. Storage of the factored blocks allows the application of the inverse to multiple right-hand sides which may not be known at factorization time. Scalability with the number of block rows is achieved with cyclic reduction, while scalability with the block size is achieved using multithreaded routines (OpenMP, GotoBLAS) for block matrix manipulation. This dual scalability is a noteworthy feature of this new solver, as well as its ability to efficiently handle arbitrary (non-powers-of-2) block row and processor numbers. Comparison with a state-of-the-art parallel sparse solver is presented. It is expected that this new solver will allow many physical applications to optimally use the parallel resources on current supercomputers. Example usage of the solver in magneto-hydrodynamic (MHD), three-dimensional equilibrium solvers for high-temperature fusion plasmas is cited.
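
    To make the parallelism in cyclic reduction concrete, the sketch below performs one elimination level for a scalar tridiagonal system; the solver described above applies the same recurrences to dense blocks through threaded BLAS calls. A hedged, hypothetical illustration covering interior rows only.

    /* One level of cyclic reduction for a[i]x[i-1] + b[i]x[i] + c[i]x[i+1] = d[i].
     * The rows reduced at a given level are mutually independent,
     * so the loop parallelizes directly with OpenMP. */
    void cyclic_reduce_level(double *a, double *b, double *c, double *d,
                             int n, int stride) {
        #pragma omp parallel for
        for (int i = stride; i < n - stride; i += 2 * stride) {
            double alpha = a[i] / b[i - stride];  /* eliminates x[i-stride] */
            double gamma = c[i] / b[i + stride];  /* eliminates x[i+stride] */
            a[i] = -alpha * a[i - stride];
            b[i] -= alpha * c[i - stride] + gamma * a[i + stride];
            c[i] = -gamma * c[i + stride];
            d[i] -= alpha * d[i - stride] + gamma * d[i + stride];
        }
    }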

  9. Characterization of UMT2013 Performance on Advanced Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Howell, Louis

    2014-12-31

    This paper presents part of a larger effort to make detailed assessments of several proxy applications on various advanced architectures, with the eventual goal of extending these assessments to codes of programmatic interest running more realistic simulations. The focus here is on UMT2013, a proxy implementation of deterministic transport for unstructured meshes. I present weak and strong MPI scaling results and studies of OpenMP efficiency on the Sequoia BG/Q system at LLNL, with comparison against similar tests on an Intel Sandy Bridge TLCC2 system. The hardware counters on BG/Q provide detailed information on many aspects of on-node performance, while information from the mpiP tool gives insight into the reasons for the differing scaling behavior on these two different architectures. Preliminary tests that exploit NVRAM as extended memory on an Ivy Bridge machine designed for “Big Data” applications are also included.

  10. GANDALF - Graphical Astrophysics code for N-body Dynamics And Lagrangian Fluids

    NASA Astrophysics Data System (ADS)

    Hubber, D. A.; Rosotti, G. P.; Booth, R. A.

    2018-01-01

    GANDALF is a new hydrodynamics and N-body dynamics code designed for investigating planet formation, star formation and star cluster problems. GANDALF is written in C++, parallelized with both OpenMP and MPI and contains a Python library for analysis and visualization. The code has been written with a fully object-oriented approach to easily allow user-defined implementations of physics modules or other algorithms. The code currently contains implementations of smoothed particle hydrodynamics, meshless finite-volume and collisional N-body schemes, but can easily be adapted to include additional particle schemes. We present in this paper the details of its implementation, results from the test suite, serial and parallel performance results and discuss the planned future development. The code is freely available as an open source project on the code-hosting website GitHub at https://github.com/gandalfcode/gandalf and is available under the GPLv2 license.

  11. Fast data reconstructed method of Fourier transform imaging spectrometer based on multi-core CPU

    NASA Astrophysics Data System (ADS)

    Yu, Chunchao; Du, Debiao; Xia, Zongze; Song, Li; Zheng, Weijian; Yan, Min; Lei, Zhenggang

    2017-10-01

    An imaging spectrometer acquires a two-dimensional spatial image and a one-dimensional spectrum at the same time, which makes it highly useful for color and spectral measurements, true-color image synthesis, military reconnaissance, and so on. To realize fast reconstruction of Fourier transform imaging spectrometer data, we designed an optimized reconstruction algorithm using OpenMP parallel computing, and applied it to data from the HyperSpectral Imager on the Chinese `HJ-1' satellite. The results show that the method based on multi-core parallel computing makes full use of the multi-core CPU hardware resources and significantly improves the efficiency of the spectrum reconstruction processing. If the technique is applied to workstations with more cores for parallel computing, it will be possible to complete real-time data processing for a Fourier transform imaging spectrometer with a single computer.
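
    The reconstruction is embarrassingly parallel across detector pixels, which is what makes OpenMP effective here. The hedged sketch below threads a per-pixel transform loop; a naive DFT stands in for the production FFT, and all names are assumptions rather than the authors' code.

    #include <math.h>

    /* Transform one interferogram per detector pixel, in parallel.
     * ifgm: n_pixels x n_opd interferogram samples; spec: output magnitudes. */
    void reconstruct_spectra(const double *ifgm, double *spec,
                             int n_pixels, int n_opd) {
        const double PI = 3.14159265358979323846;
        #pragma omp parallel for schedule(dynamic)
        for (int p = 0; p < n_pixels; ++p) {
            const double *in = ifgm + (long)p * n_opd;
            double *out = spec + (long)p * n_opd;
            for (int k = 0; k < n_opd; ++k) {   /* naive DFT magnitude */
                double re = 0.0, im = 0.0;
                for (int t = 0; t < n_opd; ++t) {
                    double ph = 2.0 * PI * k * t / (double)n_opd;
                    re += in[t] * cos(ph);
                    im -= in[t] * sin(ph);
                }
                out[k] = sqrt(re * re + im * im);
            }
        }
    }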

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Langer, Steven H.; Karlin, Ian; Marinak, Marty M.

    HYDRA is used to simulate a variety of experiments carried out at the National Ignition Facility (NIF) [4] and other high energy density physics facilities. HYDRA has packages to simulate radiation transfer, atomic physics, hydrodynamics, laser propagation, and a number of other physics effects. HYDRA has over one million lines of code and includes both MPI and thread-level (OpenMP and pthreads) parallelism. This paper measures the performance characteristics of HYDRA using hardware counters on an IBM Blue Gene/Q system. We report key ratios such as bytes/instruction and memory bandwidth for several different physics packages. The total number of bytes read and written per time step is also reported. We show that none of the packages which use significant time are memory bandwidth limited on a Blue Gene/Q. HYDRA currently issues very few SIMD instructions. The pressure on memory bandwidth will increase if high levels of SIMD instructions can be achieved.

  13. LAMMPS strong scaling performance optimization on Blue Gene/Q

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Coffman, Paul; Jiang, Wei; Romero, Nichols A.

    2014-11-12

    LAMMPS "Large-scale Atomic/Molecular Massively Parallel Simulator" is an open-source molecular dynamics package from Sandia National Laboratories. Significant performance improvements in strong-scaling and time-to-solution for this application on IBM's Blue Gene/Q have been achieved through computational optimizations of the OpenMP versions of the short-range Lennard-Jones term of the CHARMM force field and the long-range Coulombic interaction implemented with the PPPM (particle-particle-particle mesh) algorithm, enhanced by runtime parameter settings controlling thread utilization. Additionally, MPI communication performance improvements were made to the PPPM calculation by re-engineering the parallel 3D FFT to use MPICH collectives instead of point-to-point. Performance testing was done using anmore » 8.4-million atom simulation scaling up to 16 racks on the Mira system at Argonne Leadership Computing Facility (ALCF). Speedups resulting from this effort were in some cases over 2x.« less

  14. Gregarious Data Re-structuring in a Many Core Architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shrestha, Sunil; Manzano Franco, Joseph B.; Marquez, Andres

    In this paper, we have developed a new methodology that takes into consideration the access patterns of a single parallel actor (e.g. a thread), as well as the access patterns of “grouped” parallel actors that share a resource (e.g. a distributed Level 3 cache). We start with a hierarchical tile code for our target machine and apply a series of transformations at the tile level to improve data residence in a given memory hierarchy level. The contributions of this paper include (a) collaborative data restructuring for group reuse and (b) a low-overhead transformation technique to improve access patterns and bring closely connected data elements together. Preliminary results on a many core architecture, the Tilera TileGX, show promising improvements over optimized OpenMP code (up to a 31% increase in GFLOPS) and over our own previous work on fine grained runtimes (up to 16%) for selected kernels.

  15. Proceeding On: Parallelisation Of Critical Code Passages In PHOENIX/3D

    NASA Astrophysics Data System (ADS)

    Arkenberg, Mario; Wichert, Viktoria; Hauschildt, Peter H.

    2016-10-01

    Highly resolved state-of-the-art 3D atmosphere simulations will remain computationally extremely expensive for years to come. In addition to the need for more computing power, rethinking coding practices is necessary. We take a dual approach here, by introducing especially adapted, parallel numerical methods and correspondingly parallelising time critical code passages. In the following, we present our work on PHOENIX/3D. While parallelisation is generally worthwhile, it requires revision of time-consuming subroutines with respect to separability of localised data and variables in order to determine the optimal approach. Of course, the same applies to the code structure. The importance of this ongoing work can be showcased by recently derived benchmark results, which were generated utilising MPI and OpenMP. Furthermore, the need for a careful and thorough choice of an adequate, machine dependent setup is discussed.

  16. Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jin, Shuangshuang; Chen, Yousu; Wu, Di

    2015-12-09

    Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operations. It consists of a large set of differential and algebraic equations, which is computationally intensive and challenging to solve using a single-processor-based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on a shared-memory platform, and the Message Passing Interface (MPI) on distributed-memory clusters, respectively. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.

  17. A THC Simulator for Modeling Fluid-Rock Interactions

    NASA Astrophysics Data System (ADS)

    Hamidi, Sahar; Galvan, Boris; Heinze, Thomas; Miller, Stephen

    2014-05-01

    Fluid-rock interactions play an essential role in many earth processes, from a likely influence on earthquake nucleation and aftershocks, to enhanced geothermal systems, carbon capture and storage (CCS), and underground nuclear waste repositories. In THC models, two-way interactions between the different processes (thermal, hydraulic and chemical) are present. Fluid flow influences the permeability of the rock, especially if chemical reactions are taken into account. On one hand, solute concentration influences fluid properties while, on the other hand, heat can affect further chemical reactions. Estimating heat production from a naturally fractured geothermal system remains a complex problem. Previous works are typically based on a local thermal equilibrium assumption and rarely consider salinity. Dissolved salt affects the hydro- and thermodynamic behavior of the system by changing the hydraulic properties of the circulating fluid. Coupled thermal-hydraulic-chemical (THC) models are important for investigating these processes, but what is needed is a coupling to mechanics, resulting in THMC models. Although similar models currently exist (e.g. PFLOTRAN), our objective here is to develop algorithms for implementation on the Graphics Processing Unit (GPU) computer architecture, to be run on GPU clusters. To that end, we present a two-dimensional numerical simulation of a fully coupled non-isothermal non-reactive solute flow. The thermal part of the simulation models heat transfer processes for either local thermal equilibrium or nonequilibrium cases, coupled to non-reactive mass transfer described by a non-linear diffusion/dispersion model. The flow part of the model includes non-linear Darcian flow for either saturated or unsaturated scenarios. For the unsaturated case, we use the Richards approximation for a mixture of liquid and gas phases. Relative permeability and capillary pressure are determined by the van Genuchten relations. Permeability of the rock is controlled by porosity, which is itself related to effective stress. The theoretical model is solved using explicit finite differences, and runs in parallel mode with OpenMP. The code is fully modular so that any combination of the current THC processes, one- and two-phase, can be chosen. Future developments will include dissolution and precipitation of chemical components in addition to chemical erosion.
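
    A minimal sketch of the explicit finite-difference pattern mentioned above, assuming a 2-D heat-diffusion update with constant diffusivity; the production code couples several such updates (flow, transport, heat), but each one threads the same way with OpenMP. Names are invented.

    /* One explicit time step of 2-D heat diffusion on an nx x ny grid:
     * read the old field T, write the new field Tnew (interior only). */
    void diffuse_step(const double *T, double *Tnew, int nx, int ny,
                      double alpha, double dt, double dx) {
        double r = alpha * dt / (dx * dx);  /* stability requires r <= 0.25 */
        #pragma omp parallel for collapse(2)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i) {
                int c = j * nx + i;
                Tnew[c] = T[c] + r * (T[c - 1] + T[c + 1]
                                    + T[c - nx] + T[c + nx] - 4.0 * T[c]);
            }
    }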

  18. Parallelization of TWOPORFLOW, a Cartesian Grid based Two-phase Porous Media Code for Transient Thermo-hydraulic Simulations

    NASA Astrophysics Data System (ADS)

    Trost, Nico; Jiménez, Javier; Imke, Uwe; Sanchez, Victor

    2014-06-01

    TWOPORFLOW is a thermo-hydraulic code based on a porous media approach to simulate single- and two-phase flow including boiling. It is under development at the Institute for Neutron Physics and Reactor Technology (INR) at KIT. The code features a 3D transient solution of the mass, momentum and energy conservation equations for two inter-penetrating fluids with a semi-implicit continuous Eulerian type solver. The application domain of TWOPORFLOW includes the flow in standard porous media and in structured porous media such as micro-channels and cores of nuclear power plants. In the latter case, the fluid domain is coupled to a fuel rod model describing the heat flow inside the solid structure. In this work, detailed profiling tools have been utilized to determine the optimization potential of TWOPORFLOW. As a result, bottlenecks were identified and reduced in the most feasible way, leading, for instance, to an optimization of the water-steam property computation. Furthermore, an OpenMP implementation addressing the routines in charge of inter-phase momentum, energy and mass coupling delivered good performance together with high scalability on shared-memory architectures. In contrast, the approach for distributed-memory systems was to solve sub-problems resulting from the decomposition of the initial Cartesian geometry. Communication for the sub-problem boundary updates was accomplished via the Message Passing Interface (MPI) standard.

  19. Stochastic coalescence in finite systems: an algorithm for the numerical solution of the multivariate master equation.

    NASA Astrophysics Data System (ADS)

    Alfonso, Lester; Zamora, Jose; Cruz, Pedro

    2015-04-01

    The stochastic approach to coagulation considers the coalescence process in a system of a finite number of particles enclosed in a finite volume. Within this approach, the full description of the system can be obtained from the solution of the multivariate master equation, which models the evolution of the probability distribution of the state vector for the number of particles of a given mass. Unfortunately, due to its complexity, only limited results have been obtained for certain types of kernels and monodisperse initial conditions. In this work, a novel numerical algorithm for the solution of the multivariate master equation for stochastic coalescence is introduced that works for any type of kernel and initial conditions. The performance of the method was checked by comparing the numerically calculated particle mass spectrum with analytical solutions obtained for the constant and sum kernels, with excellent correspondence between the analytical and numerical solutions. To speed up the algorithm, it was parallelized with the OpenMP standard, along with an implementation that takes advantage of new accelerator technologies. Simulation results show a significant speedup of the parallelized algorithm. This study was funded by a grant from Consejo Nacional de Ciencia y Tecnologia de Mexico SEP-CONACYT CB-131879. The authors also thank LUFAC® Computacion SA de CV for CPU time and all the support provided.

  20. Towards Highly Scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jacquelin, Mathias; De Jong, Wibe A.; Bylaska, Eric J.

    2017-07-03

    The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schrödinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, we focus on adding thread level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and nonlocal pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64-water-molecule test case, which scales up to all 68 cores of the Knights Landing processor.
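
    A hedged sketch of the tall-skinny matrix product at the heart of the Lagrange-multiplier kernel: C = A^T A for an n x k matrix A with n >> k. Each thread accumulates a private k x k partial sum over its rows and merges it once, so the tiny output never becomes a contention point. Names and structure are assumptions, not the NWChem code.

    #include <stdlib.h>
    #include <string.h>

    /* C (k x k, row-major) = A^T * A for tall-skinny A (n x k, row-major). */
    void ata(const double *A, double *C, long n, int k) {
        memset(C, 0, sizeof(double) * (size_t)k * k);
        #pragma omp parallel
        {
            double *Cp = calloc((size_t)k * k, sizeof(double)); /* private sum */
            #pragma omp for nowait
            for (long r = 0; r < n; ++r)
                for (int i = 0; i < k; ++i)
                    for (int j = 0; j < k; ++j)
                        Cp[i * k + j] += A[r * k + i] * A[r * k + j];
            #pragma omp critical   /* one merge per thread */
            for (int t = 0; t < k * k; ++t)
                C[t] += Cp[t];
            free(Cp);
        }
    }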

  1. Performance analysis of the FDTD method applied to holographic volume gratings: Multi-core CPU versus GPU computing

    NASA Astrophysics Data System (ADS)

    Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.

    2013-03-01

    The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and processing time. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus a highly tuned multi-core CPU as a function of the simulation size. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. A good agreement between the results obtained using both FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
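
    The SSE-plus-OpenMP combination evaluated above can be approximated portably with the composite parallel-for-simd construct, which threads the grid sweep and asks the compiler to vectorize the inner update (the paper uses explicit SSE intrinsics instead). A hedged 1-D Yee update in normalized units, with invented names.

    /* One leapfrog step of a 1-D FDTD (Yee) scheme; c1 and c2 collect the
     * update coefficients. Each sweep is threaded and vectorized. */
    void fdtd_step(double *Ez, double *Hy, int n, double c1, double c2) {
        #pragma omp parallel for simd
        for (int i = 0; i < n - 1; ++i)
            Hy[i] += c1 * (Ez[i + 1] - Ez[i]);
        #pragma omp parallel for simd
        for (int i = 1; i < n; ++i)
            Ez[i] += c2 * (Hy[i] - Hy[i - 1]);
    }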

  2. Real-time computation of parameter fitting and image reconstruction using graphical processing units

    NASA Astrophysics Data System (ADS)

    Locans, Uldis; Adelmann, Andreas; Suter, Andreas; Fischer, Jannis; Lustermann, Werner; Dissertori, Günther; Wang, Qiulin

    2017-06-01

    In recent years graphical processing units (GPUs) have become a powerful tool in scientific computing. Their potential to speed up highly parallel applications brings the power of high performance computing to a wider range of users. However, programming these devices and integrating their use in existing applications is still a challenging task. In this paper we examined the potential of GPUs for two different applications. The first application, created at the Paul Scherrer Institut (PSI), is used for parameter fitting during data analysis of μSR (muon spin rotation, relaxation and resonance) experiments. The second application, developed at ETH, is used for PET (Positron Emission Tomography) image reconstruction and analysis. Applications currently in use were examined to identify parts of the algorithms in need of optimization. Efficient GPU kernels were created in order to allow applications to use a GPU, to speed up the previously identified parts. Benchmarking tests were performed in order to measure the achieved speedup. During this work, we focused on single GPU systems to show that real time data analysis of these problems can be achieved without the need for large computing clusters. The results show that the currently used application for parameter fitting, which uses OpenMP to parallelize calculations over multiple CPU cores, can be accelerated around 40 times through the use of a GPU. The speedup may vary depending on the size and complexity of the problem. For PET image analysis, the obtained speedups of the GPU version were more than 40× compared to a single-core CPU implementation. The achieved results show that it is possible to improve the execution time by orders of magnitude.

  3. Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

    NASA Astrophysics Data System (ADS)

    Olson, Richard F.

    2013-05-01

    Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.

  4. FoSSI: the family of simplified solver interfaces for the rapid development of parallel numerical atmosphere and ocean models

    NASA Astrophysics Data System (ADS)

    Frickenhaus, Stephan; Hiller, Wolfgang; Best, Meike

    The portable software FoSSI is introduced that—in combination with additional free solver software packages—allows for an efficient and scalable parallel solution of large sparse linear equation systems arising in finite element model codes. FoSSI is intended to support rapid model code development, completely hiding the complexity of the underlying solver packages. In particular, the model developer need not be an expert in parallelization and is yet free to switch between different solver packages by simple modifications of the interface call. FoSSI offers an efficient and easy, yet flexible interface to several parallel solvers, most of them available on the web, such as PETSC, AZTEC, MUMPS, PILUT and HYPRE. FoSSI makes use of the concept of handles for vectors, matrices, preconditioners and solvers, as frequently used in solver libraries. Hence, FoSSI allows for a flexible treatment of several linear equation systems and associated preconditioners at the same time, even in parallel on separate MPI communicators. The second special feature in FoSSI is the task specifier, a combination of keywords, each configuring a certain phase of the solver setup. This enables the user to control a solver through one single subroutine. Furthermore, FoSSI has rather similar features for all solvers, making a fast solver intercomparison or exchange an easy task. FoSSI is a community software, proven in an adaptive 2D atmosphere model and a 3D primitive equation ocean model, both formulated in finite elements. The present paper discusses perspectives of an OpenMP implementation of parallel iterative solvers based on domain decomposition methods. This approach to OpenMP solvers is rather attractive, as the code for domain-local operations of factorization, preconditioning and matrix-vector products can be readily taken from a sequential implementation that is also suitable for use in an MPI variant. Code development in this direction is in an advanced state under the name ScOPES: the Scalable Open Parallel sparse linear Equations Solver.

  5. Numerical Aspects of Nonhydrostatic Implementations Applied to a Parallel Finite Element Tsunami Model

    NASA Astrophysics Data System (ADS)

    Fuchs, A.; Androsov, A.; Harig, S.; Hiller, W.; Rakowsky, N.

    2012-04-01

    Given the danger of devastating tsunamis and the unpredictability of such events, tsunami modelling as part of warning systems is still a contemporary topic. The tsunami group of the Alfred Wegener Institute developed the simulation tool TsunAWI as a contribution to the Early Warning System in Indonesia. Although the precomputed scenarios for this purpose qualify as satisfying deliverables, the study of further improvements continues. While TsunAWI is governed by the Shallow Water Equations, an extension of the model is based on a nonhydrostatic approach. At the arrival of a tsunami wave in coastal regions with rough bathymetry, the term containing the nonhydrostatic part of pressure, which is neglected in the original hydrostatic model, gains in importance. By taking this term into account, a better approximation of the wave is expected. Differences between hydrostatic and nonhydrostatic model results are contrasted in the standard benchmark problem of a solitary wave runup on a plane beach. The observation data provided by Titov and Synolakis (1995) serve as reference. The nonhydrostatic approach implies a set of equations that are similar to the Shallow Water Equations, so the modification can be implemented on top of the existing code. However, these additional routines introduce a number of issues that must be addressed. So far, the computations of the model were purely explicit. In the nonhydrostatic version, the determination of an additional unknown and the solution of a large sparse system of linear equations are necessary. The latter constitutes the lion's share of computing time and memory requirements. Since the corresponding matrix is only symmetric in structure and not in values, an iterative Krylov subspace method is used, in particular the restarted Generalized Minimal Residual algorithm GMRES(m). With regard to optimization, we present a comparison of several combinations of sequential and parallel preconditioning techniques with respect to the number of iterations and the setup/application time. Since the software package pARMS 3.2, which provides the solving and preconditioning techniques, works via MPI parallelism, we adapted TsunAWI in an auxiliary branch and switched from OpenMP to MPI, paying particular attention to internal partition management.

  6. Random sphere packing model of heterogeneous propellants

    NASA Astrophysics Data System (ADS)

    Kochevets, Sergei Victorovich

    It is well recognized that combustion of heterogeneous propellants is strongly dependent on the propellant morphology. Recent developments in computing systems make it possible to start three-dimensional modeling of heterogeneous propellant combustion. A key component of such large scale computations is a realistic model of industrial propellants which retains the true morphology---a goal never achieved before. The research presented develops the Random Sphere Packing Model of heterogeneous propellants and generates numerical samples of actual industrial propellants. This is done by developing a sphere packing algorithm which randomly packs a large number of spheres with a polydisperse size distribution within a rectangular domain. First, the packing code is developed, optimized for performance, and parallelized using OpenMP for shared-memory architectures. Second, the morphology and packing fraction of two simple cases of unimodal and bimodal packs are investigated computationally and analytically. It is shown that both the Loose Random Packing and Dense Random Packing limits are not well defined and the growth rate of the spheres is identified as the key parameter controlling the efficiency of the packing. For a properly chosen growth rate, computational results are found to be in excellent agreement with experimental data. Third, two strategies are developed to define numerical samples of polydisperse heterogeneous propellants: the Deterministic Strategy and the Random Selection Strategy. Using these strategies, numerical samples of industrial propellants are generated. The packing fraction is investigated and it is shown that the experimental values of the packing fraction can be achieved computationally. It is strongly believed that this Random Sphere Packing Model of propellants is a major step forward in the realistic computational modeling of heterogeneous propellant combustion. In addition, a method of analysis of the morphology of heterogeneous propellants is developed which uses the concept of multi-point correlation functions. A set of intrinsic length scales of local density fluctuations in random heterogeneous propellants is identified by performing a Monte-Carlo study of the correlation functions. This method of analysis shows great promise for understanding the origins of the combustion instability of heterogeneous propellants, and is believed to become a valuable tool for the development of safe and reliable rocket engines.

  7. Coalescence-induced jumping of micro-droplets on heterogeneous superhydrophobic surfaces

    NASA Astrophysics Data System (ADS)

    Attarzadeh, Reza; Dolatabadi, Ali

    2017-01-01

    The phenomenon of coalescence-induced self-propelled droplet jumping on homogeneous and heterogeneous superhydrophobic surfaces was numerically modeled using the volume of fluid method coupled with a dynamic contact angle model. The heterogeneity of the surface was directly modeled as a series of micro-patterned pillars. To resolve the influence of air around a droplet and between the pillars, extensive simulations were performed for different droplet sizes on a textured surface. Parallel computations with OpenMP were used to accelerate the simulations so that the convergence criteria could be met. The composition of the air-solid surface underneath the droplet facilitated capturing the transition from no-slip/no-penetration to partial slip with penetration as the contact line at the triple point started moving into the air pockets. The wettability effect from the nanoscopic roughness and the coating was included in the model by using the intrinsic contact angle obtained from a previously published study. As the coalescence started, the radial velocity of the coalescing liquid bridge was partially reverted to the upward direction due to the counter-action of the surface. However, we found that the velocity varied with the size of the droplets. A part of the droplet kinetic energy was dissipated as the merged droplet started penetrating into the cavities. This was due to a different area in contact between the liquid and solid and, consequently, a higher viscous dissipation rate in the system. We showed that the effect of surface roughness is strongly significant when the size of the micro-droplet is comparable with the size of the roughness features. In addition, the relevance of droplet size to surface roughness (critical relative roughness) was numerically quantified. We also found that regardless of the viscous cutoff radius, as the relative roughness approached the value of 44, the direct inclusion of surface topography was crucial in the modeling of the droplet-surface interaction. Finally, we validated our model against existing experimental data in the literature, verifying the effect of relative roughness on the jumping velocity of a merged droplet.

  8. The Open Source Snowpack modelling ecosystem

    NASA Astrophysics Data System (ADS)

    Bavay, Mathias; Fierz, Charles; Egger, Thomas; Lehning, Michael

    2016-04-01

    Among the large number of numerical snow models available, a few stand out as quite mature and widespread. One such model is SNOWPACK, the Open Source model that is developed at the WSL Institute for Snow and Avalanche Research SLF. Over the years, various tools have been developed around SNOWPACK in order to expand its use or to integrate additional features. Today, the model is part of a whole ecosystem that has evolved to both offer seamless integration and high modularity so each tool can easily be used outside the ecosystem. Many of these Open Source tools experience their own, autonomous development and are successfully used in their own right in other models and applications. There is Alpine3D, the spatially distributed version of SNOWPACK, that forces it with terrain-corrected radiation fields and optionally with blowing and drifting snow. This model can be used on parallel systems (either with OpenMP or MPI) and has been used for applications ranging from climate change to reindeer herding. There is the MeteoIO pre-processing library that offers fully integrated data access, data filtering, data correction, data resampling and spatial interpolations. This library is now used by several other models and applications. There is the SnopViz snow profile visualization library and application that supports both measured and simulated snow profiles (relying on the CAAML standard) as well as time series. This JavaScript application can be used standalone without any internet connection or served on the web together with simulation results. There is the OSPER data platform effort with a data management service (built on the Global Sensor Network (GSN) platform) as well as a data documenting system (metadata management as a wiki). There are several distributed hydrological models for mountainous areas in ongoing development that require very little information about the soil structure, based on the assumption that in steep terrain, the most relevant information is contained in the Digital Elevation Model (DEM). There is finally a set of tools making up the operational chain to automatically run, monitor and publish SNOWPACK simulations for operational avalanche warning purposes. This tool chain has been developed with the aim of offering very low maintenance operation and very fast deployment and to easily adapt to other avalanche services.

  9. Power and Performance Trade-offs for Space Time Adaptive Processing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino

    Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP’s computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementation on the Haswell CPU architecture. The GPU architecture is able to process large data sets without an increase in power requirement. The use of shared memory has a significant impact on the power requirement of the GPU. A balance between the use of shared memory and main memory accesses leads to improved performance in a typical STAP application.

  10. Composing Data Parallel Code for a SPARQL Graph Engine

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste

    Big data analytics processes large amounts of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph-based representation provides several benefits, such as the possibility to perform in-memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries to parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which require more than basic graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 shared-memory multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces the memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.

  11. Parallel k-means++

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.
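
    For the multicore CPU path, the expensive part of k-means++ seeding is the O(n·d) pass that refreshes every point's squared distance to its nearest chosen seed after each draw; that pass threads naturally with OpenMP while the weighted draw stays serial. A hedged C sketch with invented names (the project's actual code is C++):

    #include <stdlib.h>

    /* Select k seeds from n points of dimension d (row-major in pts).
     * dist2 is caller-provided scratch of length n. */
    void kmeanspp_seed(const double *pts, long n, int d, int k,
                       long *seeds, double *dist2) {
        seeds[0] = rand() % n;
        for (long p = 0; p < n; ++p) dist2[p] = 1e300;
        for (int s = 0; s < k; ++s) {
            const double *c = pts + seeds[s] * d;
            double total = 0.0;
            #pragma omp parallel for reduction(+ : total)
            for (long p = 0; p < n; ++p) {      /* threaded distance pass */
                double d2 = 0.0;
                for (int j = 0; j < d; ++j) {
                    double diff = pts[p * d + j] - c[j];
                    d2 += diff * diff;
                }
                if (d2 < dist2[p]) dist2[p] = d2;
                total += dist2[p];
            }
            if (s + 1 == k) break;
            double r = total * ((double)rand() / RAND_MAX), acc = 0.0;
            seeds[s + 1] = n - 1;               /* fallback for fp round-off */
            for (long p = 0; p < n; ++p) {      /* serial weighted draw */
                acc += dist2[p];
                if (acc >= r) { seeds[s + 1] = p; break; }
            }
        }
    }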

  12. ARES v2: new features and improved performance

    NASA Astrophysics Data System (ADS)

    Sousa, S. G.; Santos, N. C.; Adibekyan, V.; Delgado-Mena, E.; Israelian, G.

    2015-05-01

    Aims: We present a new upgraded version of ARES. The new version includes a series of interesting new features such as automatic radial velocity correction, a fully automatic continuum determination, and an estimation of the errors for the equivalent widths. Methods: The automatic correction of the radial velocity is achieved with a simple cross-correlation function, and the automatic continuum determination, as well as the estimation of the errors, relies on a new approach to evaluating the spectral noise at the continuum level. Results: ARES v2 is totally compatible with its predecessor. We show that the fully automatic continuum determination is consistent with the previous methods applied for this task. It also presents a significant improvement in performance thanks to the implementation of parallel computation using the OpenMP library. Automatic Routine for line Equivalent widths in stellar Spectra - ARES webpage: http://www.astro.up.pt/~sousasag/ares/ Based on observations made with ESO Telescopes at the La Silla Paranal Observatory under programme ID 075.D-0800(A).
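
    The radial-velocity correction by cross-correlation can be pictured with a small sketch: slide a template across the spectrum in pixel lags, keep the lag that maximizes the correlation, and thread the lag loop with OpenMP. Uniform sampling is assumed and all names are hypothetical; this is not the ARES source.

    /* Return the pixel lag maximizing the spectrum/template correlation;
     * the caller converts the lag to a velocity via the dispersion. */
    double ccf_best_shift(const double *spec, const double *tmpl,
                          int n, int max_lag) {
        double best = -1e300;
        int best_lag = 0;
        #pragma omp parallel for
        for (int lag = -max_lag; lag <= max_lag; ++lag) {
            double c = 0.0;
            for (int i = 0; i < n; ++i) {
                int j = i + lag;
                if (j >= 0 && j < n) c += spec[i] * tmpl[j];
            }
            #pragma omp critical   /* serialize the max update */
            if (c > best) { best = c; best_lag = lag; }
        }
        return (double)best_lag;
    }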

  13. Optimizing the Performance of Reactive Molecular Dynamics Simulations for Multi-core Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aktulga, Hasan Metin; Coffman, Paul; Shan, Tzu-Ray

    2015-12-01

    Hybrid parallelism allows high performance computing applications to better leverage the increasing on-node parallelism of modern supercomputers. In this paper, we present a hybrid parallel implementation of the widely used LAMMPS/ReaxC package, where the construction of bonded and nonbonded lists and evaluation of complex ReaxFF interactions are implemented efficiently using OpenMP parallelism. Additionally, the performance of the QEq charge equilibration scheme is examined and a dual-solver is implemented. We present the performance of the resulting ReaxC-OMP package on a state-of-the-art multi-core architecture, Mira, an IBM BlueGene/Q supercomputer. For system sizes ranging from 32 thousand to 16.6 million particles, speedups in the range of 1.5-4.5x are observed using the new ReaxC-OMP software. Sustained performance improvements have been observed for up to 262,144 cores (1,048,576 processes) of Mira with a weak scaling efficiency of 91.5% in larger simulations containing 16.6 million particles.

  14. The High Level Data Reduction Library

    NASA Astrophysics Data System (ADS)

    Ballester, P.; Gabasch, A.; Jung, Y.; Modigliani, A.; Taylor, J.; Coccato, L.; Freudling, W.; Neeser, M.; Marchetti, E.

    2015-09-01

    The European Southern Observatory (ESO) provides pipelines to reduce data for most of the instruments at its Very Large Telescope (VLT). These pipelines are written as part of the development of VLT instruments, and are used both in ESO's operational environment and by science users who receive VLT data. All the pipelines are highly instrument-specific. However, experience showed that the independently developed pipelines include significant overlap, duplication and slight variations of similar algorithms. In order to reduce the cost of development, verification and maintenance of ESO pipelines, and at the same time improve the scientific quality of pipeline data products, ESO decided to develop a limited set of versatile high-level scientific functions that are to be used in all future pipelines. The routines are provided by the High-level Data Reduction Library (HDRL). To reach this goal, we first compare several candidate algorithms and verify them during a prototype phase using data sets from several instruments. Once the best algorithm and error model have been chosen, we start a design and implementation phase. The coding of HDRL is done in plain C, using Common Pipeline Library (CPL) functionality. HDRL adopts consistent function naming conventions and a well defined API to minimise future maintenance costs, implements error propagation, uses pixel quality information, employs OpenMP to take advantage of multi-core processors, and is verified with extensive unit and regression tests. This poster describes the status of the project and the lessons learned during the development of reusable code implementing algorithms of high scientific quality.

  15. DISPATCH: a numerical simulation framework for the exa-scale era - I. Fundamentals

    NASA Astrophysics Data System (ADS)

    Nordlund, Åke; Ramsey, Jon P.; Popovas, Andrius; Küffmeier, Michael

    2018-06-01

    We introduce a high-performance simulation framework that permits the semi-independent, task-based solution of sets of partial differential equations, typically manifesting as updates to a collection of `patches' in space-time. A hybrid MPI/OpenMP execution model is adopted, where work tasks are controlled by a rank-local `dispatcher' which selects, from a set of tasks generally much larger than the number of physical cores (or hardware threads), tasks that are ready for updating. The definition of a task can vary, for example, with some solving the equations of ideal magnetohydrodynamics (MHD), others non-ideal MHD, radiative transfer, or particle motion, and yet others applying particle-in-cell (PIC) methods. Tasks do not have to be grid based, while tasks that are, may use either Cartesian or orthogonal curvilinear meshes. Patches may be stationary or moving. Mesh refinement can be static or dynamic. A feature of decisive importance for the overall performance of the framework is that time-steps are determined and applied locally; this allows potentially large reductions in the total number of updates required in cases when the signal speed varies greatly across the computational domain, and therefore a corresponding reduction in computing time. Another feature is a load balancing algorithm that operates `locally' and aims to simultaneously minimize load and communication imbalance. The framework generally relies on already existing solvers, whose performance is augmented when run under the framework, due to more efficient cache usage, vectorization, local time-stepping, plus near-linear and, in principle, unlimited OpenMP and MPI scaling.
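
    The rank-local dispatcher idea maps naturally onto OpenMP tasks: one thread sweeps the patch list and spawns an update task for every patch that still lags the target time, so ready work keeps all hardware threads busy even though patches advance with different local time steps. A heavily simplified, hypothetical sketch, not the DISPATCH source.

    typedef struct { double t, dt; } Patch;

    void dispatch(Patch *patches, int n, double t_end) {
        #pragma omp parallel
        #pragma omp single            /* one thread generates the tasks */
        {
            int active = 1;
            while (active) {
                active = 0;
                for (int i = 0; i < n; ++i)
                    if (patches[i].t < t_end) {
                        active = 1;
                        #pragma omp task firstprivate(i)
                        {   /* stand-in for the per-patch PDE update;
                             * each patch takes its own local time step */
                            patches[i].t += patches[i].dt;
                        }
                    }
                #pragma omp taskwait  /* simple barrier between sweeps */
            }
        }
    }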

  16. OpenMP-accelerated SWAT simulation using Intel C and FORTRAN compilers: Development and benchmark

    NASA Astrophysics Data System (ADS)

    Ki, Seo Jin; Sugimura, Tak; Kim, Albert S.

    2015-02-01

    We developed a practical method to accelerate execution of the Soil and Water Assessment Tool (SWAT) using open (free) computational resources. The SWAT source code (rev 622) was recompiled using a non-commercial Intel FORTRAN compiler on the Ubuntu 12.04 LTS Linux platform, and newly named iOMP-SWAT in this study. The GNU utilities make, gprof, and diff were used to develop the iOMP-SWAT package, profile memory usage, and verify that parallel and serial simulations produce identical results. Among 302 SWAT subroutines, the slowest routines were identified using GNU gprof, and later modified using the Open Multi-Processing (OpenMP) library on an 8-core shared memory system. In addition, a C wrapping function was used to rapidly set large arrays to zero by cross-compiling it with the original SWAT FORTRAN package. A universal speedup ratio of 2.3 was achieved using input data sets with a large number of hydrological response units. As we specifically focus on acceleration of a single SWAT run, the use of iOMP-SWAT for parameter calibrations will significantly improve the performance of SWAT optimization.
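
    The C wrapping function mentioned above can be sketched as follows, assuming the common gfortran/ifort linkage conventions (trailing underscore, pass-by-reference arguments); the actual iOMP-SWAT symbol names are not given in the abstract, so these are assumptions.

    /* Threaded array zeroing, callable from Fortran as:
     *   call zero_array(arr, n)     ! n: integer(kind=8) element count
     * Compile with -fopenmp and link together with the Fortran objects. */
    void zero_array_(double *a, const long long *n) {
        long long len = *n;
        #pragma omp parallel for schedule(static)
        for (long long i = 0; i < len; ++i)
            a[i] = 0.0;
    }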

  17. Livermore Compiler Analysis Loop Suite

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hornung, R. D.

    2013-03-01

    LCALS is designed to evaluate compiler optimizations and performance of a variety of loop kernels and loop traversal software constructs. Some of the loop kernels are pulled directly from "Livermore Loops Coded in C", developed at LLNL (see item 11 below for details of earlier code versions). The older suites were used to evaluate floating-point performance of hardware platforms prior to porting larger application codes. The LCALS suite is geared toward assessing C++ compiler optimizations and platform performance related to SIMD vectorization, OpenMP threading, and advanced C++ language features. LCALS contains 20 of 24 loop kernels from the older Livermore Loop suites, plus various others representative of loops found in current production application codes at LLNL. The latter loops emphasize more diverse loop constructs and data access patterns than the others, such as multi-dimensional difference stencils. The loops are included in a configurable framework, which allows control of compilation, loop sampling for execution timing, which loops are run, and their lengths. It generates timing statistics for analyzing and comparing variants of individual loops. Also, it is easy to add loops to the suite as desired.

  18. High Resolution Aerospace Applications using the NASA Columbia Supercomputer

    NASA Technical Reports Server (NTRS)

    Mavriplis, Dimitri J.; Aftosmis, Michael J.; Berger, Marsha

    2005-01-01

    This paper focuses on the parallel performance of two high-performance aerodynamic simulation packages on the newly installed NASA Columbia supercomputer. These packages include both a high-fidelity, unstructured, Reynolds-averaged Navier-Stokes solver, and a fully-automated inviscid flow package for cut-cell Cartesian grids. The complementary combination of these two simulation codes enables high-fidelity characterization of aerospace vehicle design performance over the entire flight envelope through extensive parametric analysis and detailed simulation of critical regions of the flight envelope. Both packages are industrial-level codes designed for complex geometry and incorporate customized multigrid solution algorithms. The performance of these codes on Columbia is examined using both MPI and OpenMP and using both the NUMAlink and InfiniBand interconnect fabrics. Numerical results demonstrate good scalability on up to 2016 CPUs using the NUMAlink4 interconnect, with measured computational rates in the vicinity of 3 TFLOP/s, while InfiniBand showed some performance degradation at high CPU counts, particularly with multigrid. Nonetheless, the results are encouraging enough to indicate that larger test cases using combined MPI/OpenMP communication should scale well on even more processors.

  19. The fusion code XGC: Enabling kinetic study of multi-scale edge turbulent transport in ITER [Book Chapter]

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas

    The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability, which have not been possible before due to a factor of over 10 shortage in the time-to-solution for less than 5 days of wall-clock time for one physics case. Frontier techniques such as nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous application interactions, dynamic repartitioning for balancing computational work in pushing particles and in grid-related work, scalable and accurate discretization algorithms for non-linear Coulomb collisions, and communication-avoiding subcycling technology for pushing particles on both CPUs and GPUs are also utilized to dramatically improve the scalability and time-to-solution, hence enabling the difficult kinetic ITER edge simulation on a present-day leadership class computer.

  20. Project APhiD: A Lorenz-gauged A-Φ decomposition for parallelized computation of ultra-broadband electromagnetic induction in a fully heterogeneous Earth

    NASA Astrophysics Data System (ADS)

    Weiss, Chester J.

    2013-08-01

    An essential element for computational hypothesis testing, data inversion and experiment design for electromagnetic geophysics is a robust forward solver, capable of easily and quickly evaluating the electromagnetic response of arbitrary geologic structure. The usefulness of such a solver hinges on the balance among competing desires like ease of use, speed of forward calculation, scalability to large problems or compute clusters, parsimonious use of memory access, accuracy and by necessity, the ability to faithfully accommodate a broad range of geologic scenarios over extremes in length scale and frequency content. This is indeed a tall order. The present study addresses recent progress toward the development of a forward solver with these properties. Based on the Lorenz-gauged Helmholtz decomposition, a new finite volume solution over Cartesian model domains endowed with complex-valued electrical properties is shown to be stable over the frequency range 10^-2 to 10^10 Hz and over length scales from 10^-3 to 10^5 m. Benchmark examples are drawn from magnetotellurics, exploration geophysics, geotechnical mapping and laboratory-scale analysis, showing excellent agreement with reference analytic solutions. Computational efficiency is achieved through use of a matrix-free implementation of the quasi-minimum-residual (QMR) iterative solver, which eliminates explicit storage of finite volume matrix elements in favor of "on the fly" computation as needed by the iterative Krylov sequence. Further efficiency is achieved through sparse coupling matrices between the vector and scalar potentials whose non-zero elements arise only in those parts of the model domain where the conductivity gradient is non-zero. Multi-thread parallelization in the QMR solver through OpenMP pragmas is used to reduce the computational cost of its most expensive step: the single matrix-vector product at each iteration. High-level MPI communicators farm independent processes to available compute nodes for simultaneous computation of multi-frequency or multi-transmitter responses.
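
    The matrix-free product described above can be sketched as follows: rather than storing matrix entries, each application of the operator recomputes the finite-volume stencil coefficients from the cell conductivities, and OpenMP threads the cell loop that dominates each QMR iteration. A 2-D, real-valued, hypothetical simplification of the actual operator, assuming positive conductivities.

    /* y = A*x computed "on the fly": face conductivities are rebuilt from
     * cell values (harmonic means) at every application instead of being
     * stored as matrix entries. Interior cells only; h is the grid spacing. */
    void apply_A(const double *sigma, const double *x, double *y,
                 int nx, int ny, double h) {
        double w = 1.0 / (h * h);
        #pragma omp parallel for collapse(2)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i) {
                int c = j * nx + i;
                double se = 2.0 * sigma[c] * sigma[c + 1]  / (sigma[c] + sigma[c + 1]);
                double sw = 2.0 * sigma[c] * sigma[c - 1]  / (sigma[c] + sigma[c - 1]);
                double sn = 2.0 * sigma[c] * sigma[c + nx] / (sigma[c] + sigma[c + nx]);
                double ss = 2.0 * sigma[c] * sigma[c - nx] / (sigma[c] + sigma[c - nx]);
                y[c] = w * (se * (x[c + 1] - x[c]) + sw * (x[c - 1] - x[c])
                          + sn * (x[c + nx] - x[c]) + ss * (x[c - nx] - x[c]));
            }
    }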

  1. High quality NMR structures: a new force field with implicit water and membrane solvation for Xplor-NIH.

    PubMed

    Tian, Ye; Schwieters, Charles D; Opella, Stanley J; Marassi, Francesca M

    2017-01-01

    Structure determination of proteins by NMR is unique in its ability to measure restraints, very accurately, in environments and under conditions that closely mimic those encountered in vivo. For example, advances in solid-state NMR methods enable structure determination of membrane proteins in detergent-free lipid bilayers, and of large soluble proteins prepared by sedimentation, while parallel advances in solution NMR methods and optimization of detergent-free lipid nanodiscs are rapidly pushing the envelope of the size limit for both soluble and membrane proteins. These experimental advantages, however, are partially squandered during structure calculation, because the commonly used force fields are purely repulsive and neglect solvation, van der Waals forces and electrostatic energy. Here we describe a new force field, and updated energy functions, for protein structure calculations with EEFx implicit solvation, electrostatics, and van der Waals Lennard-Jones forces, in the widely used program Xplor-NIH. The new force field is based primarily on CHARMM22, facilitating calculations with a wider range of biomolecules. The new EEFx energy function has been rewritten to enable OpenMP parallelism, and optimized to enhance computation efficiency. It implements solvation, electrostatics, and van der Waals energy terms together, thus ensuring more consistent and efficient computation of the complete nonbonded energy lists. Updates in the related Python module allow detailed analysis of the interaction energies and associated parameters. The new force field and energy function work with both soluble proteins and membrane proteins, including those with cofactors or engineered tags, and are very effective in situations where experimental restraints are sparse. Results obtained for NMR-restrained calculations with a set of five soluble proteins and five membrane proteins show that structures calculated with EEFx have significant improvements in accuracy, precision, and conformation, and that structure refinement by short relaxation with EEFx yields improvements in these key metrics. These developments broaden the range of biomolecular structures that can be calculated with high fidelity from NMR restraints.

  2. Multi-phase SPH modelling of violent hydrodynamics on GPUs

    NASA Astrophysics Data System (ADS)

    Mokos, Athanasios; Rogers, Benedict D.; Stansby, Peter K.; Domínguez, José M.

    2015-11-01

    This paper presents the acceleration of multi-phase smoothed particle hydrodynamics (SPH) using a graphics processing unit (GPU), enabling large numbers of particles (10-20 million) to be simulated on a single GPU card. With novel hardware architectures such as a GPU, the optimum approach to implementing a multi-phase scheme presents some new challenges. Many more particles must be included in the calculation, and the phases have very different speeds of sound, with the largest speed of sound determining the time step; this demands efficient computation. To take full advantage of the hardware acceleration provided by a single GPU for a multi-phase simulation, four different algorithms are investigated: conditional statements, binary operators, separate particle lists and an intermediate global function. Runtime results show that the optimum approach needs to employ separate cell and neighbour lists for each phase. The profiler shows that this approach leads to a reduction in both memory transactions and arithmetic operations, giving significant runtime gains. The four different algorithms are compared against the efficiency of the optimised single-phase GPU code, DualSPHysics, for 2-D and 3-D simulations, which indicates that the multi-phase functionality has a significant computational overhead. A comparison with an optimised CPU code shows a speed up of an order of magnitude over an OpenMP simulation with 8 threads and two orders of magnitude over a single-thread simulation. A demonstration of the multi-phase SPH GPU code is provided by a 3-D dam break case impacting an obstacle. This shows better agreement with experimental results than an equivalent single-phase code. The multi-phase GPU code enables a convergence study to be undertaken on a single GPU with a large number of particles that would otherwise have required large high performance computing resources.

  3. Solution of the Skyrme-Hartree-Fock-Bogolyubov equations in the Cartesian deformed harmonic-oscillator basis. (VII) HFODD (v2.49t): A new version of the program

    NASA Astrophysics Data System (ADS)

    Schunck, N.; Dobaczewski, J.; McDonnell, J.; Satuła, W.; Sheikh, J. A.; Staszczak, A.; Stoitsov, M.; Toivanen, P.

    2012-01-01

    We describe the new version (v2.49t) of the code HFODD which solves the nuclear Skyrme-Hartree-Fock (HF) or Skyrme-Hartree-Fock-Bogolyubov (HFB) problem by using the Cartesian deformed harmonic-oscillator basis. In the new version, we have implemented the following physics features: (i) the isospin mixing and projection, (ii) the finite-temperature formalism for the HFB and HF + BCS methods, (iii) the Lipkin translational energy correction method, (iv) the calculation of the shell correction. A number of specific numerical methods have also been implemented in order to deal with large-scale multi-constraint calculations and hardware limitations: (i) the two-basis method for the HFB method, (ii) the Augmented Lagrangian Method (ALM) for multi-constraint calculations, (iii) the linear constraint method based on the approximation of the RPA matrix for multi-constraint calculations, (iv) an interface with the axial and parity-conserving Skyrme-HFB code HFBTHO, (v) the mixing of the HF or HFB matrix elements instead of the HF fields. Special care has been paid to using the code on massively parallel leadership-class computers. For this purpose, the following features are now available with this version: (i) the Message Passing Interface (MPI) framework, (ii) scalable input data routines, (iii) multi-threading via OpenMP pragmas, (iv) parallel diagonalization of the HFB matrix in the simplex-breaking case using the ScaLAPACK library. Finally, several minor errors of the previously published version were corrected.

    New version program summary
    Program title: HFODD (v2.49t)
    Catalogue identifier: ADFL_v3_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADFL_v3_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: GNU General Public Licence v3
    No. of lines in distributed program, including test data, etc.: 190 614
    No. of bytes in distributed program, including test data, etc.: 985 898
    Distribution format: tar.gz
    Programming language: FORTRAN-90
    Computer: Intel Pentium-III, Intel Xeon, AMD-Athlon, AMD-Opteron, Cray XT4, Cray XT5
    Operating system: UNIX, LINUX, Windows XP
    Has the code been vectorized or parallelized?: Yes, parallelized using MPI
    RAM: 10 Mwords
    Word size: The code is written in single precision for use on a 64-bit processor. The compiler option -r8 or +autodblpad (or equivalent) has to be used to promote all real and complex single-precision floating-point items to double precision when the code is used on a 32-bit machine.
    Classification: 17.22
    Catalogue identifier of previous version: ADFL_v2_2
    Journal reference of previous version: Comput. Phys. Comm. 180 (2009) 2361
    External routines: The user must have access to the NAGLIB subroutine f02axe, or the LAPACK subroutines zhpev, zhpevx, zheevr, or zheevd, which diagonalize complex hermitian matrices, the LAPACK subroutines dgetri and dgetrf, which invert arbitrary real matrices, the LAPACK subroutines dsyevd, dsytrf and dsytri, which compute eigenvalues and eigenfunctions of real symmetric matrices, the LINPACK subroutines zgedi and zgeco, which invert arbitrary complex matrices and calculate determinants, the BLAS routines dcopy, dscal, dgemm and dgemv for double-precision linear algebra and zcopy, zdscal, zgemm and zgemv for complex linear algebra, or provide another set of subroutines that can perform such tasks. The BLAS and LAPACK subroutines can be obtained from the Netlib Repository at the University of Tennessee, Knoxville: http://netlib2.cs.utk.edu/.
    Does the new version supersede the previous version?: Yes
    Nature of problem: The nuclear mean field and an analysis of its symmetries in realistic cases are the main ingredients of a description of nuclear states. Within the Local Density Approximation, or for a zero-range velocity-dependent Skyrme interaction, the nuclear mean field is local and velocity dependent. The locality allows for an effective and fast solution of the self-consistent Hartree-Fock equations, even for heavy nuclei, and for various nucleonic (n-particle-n-hole) configurations, deformations, excitation energies, or angular momenta. Similarly, the Local Density Approximation in the particle-particle channel, which is equivalent to using a zero-range interaction, allows for a simple implementation of pairing effects within the Hartree-Fock-Bogolyubov method.
    Solution method: The program uses the Cartesian harmonic oscillator basis to expand single-particle or single-quasiparticle wave functions of neutrons and protons interacting by means of the Skyrme effective interaction and zero-range pairing interaction. The expansion coefficients are determined by the iterative diagonalization of the mean-field Hamiltonians or Routhians which depend non-linearly on the local neutron and proton densities. Suitable constraints are used to obtain states corresponding to a given configuration, deformation or angular momentum. The method of solution has been presented in: [J. Dobaczewski, J. Dudek, Comput. Phys. Commun. 102 (1997) 166].
    Reasons for new version: Version 2.49s of HFODD provides a number of new options such as the isospin mixing and projection of the Skyrme functional, the finite-temperature HF and HFB formalism, and optimized methods to perform multi-constrained calculations. It is also the first version of HFODD to contain threading and parallel capabilities.
    Summary of revisions: Isospin mixing and projection of the HF states has been implemented. The finite-temperature formalism for the HFB equations has been implemented. The Lipkin translational energy correction method has been implemented. Calculation of the shell correction has been implemented. The two-basis method for the solution of the HFB equations has been implemented. The Augmented Lagrangian Method (ALM) for calculations with multiple constraints has been implemented. The linear constraint method based on the cranking approximation of the RPA matrix has been implemented. An interface between HFODD and the axially-symmetric and parity-conserving code HFBTHO has been implemented. The mixing of the matrix elements of the HF or HFB matrix has been implemented. A parallel interface using the MPI library has been implemented. A scalable model for reading input data has been implemented. OpenMP pragmas have been implemented in three subroutines. The diagonalization of the HFB matrix in the simplex-breaking case has been parallelized using the ScaLAPACK library. Several minor errors of the previously published version were corrected.
    Running time: In serial mode, running 6 HFB iterations for 152Dy with conserved parity and signature symmetries in a full spherical basis of N=14 shells takes approximately 8 min on an AMD Opteron processor at 2.6 GHz, assuming standard BLAS and LAPACK libraries. As a rule of thumb, the runtime of HFB calculations with conserved parity and signature symmetries roughly increases with N, where N is the number of full HO shells.
    Using custom-built optimized BLAS and LAPACK libraries (such as the ATLAS implementation) can bring down the execution time by 60%. Using the threaded version of the code with 12 threads and threaded BLAS libraries brings an additional factor-of-2 speed-up, so that the same 6 HFB iterations take on the order of 2 min 30 s.

  4. GRASS GIS: The first Open Source Temporal GIS

    NASA Astrophysics Data System (ADS)

    Gebbert, Sören; Leppelt, Thomas

    2015-04-01

    GRASS GIS is a full-featured, general-purpose Open Source geographic information system (GIS) with raster, 3D raster and vector processing support [1]. Recently, time was introduced as a new dimension that transformed GRASS GIS into the first Open Source temporal GIS with comprehensive spatio-temporal analysis, processing and visualization capabilities [2]. New spatio-temporal data types were introduced in GRASS GIS version 7 to manage raster, 3D raster and vector time series. These new data types are called space time datasets. They are designed to efficiently handle hundreds of thousands of time-stamped raster, 3D raster and vector map layers of any size. Time stamps can be defined as time intervals or time instances in Gregorian calendar time or relative time. Space time datasets simplify the processing and analysis of large time series in GRASS GIS, since these new data types are used as input and output parameters in temporal modules. The handling of space time datasets is therefore analogous to the handling of raster, 3D raster and vector map layers in GRASS GIS. A new dedicated Python library, the GRASS GIS Temporal Framework, was designed to implement the spatio-temporal data types and their management. The framework provides the functionality to efficiently handle hundreds of thousands of time-stamped map layers and their spatio-temporal topological relations. The framework supports reasoning based on the temporal granularity of space time datasets as well as on their temporal topology. It was designed in conjunction with the PyGRASS [3] library to support parallel processing of large datasets, which has a long tradition in GRASS GIS [4,5]. We will present a subset of the more than 40 temporal modules that were implemented based on the GRASS GIS Temporal Framework, PyGRASS and the GRASS GIS Python scripting library. These modules provide a comprehensive temporal GIS tool set. The functionality ranges from space time dataset and time-stamped map layer management, over temporal aggregation, temporal accumulation, spatio-temporal statistics, spatio-temporal sampling, temporal algebra, temporal topology analysis, time series animation and temporal topology visualization, to time series import and export capabilities with support for the NetCDF and VTK data formats. We will present several temporal modules that support parallel processing of raster and 3D raster time series. [1] M. Neteler, D. Beaudette, P. Cavallini, L. Lami, J. Cepicky, "GRASS GIS", in: G. Brent Hall, Michael G. Leahy (Eds.), Open Source Approaches in Spatial Data Handling, Vol. 2 (2008), pp. 171-199, doi:10.1007/978-3-540-74831-19. [2] Gebbert, S., Pebesma, E., 2014. A temporal GIS for field based environmental modeling. Environ. Model. Softw. 53, 1-12. [3] Zambelli, P., Gebbert, S., Ciolli, M., 2013. Pygrass: An Object Oriented Python Application Programming Interface (API) for Geographic Resources Analysis Support System (GRASS) Geographic Information System (GIS). ISPRS Intl Journal of Geo-Information 2, 201-219. [4] Löwe, P., Klump, J., Thaler, J., 2012. The FOSS GIS Workbench on the GFZ Load Sharing Facility compute cluster. Geophysical Research Abstracts Vol. 14, EGU2012-4491, General Assembly European Geosciences Union, Vienna, Austria. [5] Akhter, S., Aida, K., Chemin, Y., 2010. GRASS GIS on High Performance Computing with MPI, OpenMP and Ninf-G Programming Framework. ISPRS Conference, Kyoto, 9-12 August 2010.

  5. A Coarse-to-Fine Geometric Scale-Invariant Feature Transform for Large Size High Resolution Satellite Image Registration

    PubMed Central

    Chang, Xueli; Du, Siliang; Li, Yingying; Fang, Shenghui

    2018-01-01

    Large size high resolution (HR) satellite image matching is a challenging task due to local distortion, repetitive structures, intensity changes and low efficiency. In this paper, a novel matching approach is proposed for large size HR satellite image registration, based on a coarse-to-fine strategy and a geometric scale-invariant feature transform (SIFT). In the coarse matching step, a robust matching method, scale-restrict (SR) SIFT, is applied at the low resolution level. The matching results provide geometric constraints which are then used to guide block division and geometric SIFT in the fine matching step. The block matching method overcomes the memory problem for large images. In geometric SIFT, the area constraints help validate candidate matches and decrease search complexity. To further improve matching efficiency, the proposed matching method is parallelized using OpenMP. Finally, the sensing image is rectified to the coordinate system of the reference image via a Triangulated Irregular Network (TIN) transformation. Experiments are designed to test the performance of the proposed matching method. The experimental results show that the proposed method decreases the matching time and increases the number of matching points while maintaining high registration accuracy. PMID:29702589
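
    Since the image blocks produced by the block-division step are independent, the OpenMP parallelisation mentioned above reduces to a parallel loop over blocks. Below is a hedged C sketch of that pattern, with match_block and the Block type as hypothetical stand-ins for the paper's routines.

```c
/* Hedged sketch: block-wise feature matching parallelized with
 * OpenMP, in the spirit of the coarse-to-fine scheme above.
 * match_block() and Block are illustrative placeholders. */
#include <stdio.h>
#include <omp.h>

typedef struct { int x0, y0, w, h; } Block;

static int match_block(const Block *b) {
    /* Placeholder: would run geometric SIFT within one image block
     * and return the number of accepted matches. */
    return b->w * b->h / 1024;
}

int main(void) {
    enum { NBLOCKS = 64 };
    Block blocks[NBLOCKS];
    for (int i = 0; i < NBLOCKS; ++i)
        blocks[i] = (Block){ (i % 8) * 512, (i / 8) * 512, 512, 512 };

    int total = 0;
    /* Blocks are independent, so a parallel-for with a reduction
     * over the match count suffices; dynamic scheduling absorbs
     * blocks with uneven feature density. */
    #pragma omp parallel for reduction(+:total) schedule(dynamic)
    for (int i = 0; i < NBLOCKS; ++i)
        total += match_block(&blocks[i]);

    printf("matches: %d\n", total);
    return 0;
}
```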

  6. Characterization of Proxy Application Performance on Advanced Architectures. UMT2013, MCB, AMG2013

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Howell, Louis H.; Gunney, Brian T.; Bhatele, Abhinav

    2015-10-09

    Three codes were tested at LLNL as part of a Tri-Lab effort to make detailed assessments of several proxy applications on various advanced architectures, with the eventual goal of extending these assessments to codes of programmatic interest running more realistic simulations. Teams from Sandia and Los Alamos tested proxy apps of their own. The focus in this report is on the LLNL codes UMT2013, MCB, and AMG2013. We present weak and strong MPI scaling results and studies of OpenMP efficiency on a large BG/Q system at LLNL, with comparison against similar tests on an Intel Sandy Bridge TLCC2 system. The hardware counters on BG/Q provide detailed information on many aspects of on-node performance, while information from the mpiP tool gives insight into the reasons for the differing scaling behavior on these two different architectures. Results from three more speculative tests are also included: one that exploits NVRAM as extended memory, one that studies performance under a power bound, and one that illustrates the effects of changing the torus network mapping on BG/Q.

  7. Parallel heuristics for scalable community detection

    DOE PAGES

    Lu, Hao; Halappanavar, Mahantesh; Kalyanaraman, Ananth

    2015-08-14

    Community detection has become a fundamental operation in numerous graph-theoretic applications. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed in 2008, the method has become increasingly popular owing to its ability to detect high-modularity community partitions in a fast and memory-efficient manner. However, the method is also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them on real-world graphs derived from multiple application domains. Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with higher modularity for most of the inputs tested, in a comparable number of iterations or fewer, while providing real speedups of up to 16x using 32 threads.
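
    One way to picture the parallelization heuristics described above: within a sweep, every vertex can evaluate its best community move concurrently if some staleness in the community labels is tolerated. The hedged C sketch below shows that kernel shape on a CSR graph; the modularity-gain computation is reduced to a stub, and none of this is the paper's actual implementation.

```c
/* Hedged sketch of a parallel Louvain-style sweep: vertices pick
 * their best neighbouring community concurrently, accepting stale
 * labels, which is one plausible way to break the sequential
 * barrier discussed above. Graph layout and names are illustrative. */
#include <omp.h>

typedef struct {
    int n;            /* number of vertices      */
    const int *off;   /* CSR offsets, length n+1 */
    const int *adj;   /* CSR adjacency list      */
} Graph;

/* Placeholder modularity gain of moving v into community c. */
static double gain(const Graph *g, const int *comm, int v, int c) {
    (void)g; (void)comm;
    return 1.0 / (1.0 + (v > c ? v - c : c - v)); /* illustrative only */
}

void louvain_sweep(const Graph *g, int *comm) {
    #pragma omp parallel for schedule(dynamic, 256)
    for (int v = 0; v < g->n; ++v) {
        int best = comm[v];
        double best_gain = 0.0;
        for (int e = g->off[v]; e < g->off[v + 1]; ++e) {
            int c = comm[g->adj[e]];
            double dq = gain(g, comm, v, c);
            if (dq > best_gain) { best_gain = dq; best = c; }
        }
        comm[v] = best;  /* benign race: later sweeps fix staleness */
    }
}
```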

  8. Parallel peak pruning for scalable SMP contour tree computation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Carr, Hamish A.; Weber, Gunther H.; Sewell, Christopher M.

    As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in the architecture of high performance computing systems necessitate analysis algorithms that make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. In this paper, we report the first shared-memory (SMP) algorithm for fully parallel contour tree computation, with formal guarantees of O(lg n lg t) parallel steps and O(n lg n) work, and implementations with up to 10x parallel speedup in OpenMP and up to 50x speedup in NVIDIA Thrust.

  9. Hybrid multicore/vectorisation technique applied to the elastic wave equation on a staggered grid

    NASA Astrophysics Data System (ADS)

    Titarenko, Sofya; Hildyard, Mark

    2017-07-01

    In modern physics it has become common to find the solution of a problem by solving a set of PDEs numerically. Whether solving them on a finite difference grid or by a finite element approach, the main calculations are often applied to a stencil structure. In the last decade it has become usual to work with so-called big-data problems, where calculations are very heavy and accelerators and modern architectures are widely used. Although CPU and GPU clusters are often used to solve such problems, parallelisation of any calculation ideally starts from single-processor optimisation. Unfortunately, it is impossible to vectorise a stencil-structured loop with high-level instructions. In this paper we suggest a new approach to rearranging the data structure which makes it possible to apply high-level vectorisation instructions to a stencil loop and which results in significant acceleration. The suggested method allows further acceleration if shared-memory APIs are used. We show the effectiveness of the method by applying it to an elastic wave propagation problem on a finite difference grid. We chose Intel architecture and OpenMP (Open Multi-Processing) for the test problem since they are extensively used in many applications.
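
    The end state the paper aims for can be illustrated with a minimal kernel: an inner loop that is unit-stride and free of cross-iteration dependences, so the compiler can vectorise it, combined with OpenMP threading across the outer loop. Below is a hedged C sketch with a toy 4-point stencil; the paper's rearranged data layout for the elastic wave equation is considerably more elaborate, and all names here are illustrative.

```c
/* Hedged sketch: OpenMP threading over rows plus SIMD vectorisation
 * of a unit-stride inner stencil loop, the combination described in
 * the abstract above. Toy 4-point Laplacian-style update. */
#include <omp.h>

#define NX 1024
#define NY 1024

void stencil_step(const float u[NY][NX], float v[NY][NX], float c) {
    #pragma omp parallel for
    for (int j = 1; j < NY - 1; ++j) {
        /* Unit-stride inner loop: amenable to SIMD vectorisation. */
        #pragma omp simd
        for (int i = 1; i < NX - 1; ++i)
            v[j][i] = u[j][i] + c * (u[j][i-1] + u[j][i+1]
                                   + u[j-1][i] + u[j+1][i]
                                   - 4.0f * u[j][i]);
    }
}
```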

  10. Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU.

    PubMed

    Kalantzis, Georgios; Tachibana, Hidenobu

    2014-01-01

    For microdosimetric calculations, event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of these methods is their extensive requirement for computational time. In this work we present an event-by-event MC code for low-projectile-energy electron and proton tracks, for accelerated microdosimetric MC simulations on a graphics processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both the GPU and the multi-core CPU were utilized simultaneously. The two implementation schemes were tested and compared with the sequential single-threaded MC code on the CPU. The performance comparison was based on the speedup for a set of benchmark cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, and the hybrid approach improved this speedup by up to a further 20%. The results indicate the capability of our CPU-GPU implementation to accelerate MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
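
    The CPU side of such a hybrid scheme rests on giving each OpenMP thread its own batch of particle histories and a private random-number stream. Here is a hedged C sketch with the physics reduced to a stub; in the hybrid code described above, a further share of histories would be dispatched to the GPU via CUDA, which is omitted here. The seeding scheme and all names are illustrative, and rand_r is the POSIX re-entrant generator.

```c
/* Hedged sketch: event-by-event MC histories spread over OpenMP
 * threads, each with a private RNG state, as a multi-core CPU
 * baseline of the kind compared above. Physics is a stub. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static double simulate_track(unsigned *seed) {
    /* Placeholder "track": accumulate random energy deposits. */
    double dep = 0.0;
    for (int step = 0; step < 100; ++step)
        dep += (double)rand_r(seed) / RAND_MAX;
    return dep;
}

int main(void) {
    const long nhist = 100000;
    double total = 0.0;

    #pragma omp parallel reduction(+:total)
    {
        /* Private, thread-dependent seed: streams never collide. */
        unsigned seed = 1234u + 977u * (unsigned)omp_get_thread_num();
        #pragma omp for schedule(dynamic, 1000)
        for (long h = 0; h < nhist; ++h)
            total += simulate_track(&seed);
    }
    printf("mean deposit: %f\n", total / nhist);
    return 0;
}
```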

  11. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes

    PubMed Central

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual-socket workstation and a speedup between 3.4 and 3.8 for adding an NVIDIA K80 to a dual-socket workstation. PMID:28582389

  12. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes.

    PubMed

    Einkemmer, Lukas

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual-socket workstation and a speedup between 3.4 and 3.8 for adding an NVIDIA K80 to a dual-socket workstation.

  13. Power-constrained supercomputing

    NASA Astrophysics Data System (ADS)

    Bailey, Peter E.

    As we approach exascale systems, power is turning from an optimization goal to a critical operating constraint. With power bounds imposed by both stakeholders and the limitations of existing infrastructure, achieving practical exascale computing will therefore rely on optimizing performance subject to a power constraint. However, this requirement should not add to the burden of application developers; optimizing the runtime environment given restricted power will primarily be the job of high-performance system software. In this dissertation, we explore this area and develop new techniques that extract maximum performance subject to a particular power constraint. These techniques include a method to find theoretical optimal performance, a runtime system that shifts power in real time to improve performance, and a node-level prediction model for selecting power-efficient operating points. We use a linear programming (LP) formulation to optimize application schedules under various power constraints, where a schedule consists of a DVFS state and number of OpenMP threads for each section of computation between consecutive message passing events. We also provide a more flexible mixed integer-linear (ILP) formulation and show that the resulting schedules closely match schedules from the LP formulation. Across four applications, we use our LP-derived upper bounds to show that current approaches trail optimal, power-constrained performance by up to 41%. This demonstrates limitations of current systems, and our LP formulation provides future optimization approaches with a quantitative optimization target. We also introduce Conductor, a run-time system that intelligently distributes available power to nodes and cores to improve performance. The key techniques used are configuration space exploration and adaptive power balancing. Configuration exploration dynamically selects the optimal thread concurrency level and DVFS state subject to a hardware-enforced power bound. Adaptive power balancing efficiently predicts where critical paths are likely to occur and distributes power to those paths. Greater power, in turn, allows increased thread concurrency levels, CPU frequency/voltage, or both. We describe these techniques in detail and show that, compared to the state-of-the-art technique of using statically predetermined, per-node power caps, Conductor leads to a best-case performance improvement of up to 30%, and an average improvement of 19.1%. At the node level, an accurate power/performance model will aid in selecting the right configuration from a large set of available configurations. We present a novel approach to generate such a model offline using kernel clustering and multivariate linear regression. Our model requires only two iterations to select a configuration, which provides a significant advantage over exhaustive search-based strategies. We apply our model to predict power and performance for different applications using arbitrary configurations, and show that our model, when used with hardware frequency-limiting in a runtime system, selects configurations with significantly higher performance at a given power limit than those chosen by frequency-limiting alone. When applied to a set of 36 computational kernels from a range of applications, our model accurately predicts power and performance; our runtime system based on the model maintains 91% of optimal performance while meeting power constraints 88% of the time. 
When the runtime system violates a power constraint, it exceeds the constraint by only 6% in the average case, while simultaneously achieving 54% more performance than an oracle. Through the combination of the above contributions, we hope to provide guidance and inspiration to research practitioners working on runtime systems for power-constrained environments. We also hope this dissertation will draw attention to the need for software and runtime-controlled power management under power constraints at various levels, from the processor level to the cluster level.
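
    The schedule concept from the LP formulation, a DVFS state and an OpenMP thread count chosen per section of computation, can be illustrated on the threading side alone. The hedged C sketch below runs each phase under its own thread count; the DVFS setting would go through a separate platform interface and is omitted, and the workloads and per-phase counts are made up for illustration.

```c
/* Hedged sketch: applying a per-phase OpenMP thread count, the
 * threading half of the (DVFS state, thread count) schedules
 * described in the dissertation above. Values are illustrative. */
#include <stdio.h>
#include <omp.h>

static void run_phase(int phase, int nthreads) {
    double sum = 0.0;
    /* num_threads() scopes the schedule's concurrency level to
     * this phase only. */
    #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
    for (int i = 0; i < 1000000; ++i)
        sum += (double)i * phase;
    printf("phase %d ran with %d threads (sum=%.0f)\n",
           phase, nthreads, sum);
}

int main(void) {
    /* A "schedule": one thread count per computation section,
     * e.g. as selected by the LP solution. */
    const int schedule[] = { 8, 4, 16 };
    for (int p = 0; p < 3; ++p)
        run_phase(p, schedule[p]);
    return 0;
}
```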

  14. Review of An Introduction to Parallel and Vector Scientific Computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bailey, David H.; Lefton, Lew

    2006-06-30

    On one hand, the field of high-performance scientific computing is thriving beyond measure. Performance of leading-edge systems on scientific calculations, as measured say by the Top500 list, has increased by an astounding factor of 8000 during the 15-year period from 1993 to 2008, which is slightly faster even than Moore's Law. Even more importantly, remarkable advances in numerical algorithms, numerical libraries and parallel programming environments have led to improvements in the scope of what can be computed that are entirely on a par with the advances in computing hardware. And these successes have spread far beyond the confines of large government-operated laboratories: many universities, modest-sized research institutes and private firms now operate clusters that differ only in scale from the behemoth systems at the large-scale facilities. In the wake of these recent successes, researchers from fields that heretofore have not been part of the scientific computing world have been drawn into the arena. For example, at the recent SC07 conference, the exhibit hall, which long has hosted displays from leading computer systems vendors and government laboratories, featured some 70 exhibitors who had not previously participated. In spite of all these exciting developments, and in spite of the clear need to present these concepts to a much broader technical audience, there is a perplexing dearth of training material and textbooks in the field, particularly at the introductory level. Only a handful of universities offer coursework in the specific area of highly parallel scientific computing, and instructors of such courses typically rely on custom-assembled material. For example, the present reviewer and Robert F. Lucas relied on materials assembled in a somewhat ad-hoc fashion from colleagues and personal resources when presenting a course on parallel scientific computing at the University of California, Berkeley, a few years ago. Thus it is indeed refreshing to see the publication of the book An Introduction to Parallel and Vector Scientific Computing, written by Ronald W. Shonkwiler and Lew Lefton, both of the Georgia Institute of Technology. They have taken the bull by the horns and produced a book that appears to be entirely satisfactory as an introductory textbook for use in such a course. It is also of interest to the much broader community of researchers who are already in the field, laboring day by day to improve the power and performance of their numerical simulations. The book is organized into 11 chapters, plus an appendix. The first three chapters describe the basics of system architecture, including vector, parallel and distributed memory systems, the details of task dependence and synchronization, and the various programming models currently in use - threads, MPI and OpenMP. Chapters four through nine provide a competent introduction to floating-point arithmetic, numerical error and numerical linear algebra. Some of the topics presented include Gaussian elimination, LU decomposition, tridiagonal systems, Givens rotations, QR decompositions, Gauss-Seidel iterations and Householder transformations. Chapters 10 and 11 introduce Monte Carlo methods and schemes for discrete optimization such as genetic algorithms.

  15. OFF, Open source Finite volume Fluid dynamics code: A free, high-order solver based on parallel, modular, object-oriented Fortran API

    NASA Astrophysics Data System (ADS)

    Zaghi, S.

    2014-07-01

    OFF, an open source (free software) code for performing fluid dynamics simulations, is presented. The aim of OFF is to solve, numerically, the unsteady (and steady) compressible Navier-Stokes equations of fluid dynamics by means of finite volume techniques: the research background is mainly focused on high-order (WENO) schemes for multi-fluid, multi-phase flows over complex geometries. To this purpose a highly modular, object-oriented application program interface (API) has been developed. In particular, the concepts of data encapsulation and inheritance available within the Fortran language (from standard 2003) have been stressed in order to represent each fluid dynamics "entity" (e.g. the conservative variables of a finite volume, its geometry, etc…) by a single object, so that a large variety of computational libraries can be easily (and efficiently) developed upon these objects. The main features of OFF can be summarized as follows:
    Programming Language: OFF is written in standard (compliant) Fortran 2003; its design is highly modular in order to enhance simplicity of use and maintenance without compromising efficiency.
    Parallel Frameworks Supported: the development of OFF has also been targeted at maximizing computational efficiency; the code is designed to run on shared-memory multi-core workstations and on distributed-memory clusters of shared-memory nodes (supercomputers); the code's parallelization is based on the Open Multiprocessing (OpenMP) and Message Passing Interface (MPI) paradigms.
    Usability, Maintenance and Enhancement: in order to improve the usability, maintenance and enhancement of the code, the documentation has also been carefully taken into account; the documentation is built upon comprehensive comments placed directly in the source files (no external documentation files are needed); these comments are parsed by means of the doxygen free software, producing high-quality HTML and LaTeX documentation pages; the distributed versioning system git has been adopted in order to facilitate the collaborative maintenance and improvement of the code.
    Copyrights: OFF is free software that anyone can use, copy, distribute, study, change and improve under the GNU Public License version 3.
    The present paper is a manifesto of the OFF code and presents the currently implemented features and ongoing developments. This work is focused on the computational techniques adopted, and a detailed description of the main API characteristics is reported. OFF capabilities are demonstrated by means of one- and two-dimensional examples and a three-dimensional real application.

  16. SU-F-I-49: Vendor-Independent, Model-Based Iterative Reconstruction On a Rotating Grid with Coordinate-Descent Optimization for CT Imaging Investigations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Young, S; Hoffman, J; McNitt-Gray, M

    Purpose: Iterative reconstruction methods show promise for improving image quality and lowering the dose in helical CT. We aim to develop a novel model-based reconstruction method that offers potential for dose reduction with reasonable computation speed and storage requirements for vendor-independent reconstruction from clinical data on a normal desktop computer. Methods: In 2012, Xu proposed reconstructing on rotating slices to exploit helical symmetry and reduce the storage requirements for the CT system matrix. Inspired by this concept, we have developed a novel reconstruction method incorporating the stored-system-matrix approach together with iterative coordinate-descent (ICD) optimization. A penalized-least-squares objective function with a quadratic penalty term is solved analytically voxel-by-voxel, sequentially iterating along the axial direction first, followed by the transaxial direction. 8 in-plane (transaxial) neighbors are used for the ICD algorithm. The forward problem is modeled via a unique approach that combines the principle of Joseph's method with trilinear B-spline interpolation to enable accurate reconstruction with low storage requirements. Iterations are accelerated with multi-CPU OpenMP libraries. For preliminary evaluations, we reconstructed (1) a simulated 3D ellipse phantom and (2) an ACR accreditation phantom dataset exported from a clinical scanner (Definition AS, Siemens Healthcare). Image quality was evaluated in the resolution module. Results: Image quality was excellent for the ellipse phantom. For the ACR phantom, image quality was comparable to clinical reconstructions and reconstructions using open-source FreeCT-wFBP software. Also, we did not observe any deleterious impact associated with the utilization of rotating slices. The system matrix storage requirement was only 4.5GB, and reconstruction time was 50 seconds per iteration. Conclusion: Our reconstruction method shows potential for furthering research in low-dose helical CT, in particular as part of our ongoing development of an acquisition/reconstruction pipeline for generating images under a wide range of conditions. Our algorithm will be made available open-source as “FreeCT-ICD”. NIH U01 CA181156; Disclosures (McNitt-Gray): Institutional research agreement, Siemens Healthcare; Past recipient, research grant support, Siemens Healthcare; Consultant, Toshiba America Medical Systems; Consultant, Samsung Electronics.

  17. Finite-Difference Algorithm for Simulating 3D Electromagnetic Wavefields in Conductive Media

    NASA Astrophysics Data System (ADS)

    Aldridge, D. F.; Bartel, L. C.; Knox, H. A.

    2013-12-01

    Electromagnetic (EM) wavefields are routinely used in geophysical exploration for detection and characterization of subsurface geological formations of economic interest. Recorded EM signals depend strongly on the current conductivity of geologic media. Hence, they are particularly useful for inferring fluid content of saturated porous bodies. In order to enhance understanding of field-recorded data, we are developing a numerical algorithm for simulating three-dimensional (3D) EM wave propagation and diffusion in heterogeneous conductive materials. Maxwell's equations are combined with isotropic constitutive relations to obtain a set of six, coupled, first-order partial differential equations governing the electric and magnetic vectors. An advantage of this system is that it does not contain spatial derivatives of the three medium parameters electric permittivity, magnetic permeability, and current conductivity. Numerical solution methodology consists of explicit, time-domain finite-differencing on a 3D staggered rectangular grid. Temporal and spatial FD operators have order 2 and N, where N is user-selectable. We use an artificially-large electric permittivity to maximize the FD timestep, and thus reduce execution time. For the low frequencies typically used in geophysical exploration, accuracy is not unduly compromised. Grid boundary reflections are mitigated via convolutional perfectly matched layers (C-PMLs) imposed at the six grid flanks. A shared-memory-parallel code implementation via OpenMP directives enables rapid algorithm execution on a multi-thread computational platform. Good agreement is obtained in comparisons of numerically-generated data with reference solutions. EM wavefields are sourced via point current density and magnetic dipole vectors. Spatially-extended inductive sources (current carrying wire loops) are under development. We are particularly interested in accurate representation of high-conductivity sub-grid-scale features that are common in industrial environments (borehole casing, pipes, railroad tracks). Present efforts are oriented toward calculating the EM responses of these objects via a First Born Approximation approach. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the US Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
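
    The shared-memory parallelisation named above can be pictured with a small kernel. The hedged C sketch below threads a simplified second-order update of one electric-field component over the outer grid loops with an OpenMP directive; the real algorithm uses six coupled fields on staggered grids, user-selectable operator order N, and C-PML boundaries, and all names here are illustrative.

```c
/* Hedged sketch: OpenMP-threaded explicit FD update of Ex in a
 * conductive medium (2nd-order curl of Hy/Hz plus an Ohmic loss
 * term sigma*Ex), illustrating the directive-based parallelism
 * described above. Grid layout and names are illustrative. */
#include <omp.h>

void update_ex(int nx, int ny, int nz, double dt, double dh,
               const double *sigma, const double *eps,
               double *ex, const double *hy, const double *hz)
{
    #define ID(i, j, k) ((((i) * ny) + (j)) * nz + (k))
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < nx - 1; ++i)
        for (int j = 1; j < ny - 1; ++j)
            for (int k = 1; k < nz - 1; ++k) {
                /* curl_H for Ex: dHz/dy - dHy/dz (2nd order). */
                double curl = (hz[ID(i, j, k)] - hz[ID(i, j - 1, k)]
                             - hy[ID(i, j, k)] + hy[ID(i, j, k - 1)]) / dh;
                ex[ID(i, j, k)] += dt / eps[ID(i, j, k)]
                    * (curl - sigma[ID(i, j, k)] * ex[ID(i, j, k)]);
            }
    #undef ID
}
```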

  18. Hierarchical parallelisation of functional renormalisation group calculations - hp-fRG

    NASA Astrophysics Data System (ADS)

    Rohe, Daniel

    2016-10-01

    The functional renormalisation group (fRG) has evolved into a versatile tool in condensed matter theory for studying important aspects of correlated electron systems. Practical applications of the method often involve a high numerical effort, motivating the question of how far High Performance Computing (HPC) can leverage the approach. In this work we report on a multi-level parallelisation of the underlying computational machinery and show that this can speed up the code by several orders of magnitude. This in turn can extend the applicability of the method to otherwise inaccessible cases. We exploit three levels of parallelisation: distributed computing by means of Message Passing (MPI), shared-memory computing using OpenMP, and vectorisation by means of SIMD units (single-instruction-multiple-data). Results are provided for two distinct HPC platforms, namely the IBM-based BlueGene/Q system JUQUEEN and an Intel Sandy-Bridge-based development cluster. We discuss how certain issues and obstacles were overcome in the course of adapting the code. Most importantly, we conclude that this vast improvement can be accomplished by introducing only moderate changes to the code, such that this strategy may serve as a guideline for other researchers to likewise improve the efficiency of their codes.
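
    The three levels of parallelism listed above can be illustrated with a toy computation. The hedged C sketch below layers MPI (ranks split the index range), OpenMP (threads split each rank's share), and SIMD (the inner loop is marked for vectorisation) on a trivial quadrature; it shows only the layering, not the fRG machinery.

```c
/* Hedged sketch of three-level parallelism (MPI + OpenMP + SIMD)
 * on a toy quadrature for pi, mirroring the layering described in
 * the abstract above. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n = 1L << 24;
    long chunk = n / size, lo = rank * chunk;      /* level 1: MPI */
    long hi = (rank == size - 1) ? n : lo + chunk;

    double local = 0.0;
    /* Level 2: OpenMP threads. Level 3: SIMD inner loop. */
    #pragma omp parallel for simd reduction(+:local)
    for (long i = lo; i < hi; ++i) {
        double x = (i + 0.5) / n;
        local += 4.0 / (1.0 + x * x);   /* integrand of pi */
    }

    double pi = 0.0;  /* combine distributed partial sums */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("pi ~= %.12f\n", pi / n);
    MPI_Finalize();
    return 0;
}
```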

  20. Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bylaska, Eric J.; Jacquelin, Mathias; De Jong, Wibe A.

    2017-10-20

    Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP to exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results for the complete AIMD simulation of a test case of 256 water molecules, which strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained against a cluster of dual-socket Intel® Xeon® E5–2698v3 processors.
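
    The kernel shape named above, multiplication of tall-and-skinny matrices, can be sketched in a few lines. Below is a hedged C illustration of C = A^T B for n-by-m inputs with n >> m, using row-block partitioning with per-thread partial results merged at the end; a production code such as the one described would instead call a threaded BLAS dgemm, and the dimension M = 16 is an arbitrary stand-in.

```c
/* Hedged sketch of a threaded tall-and-skinny product C = A^T * B:
 * A and B are n-by-M with n >> M, so C is a small M-by-M matrix
 * accumulated across the long dimension. Per-thread partials are
 * merged in a critical section. Illustrative only. */
#include <string.h>
#include <omp.h>

#define M 16  /* skinny dimension (arbitrary stand-in) */

void ts_atb(long n, const double (*a)[M], const double (*b)[M],
            double c[M][M])
{
    memset(c, 0, sizeof(double) * M * M);
    #pragma omp parallel
    {
        double part[M][M] = {{0.0}};       /* thread-private partial */
        #pragma omp for nowait
        for (long r = 0; r < n; ++r)       /* long dimension */
            for (int i = 0; i < M; ++i)
                for (int j = 0; j < M; ++j)
                    part[i][j] += a[r][i] * b[r][j];
        #pragma omp critical               /* merge partial sums */
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < M; ++j)
                c[i][j] += part[i][j];
    }
}
```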

  1. Snowflake: A Lightweight Portable Stencil DSL

    DOE PAGES

    Zhang, Nathan; Driscoll, Michael; Markley, Charles; ...

    2017-05-01

    Stencil computations are not well optimized by general-purpose production compilers and the increased use of multicore, manycore, and accelerator-based systems makes the optimization problem even more challenging. In this paper we present Snowflake, a Domain Specific Language (DSL) for stencils that uses a 'micro-compiler' approach, i.e., small, focused, domain-specific code generators. The approach is similar to that used in image processing stencils, but Snowflake handles the much more complex stencils that arise in scientific computing, including complex boundary conditions, higher-order operators (larger stencils), higher dimensions, variable coefficients, non-unit-stride iteration spaces, and multiple input or output meshes. Snowflake is embedded in the Python language, allowing it to interoperate with popular scientific tools like SciPy and iPython; it also takes advantage of built-in Python libraries for powerful dependence analysis as part of a just-in-time compiler. We demonstrate the power of the Snowflake language and the micro-compiler approach with a complex scientific benchmark, HPGMG, that exercises the generality of stencil support in Snowflake. By generating OpenMP comparable to, and OpenCL within a factor of 2x of hand-optimized HPGMG, Snowflake demonstrates that a micro-compiler can support diverse processor architectures and is performance-competitive whilst preserving a high-level Python implementation.

  2. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

    PubMed Central

    Azad, Ariful; Ouzounis, Christos A; Kyrpides, Nikos C; Buluç, Aydin

    2018-01-01

    Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. Here, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ∼70 million nodes with ∼68 billion edges in ∼2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license. PMID:29315405

  3. Dynamic Multiple Work Stealing Strategy for Flexible Load Balancing

    NASA Astrophysics Data System (ADS)

    Adnan; Sato, Mitsuhisa

    Lazy task creation is an efficient method of overcoming the grain-size overhead in parallel computing. Work stealing is an effective load-balancing strategy for parallel computing. In this paper, we present dynamic work-stealing strategies in a lazy-task-creation technique for efficient fine-grain task scheduling. The basic idea is to control load-balancing granularity depending on the number of task parents in a stack. The dynamic-length strategy of work stealing uses run-time information, namely the load of the victim, to determine the number of tasks that a thief is allowed to steal. We compare it with the bottommost-first work-stealing strategy used in StackThreads/MP, and with the fixed-length strategy of work stealing, where a thief requests to steal a fixed number of tasks, as well as with other multithreaded frameworks such as the Cilk and OpenMP task implementations. The experiments show that the dynamic-length strategy performs well on irregular workloads such as the UTS benchmarks, as well as on regular workloads such as Fibonacci, Strassen's matrix multiplication, FFT, and sparse LU factorization. The dynamic-length strategy works better than the fixed-length strategy because it is more flexible: it can avoid the load imbalance caused by over-stealing.
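
    As a hedged illustration of the dynamic-length idea (the paper's exact formula is not reproduced here), the C sketch below computes the steal amount from the victim's load at request time, using a simple steal-half policy as one plausible instance; deque management and synchronisation are omitted, and all names are illustrative.

```c
/* Hedged sketch: a dynamic-length steal policy where the number of
 * stolen tasks depends on the victim's load at steal time, in the
 * spirit of the strategy described above. Steal-half is one
 * plausible policy, not necessarily the paper's. */
#include <stdio.h>

/* Victim exposes how many stealable tasks sit in its deque. */
static int victim_load(void) { return 13; /* stub */ }

/* Steal half of the victim's tasks (at least one when any exist):
 * a loaded victim sheds more work per request, a lightly loaded
 * one is not over-stolen. */
static int steal_count(int load) {
    if (load <= 0) return 0;
    return load > 1 ? load / 2 : 1;
}

int main(void) {
    int load = victim_load();
    printf("victim holds %d tasks; thief steals %d\n",
           load, steal_count(load));
    return 0;
}
```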

  4. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

    DOE PAGES

    Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.; ...

    2018-01-05

    Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.
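
    For intuition about the per-column work that threads can share in an MCL-style iteration, here is a hedged C sketch of inflation (entrywise power and column renormalisation) followed by pruning of small entries on a dense column-major matrix. HipMCL's actual kernel is a distributed sparse matrix-matrix product, so this toy dense version shows only the thread-level pattern; all names are illustrative.

```c
/* Hedged sketch of one MCL-style inflation-and-prune step, threaded
 * over columns with OpenMP. Dense column-major storage and
 * nonnegative entries are assumed; HipMCL works on distributed
 * sparse matrices instead. */
#include <math.h>
#include <omp.h>

void inflate_prune(double *m, int n, double r, double eps) {
    #pragma omp parallel for
    for (int col = 0; col < n; ++col) {
        double *c = m + (size_t)col * n;   /* column-major layout */
        double s = 0.0;
        for (int i = 0; i < n; ++i) {
            c[i] = pow(c[i], r);           /* inflation */
            s += c[i];
        }
        for (int i = 0; i < n; ++i) {
            c[i] /= s;                     /* renormalise to sum 1 */
            if (c[i] < eps) c[i] = 0.0;    /* prune tiny entries */
        }
    }
}
```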

  6. Hybrid Parallel Contour Trees, Version 1.0

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sewell, Christopher; Fasel, Patricia; Carr, Hamish

    A common operation in scientific visualization is to compute and render a contour of a data set. Given a function of the form f : R^d -> R, a level set is defined as an inverse image f^-1(h) for an isovalue h, and a contour is a single connected component of a level set. The Reeb graph can then be defined to be the result of contracting each contour to a single point, and is well defined for Euclidean spaces or for general manifolds. For simple domains, the graph is guaranteed to be a tree, and is called the contour tree. Analysis can then be performed on the contour tree in order to identify isovalues of particular interest, based on various metrics, and render the corresponding contours, without having to know such isovalues a priori. This code is intended to be the first data-parallel algorithm for computing contour trees. Our implementation will use the portable data-parallel primitives provided by Nvidia’s Thrust library, allowing us to compile the same code for both GPUs and multi-core CPUs. Native OpenMP and purely serial versions of the code will likely also be included. It will also be extended to provide a hybrid data-parallel / distributed algorithm, allowing scaling beyond a single GPU or CPU.

  7. QR-decomposition based SENSE reconstruction using parallel architecture.

    PubMed

    Ullah, Irfan; Nisar, Habab; Raza, Haseeb; Qasim, Malik; Inam, Omair; Omer, Hammad

    2018-04-01

    Magnetic Resonance Imaging (MRI) is a powerful medical imaging technique that provides essential clinical information about the human body. One major limitation of MRI is its long scan time. Implementation of advanced MRI algorithms on a parallel architecture (to exploit inherent parallelism) has a great potential to reduce the scan time. Sensitivity Encoding (SENSE) is a Parallel Magnetic Resonance Imaging (pMRI) algorithm that utilizes receiver coil sensitivities to reconstruct MR images from the acquired under-sampled k-space data. At the heart of SENSE lies inversion of a rectangular encoding matrix. This work presents a novel implementation of a GPU-based SENSE algorithm, which employs QR decomposition for the inversion of the rectangular encoding matrix. For a fair comparison, the performance of the proposed GPU-based SENSE reconstruction is evaluated against single and multicore CPU using OpenMP. Several experiments with various acceleration factors (AFs) are performed using multichannel (8, 12 and 30) phantom and in-vivo human head and cardiac datasets. Experimental results show that the GPU significantly reduces the computation time of SENSE reconstruction as compared to multi-core CPU (approximately 12x speedup) and single-core CPU (approximately 53x speedup) without any degradation in the quality of the reconstructed images.
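
    At the core of the method is solving the overdetermined system E x = y, where E is the tall encoding matrix. The sketch below shows the QR route on the CPU side, using classical Gram-Schmidt for brevity; a production implementation would use Householder-based QR (e.g. LAPACK's geqrf) for numerical stability. All names are illustrative.

```cpp
// Hedged sketch: least-squares solve of E x = y via QR factorization,
// the role QR plays in SENSE. Classical Gram-Schmidt, assuming m >= n
// and full column rank; not the paper's GPU implementation.
#include <cmath>
#include <vector>

// A holds the n columns of E, each of length m; b has length m.
std::vector<double> qr_solve(const std::vector<std::vector<double>>& A,
                             const std::vector<double>& b) {
    const size_t n = A.size(), m = b.size();
    std::vector<std::vector<double>> Q(n, std::vector<double>(m));
    std::vector<std::vector<double>> R(n, std::vector<double>(n, 0.0));
    for (size_t j = 0; j < n; ++j) {            // Gram-Schmidt: E = Q R
        Q[j] = A[j];
        for (size_t k = 0; k < j; ++k) {
            double dot = 0.0;
            for (size_t i = 0; i < m; ++i) dot += Q[k][i] * A[j][i];
            R[k][j] = dot;
            for (size_t i = 0; i < m; ++i) Q[j][i] -= dot * Q[k][i];
        }
        double nrm = 0.0;
        for (size_t i = 0; i < m; ++i) nrm += Q[j][i] * Q[j][i];
        R[j][j] = std::sqrt(nrm);
        for (size_t i = 0; i < m; ++i) Q[j][i] /= R[j][j];
    }
    std::vector<double> x(n);
    for (size_t j = 0; j < n; ++j) {            // x' = Q^T y
        double dot = 0.0;
        for (size_t i = 0; i < m; ++i) dot += Q[j][i] * b[i];
        x[j] = dot;
    }
    for (size_t j = n; j-- > 0; ) {             // back-substitute R x = x'
        for (size_t k = j + 1; k < n; ++k) x[j] -= R[j][k] * x[k];
        x[j] /= R[j][j];
    }
    return x;
}
```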

  8. 3D Elastic Wavefield Tomography

    NASA Astrophysics Data System (ADS)

    Guasch, L.; Warner, M.; Stekl, I.; Umpleby, A.; Shah, N.

    2010-12-01

    Wavefield tomography, or waveform inversion, aims to extract the maximum information from seismic data by matching trace by trace the response of the solid earth to seismic waves using numerical modelling tools. Its first formulation dates from the early 1980s, when Albert Tarantola developed a solid theoretical basis that is still used today with little change. Due to computational limitations, the application of the method to 3D problems was unaffordable until a few years ago, and then only under the acoustic approximation. Although acoustic wavefield tomography is widely used, a complete solution of the seismic inversion problem requires that we account properly for the physics of wave propagation, and so must include elastic effects. We have developed a 3D tomographic wavefield inversion code that incorporates the full elastic wave equation. The bottleneck of the different implementations is the forward modelling algorithm that generates the synthetic data to be compared with the field seismograms, as well as the backpropagation of the residuals needed to form the direction update of the model parameters. Furthermore, one or two extra modelling runs are needed in order to calculate the step-length. Our approach uses an explicit time-stepping finite-difference (FD) scheme that is 4th order in space and 2nd order in time, a 3D version of the one developed by Jean Virieux in 1986. We chose the time domain because an explicit time scheme is much less demanding in terms of memory than its frequency domain analogue, although the discussion of which domain is more efficient still remains open. We calculate the parameter gradients for Vp and Vs by correlating the normal and shear stress wavefields respectively. A straightforward application would lead to the storage of the wavefield at all grid points at each time-step. We tackled this problem using two different approaches. The first one makes better use of resources for small models of dimension equal to or less than 300x300x300 nodes, and it under-samples the wavefield, reducing the number of stored time-steps by an order of magnitude. For bigger models the wavefield is stored only at the boundaries of the model and then re-injected while the residuals are backpropagated, allowing us to compute the correlation 'on the fly'. In terms of computational resources, the elastic code is an order of magnitude more demanding than the equivalent acoustic code. We have combined shared memory with distributed memory parallelisation using OpenMP and MPI respectively. Thus, we take advantage of the increasingly common multi-core architecture processors. We have successfully applied our inversion algorithm to different realistic complex 3D models. The models had non-linear relations between pressure and shear wave velocities. The shorter wavelengths of the shear waves improve the resolution of the images obtained with respect to a purely acoustic approach.
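
    As a concrete picture of the scheme described above, here is a hedged 1-D scalar analogue of the update: explicit leapfrog time-stepping, 4th order in space and 2nd order in time, with the spatial loop shared among OpenMP threads (the 3-D elastic code additionally partitions the grid across MPI ranks and works on staggered velocity/stress fields).

```cpp
// Hedged 1-D scalar analogue of the FD scheme: 4th-order central
// differences in space, 2nd-order leapfrog in time. Boundary and source
// handling are omitted; c2 holds the squared wave speed per node.
#include <vector>

void step(const std::vector<double>& u_prev, const std::vector<double>& u,
          std::vector<double>& u_next, const std::vector<double>& c2,
          double dt, double dx) {
    const double r = dt * dt / (dx * dx);
    const long n = (long)u.size();
    #pragma omp parallel for
    for (long i = 2; i < n - 2; ++i) {
        // 4th-order approximation of u_xx (scaled by dx^2)
        double lap = (-u[i-2] + 16.0*u[i-1] - 30.0*u[i]
                      + 16.0*u[i+1] - u[i+2]) / 12.0;
        u_next[i] = 2.0*u[i] - u_prev[i] + c2[i] * r * lap; // leapfrog update
    }
}
```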

  9. PhysiCell: An open source physics-based cell simulator for 3-D multicellular systems.

    PubMed

    Ghaffarizadeh, Ahmadreza; Heiland, Randy; Friedman, Samuel H; Mumenthaler, Shannon M; Macklin, Paul

    2018-02-01

    Many multicellular systems problems can only be understood by studying how cells move, grow, divide, interact, and die. Tissue-scale dynamics emerge from systems of many interacting cells as they respond to and influence their microenvironment. The ideal "virtual laboratory" for such multicellular systems simulates both the biochemical microenvironment (the "stage") and many mechanically and biochemically interacting cells (the "players" upon the stage). PhysiCell, a physics-based multicellular simulator, is an open source agent-based simulator that provides both the stage and the players for studying many interacting cells in dynamic tissue microenvironments. It builds upon a multi-substrate biotransport solver to link cell phenotype to multiple diffusing substrates and signaling factors. It includes biologically-driven sub-models for cell cycling, apoptosis, necrosis, solid and fluid volume changes, mechanics, and motility "out of the box." The C++ code has minimal dependencies, making it simple to maintain and deploy across platforms. PhysiCell has been parallelized with OpenMP, and its performance scales linearly with the number of cells. Simulations up to 10^5-10^6 cells are feasible on quad-core desktop workstations; larger simulations are attainable on single HPC compute nodes. We demonstrate PhysiCell by simulating the impact of necrotic core biomechanics, 3-D geometry, and stochasticity on the dynamics of hanging drop tumor spheroids and ductal carcinoma in situ (DCIS) of the breast. We demonstrate stochastic motility, chemical and contact-based interaction of multiple cell types, and the extensibility of PhysiCell with examples in synthetic multicellular systems (a "cellular cargo delivery" system, with application to anti-cancer treatments), cancer heterogeneity, and cancer immunology. PhysiCell is a powerful multicellular systems simulator that will be continually improved with new capabilities and performance improvements. It also represents a significant independent code base for replicating results from other simulation platforms. The PhysiCell source code, examples, documentation, and support are available under the BSD license at http://PhysiCell.MathCancer.org and http://PhysiCell.sf.net.
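
    The OpenMP parallelization the abstract mentions amounts to independent per-cell updates inside each simulation step. The sketch below shows that pattern only; `Cell` and its fields are illustrative stand-ins, not PhysiCell's actual API.

```cpp
// Hedged sketch of a per-step agent update loop parallelized with OpenMP.
// PhysiCell's real sub-models (cycling, volume, mechanics, motility) are
// far richer; here a single relaxation-to-target-volume rule stands in.
#include <vector>

struct Cell { double x, y, z, volume, target_volume; };

void update_cells(std::vector<Cell>& cells, double dt) {
    const double rate = 0.05;                 // illustrative relaxation rate
    #pragma omp parallel for
    for (long i = 0; i < (long)cells.size(); ++i) {
        Cell& c = cells[i];
        c.volume += dt * rate * (c.target_volume - c.volume); // volume sub-model
        // mechanics and motility updates would modify c.x, c.y, c.z here
    }
}
```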

  10. FY17 CSSE L2 Milestone Report: Analyzing Power Usage Characteristics of Workloads Running on Trinity.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pedretti, Kevin

    This report summarizes the work performed as part of a FY17 CSSE L2 milestone to investigate the power usage behavior of ASC workloads running on the ATS-1 Trinity platform. Techniques were developed to instrument application code regions of interest using the Power API together with the Kokkos profiling interface and Caliper annotation library. Experiments were performed to understand the power usage behavior of mini-applications and the SNL/ATDM SPARC application running on ATS-1 Trinity Haswell and Knights Landing compute nodes. A taxonomy of power measurement approaches was identified and presented, providing a guide for application developers to follow. Controlled scaling study experiments were performed on up to 2048 nodes of Trinity along with smaller scale experiments on Trinity testbed systems. Additionally, power and energy system monitoring information from Trinity was collected and archived for post analysis of "in-the-wild" workloads. Results were analyzed to assess the sensitivity of the workloads to ATS-1 compute node type (Haswell vs. Knights Landing), CPU frequency control, node-level power capping control, OpenMP configuration, Knights Landing on-package memory configuration, and algorithm/solver configuration. Overall, this milestone lays groundwork for addressing the long-term goal of determining how to best use and operate future ASC platforms to achieve the greatest benefit subject to a constrained power budget.

  11. A GPU-accelerated semi-implicit fractional step method for numerical solutions of incompressible Navier-Stokes equations

    NASA Astrophysics Data System (ADS)

    Ha, Sanghyun; Park, Junshin; You, Donghyun

    2017-11-01

    Utility of the computational power of modern Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. Due to its serial and bandwidth-bound nature, the present choice of numerical methods is considered to be a good candidate for evaluating the potential of GPUs for solving Navier-Stokes equations using non-explicit time integration. An efficient algorithm is presented for GPU acceleration of the Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution method used in the semi-implicit fractional-step method. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Extension to multiple NVIDIA GPUs is implemented using NVLink supported by the Pascal architecture. Performance of the present method is evaluated on multiple Tesla P100 GPUs and compared with a single-core Xeon E5-2650 v4 CPU in simulations of boundary-layer flow over a flat plate. Supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (Ministry of Science, ICT and Future Planning NRF-2016R1E1A2A01939553, NRF-2014R1A2A1A11049599, and Ministry of Trade, Industry and Energy 201611101000230).
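
    For reference, the ADI step that the paper accelerates reduces, per grid line, to a tridiagonal solve. The sketch below is a hedged CPU analogue: the Thomas algorithm applied to many independent lines, distributed over OpenMP threads (constant coefficients assumed for brevity; the first and last rows are taken to lack the sub- and super-diagonal terms respectively).

```cpp
// Hedged CPU analogue of one ADI sweep: each grid line is an independent
// tridiagonal system a*x[i-1] + b*x[i] + c*x[i+1] = d[i], solved with the
// Thomas algorithm; lines are distributed across OpenMP threads.
#include <vector>

void adi_sweep(std::vector<std::vector<double>>& d,  // RHS per line, overwritten with solution
               double a, double b, double c) {
    #pragma omp parallel for
    for (long l = 0; l < (long)d.size(); ++l) {
        std::vector<double>& x = d[l];
        const size_t n = x.size();
        std::vector<double> cp(n);               // modified super-diagonal
        cp[0] = c / b;
        x[0] /= b;
        for (size_t i = 1; i < n; ++i) {         // forward elimination
            double m = b - a * cp[i - 1];
            cp[i] = c / m;
            x[i] = (x[i] - a * x[i - 1]) / m;
        }
        for (size_t i = n - 1; i-- > 0; )        // back substitution
            x[i] -= cp[i] * x[i + 1];
    }
}
```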

  12. Use of program logic models in the Southern Rural Access Program evaluation.

    PubMed

    Pathman, Donald; Thaker, Samruddhi; Ricketts, Thomas C; Albright, Jennifer B

    2003-01-01

    The Southern Rural Access Program (SRAP) evaluation team used program logic models to clarify grantees' activities, objectives, and timelines. This information was used to benchmark data from grantees' progress reports to assess the program's successes. This article presents a brief background on the use of program logic models (essentially charts or diagrams specifying a program's planned activities, objectives, and goals) for evaluating and managing a program. It discusses the structure of the logic models chosen for the SRAP and how the model concept was introduced to the grantees to promote acceptance and use of the models. The article describes how the models helped clarify the program's objectives and helped lead agencies plan and manage the many program initiatives and subcontractors in their states. Models also provided a framework for grantees to report their progress to the National Program Office and evaluators and promoted the evaluators' visibility and acceptance by the grantees. The logic models, however, increased grantees' reporting requirements and demanded substantial time from the evaluators. Program logic models, on balance, proved their merit in the SRAP through their contributions to its management and evaluation and by providing a better understanding of the program's initiatives, successes, and potential impact.

  13. Roofline Analysis in the Intel® Advisor to Deliver Optimized Performance for applications on Intel® Xeon Phi™ Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Koskela, Tuomas S.; Lobet, Mathieu; Deslippe, Jack

    In this session we show, in two case studies, how the roofline feature of Intel Advisor has been utilized to optimize the performance of kernels of the XGC1 and PICSAR codes in preparation for the Intel Knights Landing architecture. The impact of the implemented optimizations and the benefits of using the automatic roofline feature of Intel Advisor to study performance of large applications will be presented. This demonstrates an effective optimization strategy that has enabled these science applications to achieve up to 4.6 times speed-up and prepare for future exascale architectures. # Goal/Relevance of Session The roofline model [1,2] is a powerful tool for analyzing the performance of applications with respect to the theoretical peak achievable on a given computer architecture. It allows one to graphically represent the performance of an application in terms of operational intensity, i.e. the ratio of flops performed to bytes moved from memory, in order to guide optimization efforts. Given the scale and complexity of modern science applications, it can often be a tedious task for the user to perform the analysis on the level of functions or loops to identify where performance gains can be made. With new Intel tools, it is now possible to automate this task, as well as base the estimates of peak performance on measurements rather than vendor specifications. The goal of this session is to demonstrate how the roofline feature of Intel Advisor can be used to balance memory vs. computation related optimization efforts and effectively identify performance bottlenecks. A series of typical optimization techniques (cache blocking, structure refactoring, data alignment, and vectorization), illustrated by the kernel cases, will be addressed. # Description of the codes ## XGC1 The XGC1 code [3] is a magnetic fusion Particle-In-Cell code that uses an unstructured mesh for its Poisson solver, which allows it to accurately resolve the edge plasma of a magnetic fusion device. After recent optimizations to its collision kernel [4], most of the computing time is spent in the electron push (pushe) kernel, where these optimization efforts have been focused. The kernel code scaled well with MPI+OpenMP but had almost no automatic compiler vectorization, in part due to indirect memory addresses and in part due to low trip counts of low-level loops that would be candidates for vectorization. Particle blocking and sorting have been implemented to increase trip counts of low-level loops and improve memory locality, and OpenMP directives have been added to vectorize compute-intensive loops that were identified by Advisor. The optimizations have improved the performance of the pushe kernel 2x on Haswell processors and 1.7x on KNL. The KNL node-for-node performance has been brought to within 30% of a NERSC Cori Phase I Haswell node and we expect to bridge this gap by reducing the memory footprint of compute-intensive routines to improve cache reuse. ## PICSAR PICSAR is a Fortran/Python high-performance Particle-In-Cell library targeting MIC architectures, first designed to be coupled with the PIC code WARP for the simulation of laser-matter interaction and particle accelerators. PICSAR also contains a Fortran stand-alone kernel for performance studies and benchmarks. An MPI domain decomposition is used between NUMA domains, and a tile decomposition (cache-blocking) handled by OpenMP has been added for shared-memory parallelism and better cache management.
The so-called current deposition and field gathering steps that compose the PIC time loop constitute major hotspots that have been rewritten to enable more efficient vectorization. Particle communications between tiles and MPI domains have been merged and parallelized. Altogether, these improvements provide speedups of 3.1 for order 1 and 4.6 for order 3 interpolation shape factors on KNL configured in SNC4 quadrant flat mode. Performance is similar between a node of Cori Phase 1 and KNL at order 1, and better on KNL by a factor of 1.6 at order 3 with the considered test case (homogeneous thermal plasma).
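
    The roofline bound itself is one line of arithmetic: attainable performance is the lesser of peak compute and arithmetic intensity times peak memory bandwidth. The sketch below evaluates it for a few intensities; the KNL-like peak numbers are illustrative assumptions, not measurements.

```cpp
// Roofline model: attainable GFLOP/s = min(peak compute, AI * peak bandwidth).
#include <algorithm>
#include <cstdio>

double roofline(double peak_gflops, double peak_gbps, double flops_per_byte) {
    return std::min(peak_gflops, flops_per_byte * peak_gbps);
}

int main() {
    // Assumed KNL-like peaks for illustration: ~2600 GFLOP/s, ~400 GB/s MCDRAM.
    for (double ai : {0.1, 1.0, 6.5, 16.0})
        std::printf("AI = %5.1f flop/byte -> %7.1f GFLOP/s\n",
                    ai, roofline(2600.0, 400.0, ai));
    return 0;
}
```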

  14. PhysiCell: An open source physics-based cell simulator for 3-D multicellular systems

    PubMed Central

    Ghaffarizadeh, Ahmadreza; Mumenthaler, Shannon M.

    2018-01-01

    Many multicellular systems problems can only be understood by studying how cells move, grow, divide, interact, and die. Tissue-scale dynamics emerge from systems of many interacting cells as they respond to and influence their microenvironment. The ideal “virtual laboratory” for such multicellular systems simulates both the biochemical microenvironment (the “stage”) and many mechanically and biochemically interacting cells (the “players” upon the stage). PhysiCell, a physics-based multicellular simulator, is an open source agent-based simulator that provides both the stage and the players for studying many interacting cells in dynamic tissue microenvironments. It builds upon a multi-substrate biotransport solver to link cell phenotype to multiple diffusing substrates and signaling factors. It includes biologically-driven sub-models for cell cycling, apoptosis, necrosis, solid and fluid volume changes, mechanics, and motility “out of the box.” The C++ code has minimal dependencies, making it simple to maintain and deploy across platforms. PhysiCell has been parallelized with OpenMP, and its performance scales linearly with the number of cells. Simulations up to 10^5-10^6 cells are feasible on quad-core desktop workstations; larger simulations are attainable on single HPC compute nodes. We demonstrate PhysiCell by simulating the impact of necrotic core biomechanics, 3-D geometry, and stochasticity on the dynamics of hanging drop tumor spheroids and ductal carcinoma in situ (DCIS) of the breast. We demonstrate stochastic motility, chemical and contact-based interaction of multiple cell types, and the extensibility of PhysiCell with examples in synthetic multicellular systems (a “cellular cargo delivery” system, with application to anti-cancer treatments), cancer heterogeneity, and cancer immunology. PhysiCell is a powerful multicellular systems simulator that will be continually improved with new capabilities and performance improvements. It also represents a significant independent code base for replicating results from other simulation platforms. The PhysiCell source code, examples, documentation, and support are available under the BSD license at http://PhysiCell.MathCancer.org and http://PhysiCell.sf.net. PMID:29474446

  15. Fast and Accurate Support Vector Machines on Large Scale Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vishnu, Abhinav; Narasimhan, Jayenthi; Holder, Larry

    Support Vector Machines (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm, which has become ubiquitous largely due to its high accuracy and obliviousness to dimensionality. The objective of SVM is to find an optimal boundary, also known as a hyperplane, which separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples contribute to the definition of the boundary. However, existing parallel algorithms use the entire dataset for finding the boundary, which is sub-optimal for performance reasons. In this paper, we propose a novel distributed memory algorithm to eliminate the samples which do not contribute to the boundary definition in SVM. We propose several heuristics, which range from early (aggressive) to late (conservative) elimination of the samples, such that the overall time for generating the boundary is reduced considerably. In a few cases, a sample may be eliminated (shrunk) pre-emptively, potentially resulting in an incorrect boundary. We propose a scalable approach to synchronize the necessary data structures such that the proposed algorithm maintains its accuracy. We consider the necessary trade-offs of single/multiple synchronization using in-depth time-space complexity analysis. We implement the proposed algorithm using MPI and compare it with libsvm, the de facto sequential SVM software, which we enhance with OpenMP for multi-core/many-core parallelism. Our proposed approach shows excellent efficiency using up to 4096 processes on several large datasets such as the UCI HIGGS Boson dataset and the Offending URL dataset.
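
    The elimination idea can be pictured with a simple rule: a sample lying far on the correct side of the current boundary is unlikely to be a support vector, so it can be deactivated. The sketch below shows only that conservative form; the paper's heuristics, synchronization, and correctness safeguards go well beyond it, and all names here are illustrative.

```cpp
// Hedged sketch of conservative sample elimination (shrinking) for SVM:
// samples whose functional margin y * f(x) is large are deactivated and
// skipped in later optimization passes.
#include <cstddef>
#include <vector>

struct Sample { std::vector<double> x; int y; bool active = true; };

// Linear decision value f(x) = w . x + b.
double decision(const std::vector<double>& w, double b,
                const std::vector<double>& x) {
    double s = b;
    for (size_t i = 0; i < x.size(); ++i) s += w[i] * x[i];
    return s;
}

void shrink(std::vector<Sample>& data, const std::vector<double>& w,
            double b, double margin_threshold) {
    #pragma omp parallel for
    for (long i = 0; i < (long)data.size(); ++i) {
        double m = data[i].y * decision(w, b, data[i].x);   // functional margin
        if (m > margin_threshold) data[i].active = false;   // unlikely support vector
    }
}
```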

  16. GPU-accelerated algorithms for many-particle continuous-time quantum walks

    NASA Astrophysics Data System (ADS)

    Piccinini, Enrico; Benedetti, Claudia; Siloi, Ilaria; Paris, Matteo G. A.; Bordone, Paolo

    2017-06-01

    Many-particle continuous-time quantum walks (CTQWs) represent a resource for several tasks in quantum technology, including quantum search algorithms and universal quantum computation. In order to design and implement CTQWs in a realistic scenario, one needs effective simulation tools for Hamiltonians that take into account static noise and fluctuations in the lattice, i.e. Hamiltonians containing stochastic terms. To this aim, we suggest a parallel algorithm based on the Taylor series expansion of the evolution operator, and compare its performance with that of algorithms based on the exact diagonalization of the Hamiltonian or a 4th order Runge-Kutta integration. We prove that both the Taylor-series expansion and Runge-Kutta algorithms are reliable and have a low computational cost, the Taylor-series expansion showing the additional advantage of a memory allocation not depending on the precision of calculation. Both algorithms are also highly parallelizable within the SIMT paradigm, and are thus suitable for GPGPU computing. We have benchmarked 4 NVIDIA GPUs and 3 quad-core Intel CPUs for a 2-particle system over lattices of increasing dimension, showing that the speedup provided by GPU computing, with respect to the OpenMP parallelization, lies in the range between 8x and (more than) 20x, depending on the frequency of post-processing. GPU-accelerated codes thus allow one to overcome concerns about the execution time, and make simulations with many interacting particles on large lattices possible, with the only limit being the memory available on the device.
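
    The Taylor-series propagator compared in the paper builds |psi(t+dt)> = sum_k (-i H dt)^k / k! |psi(t)> from repeated matrix-vector products. Below is a hedged dense-matrix CPU sketch with OpenMP inside each product; the paper's GPU version parallelizes the same kernel over device threads, and a real implementation would exploit the Hamiltonian's sparsity.

```cpp
// Hedged sketch of a truncated Taylor-series time step for a CTQW.
// Each order-k term is built from the previous one: term_k = (-i dt / k) H term_{k-1},
// so the running product accumulates (-i H dt)^k / k! without factorials.
#include <complex>
#include <vector>

using cd = std::complex<double>;

std::vector<cd> taylor_step(const std::vector<std::vector<double>>& H,
                            std::vector<cd> psi, double dt, int K) {
    const long n = (long)psi.size();
    std::vector<cd> term = psi;                 // k = 0 term
    for (int k = 1; k <= K; ++k) {
        std::vector<cd> next(n, cd(0, 0));
        #pragma omp parallel for
        for (long i = 0; i < n; ++i)            // dense matrix-vector product
            for (long j = 0; j < n; ++j)
                next[i] += H[i][j] * term[j];
        const cd factor = cd(0, -dt) / double(k);
        for (long i = 0; i < n; ++i) {
            term[i] = factor * next[i];
            psi[i] += term[i];                  // accumulate the series
        }
    }
    return psi;
}
```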

  17. Logic models as a tool for sexual violence prevention program development.

    PubMed

    Hawkins, Stephanie R; Clinton-Sherrod, A Monique; Irvin, Neil; Hart, Laurie; Russell, Sarah Jane

    2009-01-01

    Sexual violence is a growing public health problem, and there is an urgent need to develop sexual violence prevention programs. Logic models have emerged as a vital tool in program development. The Centers for Disease Control and Prevention funded an empowerment evaluation designed to work with programs focused on the prevention of first-time male perpetration of sexual violence, and it included, as one of its goals, the development of program logic models. Two case studies are presented that describe how significant positive changes can be made to programs as a result of their developing logic models that accurately describe desired outcomes. The first case study describes how the logic model development process made an organization aware of the importance of a program's environmental context for program success; the second case study demonstrates how developing a program logic model can elucidate gaps in organizational programming and suggest ways to close those gaps.

  18. Outdoor Program Models: Placing Cooperative Adventure and Adventure Education Models on the Continuum.

    ERIC Educational Resources Information Center

    Guthrie, Steven P.

    In two articles on outdoor programming models, Watters distinguished four models on a continuum ranging from the common adventure model, with minimal organizational structure and leadership control, to the guide service model, in which leaders are autocratic and trips are highly structured. Club programs and instructional programs were in between,…

  19. LDA-Based Unified Topic Modeling for Similar TV User Grouping and TV Program Recommendation.

    PubMed

    Pyo, Shinjee; Kim, Eunhui; Kim, Munchurl

    2015-08-01

    Social TV is a social media service via TV and social networks through which TV users exchange their experiences about TV programs that they are viewing. For social TV service, two technical aspects are envisioned: grouping of similar TV users to create social TV communities and recommending TV programs based on group and personal interests for personalizing TV. In this paper, we propose a unified topic model based on grouping of similar TV users and recommending TV programs as a social TV service. The proposed unified topic model employs two latent Dirichlet allocation (LDA) models. One is a topic model of TV users, and the other is a topic model of the description words for viewed TV programs. The two LDA models are then integrated via a topic proportion parameter for TV programs, which enforces the grouping of similar TV users and associated description words for watched TV programs at the same time in a unified topic modeling framework. The unified model identifies the semantic relation between TV user groups and TV program description word groups so that more meaningful TV program recommendations can be made. The unified topic model also overcomes an item ramp-up problem such that new TV programs can be reliably recommended to TV users. Furthermore, from the topic model of TV users, TV users with similar tastes can be grouped as topics, which can then be recommended as social TV communities. To verify our proposed method of unified topic-modeling-based TV user grouping and TV program recommendation for social TV services, in our experiments, we used real TV viewing history data and electronic program guide data from a seven-month period collected by a TV poll agency. The experimental results show that the proposed unified topic model yields an average precision of 81.4% for 50 topics in TV program recommendation, and its performance is on average 6.5% higher than that of the topic model of TV users only. For TV user prediction with new TV programs, the average prediction precision was 79.6%. We also show the superiority of our proposed model, in terms of both topic modeling performance and recommendation performance, compared to two related topic models: the polylingual topic model and the bilingual topic model.

  20. Generic algorithms for high performance scalable geocomputing

    NASA Astrophysics Data System (ADS)

    de Jong, Kor; Schmitz, Oliver; Karssenberg, Derek

    2016-04-01

    During the last decade, the characteristics of computing hardware have changed a lot. For example, instead of a single general purpose CPU core, personal computers nowadays contain multiple cores per CPU and often general purpose accelerators, like GPUs. Additionally, compute nodes are often grouped together to form clusters or a supercomputer, providing enormous amounts of compute power. For existing earth simulation models to be able to use modern hardware platforms, their compute intensive parts must be rewritten. This can be a major undertaking and may involve many technical challenges. Compute tasks must be distributed over CPU cores, offloaded to hardware accelerators, or distributed to different compute nodes. And ideally, all of this should be done in such a way that the compute task scales well with the hardware resources. This presents two challenges: 1) how to make good use of all the compute resources and 2) how to make these compute resources available for developers of simulation models, who may not (want to) have the required technical background for distributing compute tasks. The first challenge requires the use of specialized technology (e.g.: threads, OpenMP, MPI, OpenCL, CUDA). The second challenge requires the abstraction of the logic handling the distribution of compute tasks from the model-specific logic, hiding the technical details from the model developer. To assist the model developer, we are developing a C++ software library (called Fern) containing algorithms that can use all CPU cores available in a single compute node (distributing tasks over multiple compute nodes will be done at a later stage). The algorithms are grid-based (finite difference) and include local and spatial operations such as convolution filters. The algorithms handle distribution of the compute tasks to CPU cores internally. In the resulting model, the low-level details of how this is done are separated from the model-specific logic representing the modeled system. This contrasts with practices in which code for distributing compute tasks is mixed with model-specific code, and it results in a more maintainable model. For flexibility and efficiency, the algorithms are configurable at compile-time with respect to the following aspects: data type, value type, no-data handling, input value domain handling, and output value range handling. This makes the algorithms usable in very different contexts, without the need for making intrusive changes to existing models when using them. Applications that benefit from using the Fern library include the construction of forward simulation models in (global) hydrology (e.g. PCR-GLOBWB (Van Beek et al. 2011)), ecology, geomorphology, or land use change (e.g. PLUC (Verstegen et al. 2014)) and manipulation of hyper-resolution land surface data such as digital elevation models and remote sensing data. Using the Fern library, we have also created an add-on to the PCRaster Python Framework (Karssenberg et al. 2010) allowing its users to speed up their spatio-temporal models, sometimes by changing just a single line of Python code in their model. In our presentation we will give an overview of the design of the algorithms, providing examples of different contexts where they can be used to replace existing sequential algorithms, including the PCRaster environmental modeling software (www.pcraster.eu). We will show how the algorithms can be configured to behave differently when necessary. References Karssenberg, D., Schmitz, O., Salamon, P., De Jong, K.
and Bierkens, M.F.P., 2010, A software framework for construction of process-based stochastic spatio-temporal models and data assimilation. Environmental Modelling & Software, 25, pp. 489-502. Best Paper Award 2010: Software and Decision Support. Van Beek, L. P. H., Y. Wada, and M. F. P. Bierkens. 2011. Global monthly water stress: 1. Water balance and water availability. Water Resources Research. 47. Verstegen, J. A., D. Karssenberg, F. van der Hilst, and A. P. C. Faaij. 2014. Identifying a land use change cellular automaton by Bayesian data assimilation. Environmental Modelling & Software 53:121-136.
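
    The compile-time configurability described above is naturally expressed with C++ policy templates. The sketch below shows the general pattern for a local grid operation with a pluggable no-data policy; the names are illustrative and not Fern's actual API.

```cpp
// Hedged sketch of a policy-configurable local grid operation: the
// no-data handling is a compile-time template parameter, and the cell
// loop is distributed over the CPU cores of one compute node with OpenMP.
#include <cmath>
#include <vector>

struct NaNIsNoData {                  // policy: NaN cells are no-data
    static bool is_no_data(double v) { return std::isnan(v); }
};

template <typename NoDataPolicy>
std::vector<double> sqrt_grid(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    #pragma omp parallel for
    for (long i = 0; i < (long)in.size(); ++i)
        out[i] = NoDataPolicy::is_no_data(in[i]) ? in[i] : std::sqrt(in[i]);
    return out;
}

// usage: auto result = sqrt_grid<NaNIsNoData>(dem);
```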

  1. Addressing Dynamic Issues of Program Model Checking

    NASA Technical Reports Server (NTRS)

    Lerda, Flavio; Visser, Willem

    2001-01-01

    Model checking real programs has recently become an active research area. Programs however exhibit two characteristics that make model checking difficult: the complexity of their state and the dynamic nature of many programs. Here we address both these issues within the context of the Java PathFinder (JPF) model checker. Firstly, we will show how the state of a Java program can be encoded efficiently and how this encoding can be exploited to improve model checking. Next we show how to use symmetry reductions to alleviate some of the problems introduced by the dynamic nature of Java programs. Lastly, we show how distributed model checking of a dynamic program can be achieved, and furthermore, how dynamic partitions of the state space can improve model checking. We support all our findings with results from applying these techniques within the JPF model checker.

  2. A simulation model for wind energy storage systems. Volume 3: Program descriptions

    NASA Technical Reports Server (NTRS)

    Warren, A. W.; Edsinger, R. W.; Burroughs, J. D.

    1977-01-01

    Program descriptions, flow charts, and program listings for the SIMWEST model generation program, the simulation program, the file maintenance program, and the printer plotter program are given. For Vol 2, see .

  3. Model-based control strategies for systems with constraints of the program type

    NASA Astrophysics Data System (ADS)

    Jarzębowska, Elżbieta

    2006-08-01

    The paper presents a model-based tracking control strategy for constrained mechanical systems. The constraints we consider can be material or non-material; the latter are referred to as program constraints. The program constraint equations represent tasks put upon system motions; they can be differential equations of orders higher than one or two, and may be non-integrable. The tracking control strategy relies upon two dynamic models: a reference model, which is a dynamic model of a system with arbitrary-order differential constraints, and a dynamic control model. The reference model serves as a motion planner, which generates inputs to the dynamic control model. It is based upon the generalized program motion equations (GPME) method. The method enables combining material and program constraints and merging them both into the motion equations. Lagrange's equations with multipliers are a special case of the GPME, since they apply to systems with first-order constraints. Our tracking strategy, referred to as a model reference program motion tracking control strategy, enables tracking of any program motion predefined by the program constraints. It extends "trajectory tracking" to "program motion tracking". We also demonstrate that our tracking strategy can be extended to hybrid program motion/force tracking.

  4. [Documenting a rehabilitation program using a logic model: an advantage to the assessment process].

    PubMed

    Poncet, Frédérique; Swaine, Bonnie; Pradat-Diehl, Pascale

    2017-03-06

    The cognitive and behavioral disorders after brain injury can result in severe limitations of activities and restrictions of participation. An interdisciplinary rehabilitation program was developed in physical medicine and rehabilitation at the Pitié-Salpêtrière Hospital, Paris, France. Clinicians believe this program decreases activity limitations and improves participation in patients, but the program's effectiveness had never been assessed. To do this, we first had to define and describe the program. However, rehabilitation programs are holistic and thus complex, making them difficult to describe. Therefore, to facilitate the evaluation of complex programs, including those for rehabilitation, we illustrate the use of a theoretical logic model, as proposed by Champagne, through the process of documenting a specific complex and interdisciplinary rehabilitation program. Through participatory/collaborative research, the rehabilitation program was analyzed using three "submodels" of the logic model of intervention: the causal model, the intervention model and the program theory model. This should facilitate the evaluation of programs, including those for rehabilitation.

  5. Unified Program Design: Organizing Existing Programming Models, Delivery Options, and Curriculum

    ERIC Educational Resources Information Center

    Rubenstein, Lisa DaVia; Ridgley, Lisa M.

    2017-01-01

    A persistent problem in the field of gifted education has been the lack of categorization and delineation of gifted programming options. To address this issue, we propose Unified Program Design as a structural framework for gifted program models. This framework defines gifted programs as the combination of delivery methods and curriculum models.…

  6. Structure and application of an interface program between a geographic-information system and a ground-water flow model

    USGS Publications Warehouse

    Van Metre, P.C.

    1990-01-01

    A computer-program interface between a geographic-information system and a groundwater flow model links two unrelated software systems for use in developing the flow models. The interface program allows the modeler to compile and manage geographic components of a groundwater model within the geographic information system. A significant savings of time and effort is realized in developing, calibrating, and displaying the groundwater flow model. Four major guidelines were followed in developing the interface program: (1) no changes to the groundwater flow model code were to be made; (2) a data structure was to be designed within the geographic information system that follows the same basic data structure as the groundwater flow model; (3) the interface program was to be flexible enough to support all basic data options available within the model; and (4) the interface program was to be as efficient as possible in terms of computer time used and online-storage space needed. Because some programs in the interface are written in control-program language, the interface will run only on a computer with the PRIMOS operating system. (USGS)

  7. Program management model study

    NASA Technical Reports Server (NTRS)

    Connelly, J. J.; Russell, J. E.; Seline, J. R.; Sumner, N. R., Jr.

    1972-01-01

    Two models, a system performance model and a program assessment model, have been developed to assist NASA management in the evaluation of development alternatives for the Earth Observations Program. Two computer models were developed and demonstrated on the Goddard Space Flight Center Computer Facility. Procedures have been outlined to guide the user of the models through specific evaluation processes, and the preparation of inputs describing earth observation needs and earth observation technology. These models are intended to assist NASA in increasing the effectiveness of the overall Earth Observation Program by providing a broader view of system and program development alternatives.

  8. A Linear Programming Model to Optimize Various Objective Functions of a Foundation Type State Support Program.

    ERIC Educational Resources Information Center

    Matzke, Orville R.

    The purpose of this study was to formulate a linear programming model to simulate a foundation type support program and to apply this model to a state support program for the public elementary and secondary school districts in the State of Iowa. The model was successful in producing optimal solutions to five objective functions proposed for…

  9. Program listing for the REEDM (Rocket Exhaust Effluent Diffusion Model) computer program

    NASA Technical Reports Server (NTRS)

    Bjorklund, J. R.; Dumbauld, R. K.; Cheney, C. S.; Geary, H. V.

    1982-01-01

    The program listing for the REEDM Computer Program is provided. A mathematical description of the atmospheric dispersion models, cloud-rise models, and other formulas used in the REEDM model; vehicle and source parameters, other pertinent physical properties of the rocket exhaust cloud and meteorological layering techniques; user's instructions for the REEDM computer program; and worked example problems are contained in NASA CR-3646.

  10. A Framework of Operating Models for Interdisciplinary Research Programs in Clinical Service Organizations

    ERIC Educational Resources Information Center

    King, Gillian; Currie, Melissa; Smith, Linda; Servais, Michelle; McDougall, Janette

    2008-01-01

    A framework of operating models for interdisciplinary research programs in clinical service organizations is presented, consisting of a "clinician-researcher" skill development model, a program evaluation model, a researcher-led knowledge generation model, and a knowledge conduit model. Together, these models comprise a tailored, collaborative…

  11. Using the OIG model compliance programs to fight fraud.

    PubMed

    Lovitky, Jeffrey A; Ahern, Jack

    2002-03-01

    Many healthcare organizations already have implemented compliance programs for their facilities. However, in light of recent fines and continued scrutiny of such programs by the HHS Office of Inspector General (OIG), healthcare organizations should consider reviewing their current programs against the OIG's relevant model compliance program. Although healthcare organizations are not required to adhere strictly to OIG's model programs, they would benefit from ensuring that their programs meet all the OIG's requirements. The common, minimum elements suggested by the OIG model programs include development and distribution of written compliance policies, the designation of a chief compliance officer to manage the program, the development of a corrective action and enforcement system, and the use of audits to monitor compliance. Using these models as guides, healthcare organizations should be better able to avoid the possibility of fraud and abuse within their organizations.

  12. Devil is in the details: Using logic models to investigate program process.

    PubMed

    Peyton, David J; Scicchitano, Michael

    2017-12-01

    Theory-based logic models are commonly developed as part of requirements for grant funding. As a tool to communicate complex social programs, theory-based logic models are an effective visual communication device. However, after initial development, they are often abandoned and remain in their initial form despite changes in the program process. This paper examines the potential benefits of committing time and resources to revising the initial theory-driven logic model and developing detailed logic models that describe key activities, so as to accurately reflect the program and assist in effective program management. The authors use a funded special education teacher preparation program to exemplify the utility of drill-down logic models. The paper concludes with lessons learned from the iterative revision process and suggests how the process can lead to more flexible and calibrated program management.

  13. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi

    Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.

  14. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

    DOE PAGES

    Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; ...

    2016-06-01

    Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
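
    For a flavor of the global-view style, here is a sketch based on published XcalableMP examples for C; the directive spellings are approximate and should be checked against the XMP specification. A template is block-distributed over the node set, arrays are aligned with it, and an annotated loop executes each iteration on the node that owns the data.

```c
/* Hedged global-view XMP sketch in C; directive forms are approximate. */
#define N 1024

#pragma xmp nodes p[*]
#pragma xmp template t[N]
#pragma xmp distribute t[block] onto p

double a[N], b[N];
#pragma xmp align a[i] with t[i]
#pragma xmp align b[i] with t[i]

void scale(double s) {
    /* Each node executes only the iterations whose data it owns. */
#pragma xmp loop on t[i]
    for (int i = 0; i < N; i++)
        a[i] = s * b[i];
}
```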

  15. From Petascale to Exascale: Eight Focus Areas of R&D Challenges for HPC Simulation Environments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Springmeyer, R; Still, C; Schulz, M

    2011-03-17

    Programming models bridge the gap between the underlying hardware architecture and the supporting layers of software available to applications. Programming models are different from both programming languages and application programming interfaces (APIs). Specifically, a programming model is an abstraction of the underlying computer system that allows for the expression of both algorithms and data structures. In comparison, languages and APIs provide implementations of these abstractions and allow the algorithms and data structures to be put into practice - a programming model exists independently of the choice of both the programming language and the supporting APIs. Programming models are typically focused on achieving increased developer productivity, performance, and portability to other system designs. The rapidly changing nature of processor architectures and the complexity of designing an exascale platform provide significant challenges for these goals. Several other factors are likely to impact the design of future programming models. In particular, the representation and management of increasing levels of parallelism, concurrency and memory hierarchies, combined with the ability to maintain a progressive level of interoperability with today's applications are of significant concern. Overall the design of a programming model is inherently tied not only to the underlying hardware architecture, but also to the requirements of applications and libraries including data analysis, visualization, and uncertainty quantification. Furthermore, the successful implementation of a programming model is dependent on exposed features of the runtime software layers and features of the operating system. Successful use of a programming model also requires effective presentation to the software developer within the context of traditional and new software development tools. Consideration must also be given to the impact of programming models on both languages and the associated compiler infrastructure. Exascale programming models must reflect several, often competing, design goals. These design goals include desirable features such as abstraction and separation of concerns. However, some aspects are unique to large-scale computing. For example, interoperability and composability with existing implementations will prove critical. In particular, performance is the essential underlying goal for large-scale systems. A key evaluation metric for exascale models will be the extent to which they support these goals rather than merely enable them.

  16. Testing of a Program Evaluation Model: Final Report.

    ERIC Educational Resources Information Center

    Nagler, Phyllis J.; Marson, Arthur A.

    A program evaluation model developed by Moraine Park Technical Institute (MPTI) is described in this report. Following background material, the four main evaluation criteria employed in the model are identified as program quality, program relevance to community needs, program impact on MPTI, and the transition and growth of MPTI graduates in the…

  17. More-Realistic Digital Modeling of a Human Body

    NASA Technical Reports Server (NTRS)

    Rogge, Renee

    2010-01-01

    A MATLAB computer program has been written to enable improved (relative to an older program) modeling of a human body for purposes of designing space suits and other hardware with which an astronaut must interact. The older program implements a kinematic model based on traditional anthropometric measurements that do provide important volume and surface information. The present program generates a three-dimensional (3D) whole-body model from 3D body-scan data. The program utilizes thin-plate spline theory to reposition the model without need for additional scans.

  18. Advanced very high resolution radiometer

    NASA Technical Reports Server (NTRS)

    1976-01-01

    The advanced very high resolution radiometer development program is considered. The program covered the design, construction, and test of a breadboard model, engineering model, protoflight model, mechanical structural model, and a life test model. Special bench test and calibration equipment was also developed for use on the program.

  19. Testing the Structure of Hydrological Models using Genetic Programming

    NASA Astrophysics Data System (ADS)

    Selle, B.; Muttil, N.

    2009-04-01

    Genetic Programming is able to systematically explore many alternative model structures of different complexity from available input and response data. We hypothesised that genetic programming can be used to test the structure of hydrological models and to identify dominant processes in hydrological systems. To test this, genetic programming was used to analyse a data set from a lysimeter experiment in southeastern Australia. The lysimeter experiment was conducted to quantify the deep percolation response under surface irrigated pasture to different soil types, water table depths and water ponding times during surface irrigation. Using genetic programming, a simple model of deep percolation was consistently evolved in multiple model runs. This simple and interpretable model confirmed the dominant process contributing to deep percolation represented in a conceptual model that was published earlier. Thus, this study shows that genetic programming can be used to evaluate the structure of hydrological models and to gain insight into the dominant processes in hydrological systems.

  20. An OpenACC-Based Unified Programming Model for Multi-accelerator Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S

    2015-01-01

    This paper proposes a novel SPMD programming model of OpenACC. Our model integrates the different granularities of parallelism from vector-level parallelism to node-level parallelism into a single, unified model based on OpenACC. It allows programmers to write programs for multiple accelerators using a uniform programming model whether they are in shared or distributed memory systems. We implement a prototype of our model and evaluate its performance with a GPU-based supercomputer using three benchmark applications.
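
    As background, plain single-device OpenACC looks like the following; this generic SAXPY loop only illustrates the directive style the paper builds on, not the proposed unified multi-accelerator model.

```cpp
// Generic single-accelerator OpenACC example (illustration only).
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float* xp = x.data();
    float* yp = y.data();

    // Offload the loop; copy x in, copy y in and back out.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] += 3.0f * xp[i];            // SAXPY on the accelerator

    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```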

  1. User's manual for LINEAR, a FORTRAN program to derive linear aircraft models

    NASA Technical Reports Server (NTRS)

    Duke, Eugene L.; Patterson, Brian P.; Antoniewicz, Robert F.

    1987-01-01

    This report documents a FORTRAN program that provides a powerful and flexible tool for the linearization of aircraft models. The program LINEAR numerically determines a linear system model using nonlinear equations of motion and a user-supplied nonlinear aerodynamic model. The system model determined by LINEAR consists of matrices for both state and observation equations. The program has been designed to allow easy selection and definition of the state, control, and observation variables to be used in a particular model.

  2. The Spiral-Interactive Program Evaluation Model.

    ERIC Educational Resources Information Center

    Khaleel, Ibrahim Adamu

    1988-01-01

    Describes the spiral interactive program evaluation model, which is designed to evaluate vocational-technical education programs in secondary schools in Nigeria. Program evaluation is defined; utility oriented and process oriented models for evaluation are described; and internal and external evaluative factors and variables that define each…

  3. Puerto Rico water resources planning model program description

    USGS Publications Warehouse

    Moody, D.W.; Maddock, Thomas; Karlinger, M.R.; Lloyd, J.J.

    1973-01-01

    Because the use of the Mathematical Programming System-Extended (MPSX) to solve large linear and mixed integer programs requires the preparation of many input data cards, a matrix generator program that produces the MPSX input data from a much more limited set of data may expedite the use of the mixed integer programming optimization technique. The Model Definition and Control Program (MODCOP) is intended to assist a planner in preparing MPSX input data for the Puerto Rico Water Resources Planning Model. The model utilizes a mixed-integer mathematical program to identify a minimum present cost set of water resources projects (diversions, reservoirs, ground-water fields, desalinization plants, water treatment plants, and inter-basin transfers of water) which will meet a set of future water demands, and to determine their sequence of construction. While MODCOP was specifically written to generate MPSX input data for the planning model described in this report, the program can be easily modified to reflect changes in the model's mathematical structure.

  4. Los Alamos Programming Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bergen, Benjamin Karl

    This is the PDF of a PowerPoint presentation from a teleconference on Los Alamos programming models. It starts by listing the assumptions behind the programming models and then details a hierarchical programming model at the system level and the node level. It then describes how this maps to their internal nomenclature. Finally, it lists what they are currently doing in this regard.

  5. An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes

    DOE PAGES

    Vincenti, H.; Lobet, M.; Lehe, R.; ...

    2016-09-19

    In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈20pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute node that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scatter instructions that can significantly affect vectorization performance on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of ×2 to ×2.5 speed-up in double precision for particle shape factor of orders 1–3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles). Program summary Program Title: vec_deposition Program Files doi:http://dx.doi.org/10.17632/nh77fv9k8c.1 Licensing provisions: BSD 3-Clause Programming language: Fortran 90 External routines/libraries: OpenMP > 4.0 Nature of problem: Exascale architectures will have many-core processors per node with long vector data registers capable of performing one single instruction on multiple data during one clock cycle. Data register lengths are expected to double every four years and this pushes for new portable solutions for efficiently vectorizing Particle-In-Cell codes on these future many-core architectures. One of the main hotspot routines of the PIC algorithm is the current/charge deposition for which there is no efficient and portable vector algorithm. Solution method: Here we provide an efficient and portable vector algorithm of current/charge deposition routines that uses a new data structure, which significantly reduces gather/scatter operations. Vectorization is controlled using OpenMP 4.0 compiler directives for vectorization which ensures portability across different architectures. Restrictions: Here we do not provide the full PIC algorithm with an executable but only vector routines for current/charge deposition. These scalar/vector routines can be used as library routines in your 3D Particle-In-Cell code. However, to get the best performances out of vector routines you have to satisfy the two following requirements: (1) Your code should implement particle tiling (as explained in the manuscript) to allow for maximized cache reuse and reduce memory accesses that can hinder vector performance. The routines can be used directly on each particle tile. 
(2) You should compile your code with a Fortran 90 compiler (e.g Intel, gnu or cray) and provide proper alignment flags and compiler alignment directives (more details in README file).« less
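
    The distributed routines are Fortran 90, but the portability argument rests on two ingredients that are easy to illustrate: OpenMP 4.0 SIMD directives and aligned, contiguous data. Below is a minimal C sketch of that pattern; the function name, the trivially contiguous update, and the 64-byte alignment are illustrative assumptions, not code from vec_deposition.

      #include <stdlib.h>

      #define ALIGN 64   /* assumed cache-line/vector alignment, in bytes */

      /* Illustrative only: accumulate a per-particle quantity into a
         vector-friendly array. `omp simd` asks the compiler to vectorize
         the loop; `aligned` promises ALIGN-byte boundaries so aligned
         vector loads/stores can be issued. */
      void deposit(const double *restrict w, double *restrict rho, int n)
      {
          #pragma omp simd aligned(w, rho : ALIGN)
          for (int i = 0; i < n; i++)
              rho[i] += w[i];           /* contiguous: no gather/scatter */
      }

      int main(void)
      {
          int n = 1024;
          double *w   = aligned_alloc(ALIGN, n * sizeof *w);
          double *rho = aligned_alloc(ALIGN, n * sizeof *rho);
          for (int i = 0; i < n; i++) { w[i] = 1.0; rho[i] = 0.0; }
          deposit(w, rho, n);
          free(w);
          free(rho);
          return 0;
      }

    In a real deposition loop the writes are indirect (particles scatter onto grid cells); the paper's contribution is precisely a data layout that turns those scattered accesses into contiguous, vectorizable ones like the loop above.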

  6. An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vincenti, H.; Lobet, M.; Lehe, R.

  7. The Component Model of Infrastructure: A Practical Approach to Understanding Public Health Program Infrastructure

    PubMed Central

    Snyder, Kimberly; Rieker, Patricia P.

    2014-01-01

    Functioning program infrastructure is necessary for achieving public health outcomes. It is what supports program capacity, implementation, and sustainability. The public health program infrastructure model presented in this article is grounded in data from a broader evaluation of 18 state tobacco control programs and previous work. The newly developed Component Model of Infrastructure (CMI) addresses the limitations of a previous model and contains 5 core components (multilevel leadership, managed resources, engaged data, responsive plans and planning, networked partnerships) and 3 supporting components (strategic understanding, operations, contextual influences). The CMI is a practical, implementation-focused model applicable across public health programs, enabling linkages to capacity, sustainability, and outcome measurement. PMID:24922125

  8. Testing the structure of a hydrological model using Genetic Programming

    NASA Astrophysics Data System (ADS)

    Selle, Benny; Muttil, Nitin

    2011-01-01

    Genetic Programming is able to systematically explore many alternative model structures of different complexity from available input and response data. We hypothesised that Genetic Programming can be used to test the structure of hydrological models and to identify dominant processes in hydrological systems. To test this, Genetic Programming was used to analyse a data set from a lysimeter experiment in southeastern Australia. The lysimeter experiment was conducted to quantify the deep percolation response under surface irrigated pasture to different soil types, watertable depths and water ponding times during surface irrigation. Using Genetic Programming, a simple model of deep percolation was repeatedly evolved across multiple Genetic Programming runs. This simple and interpretable model supported the dominant process contributing to deep percolation that is represented in a previously published conceptual model. Thus, this study shows that Genetic Programming can be used to evaluate the structure of hydrological models and to gain insight into the dominant processes in hydrological systems.
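
    For readers unfamiliar with the technique: Genetic Programming represents each candidate model structure as an expression tree over the input variables and evolves a population of such trees against the data. The minimal C sketch below shows how one candidate structure might be represented and evaluated; the inputs and coefficients are hypothetical stand-ins (e.g., ponding time and watertable depth), not the evolved lysimeter model.

      #include <stdio.h>

      /* One candidate as an expression tree:
         prediction = 0.8 * inputs[0] + 0.2 * inputs[1]  (coefficients invented) */
      typedef enum { ADD, MUL, VAR, CONST } Kind;

      typedef struct Node {
          Kind kind;
          double value;           /* CONST: literal; VAR: input index */
          struct Node *l, *r;     /* children for ADD/MUL */
      } Node;

      static double eval(const Node *n, const double *inputs)
      {
          switch (n->kind) {
          case CONST: return n->value;
          case VAR:   return inputs[(int)n->value];
          case ADD:   return eval(n->l, inputs) + eval(n->r, inputs);
          case MUL:   return eval(n->l, inputs) * eval(n->r, inputs);
          }
          return 0.0;
      }

      int main(void)
      {
          Node x0 = {VAR, 0, 0, 0}, x1 = {VAR, 1, 0, 0};
          Node a  = {CONST, 0.8, 0, 0}, b = {CONST, 0.2, 0, 0};
          Node t1 = {MUL, 0, &a, &x0}, t2 = {MUL, 0, &b, &x1};
          Node model = {ADD, 0, &t1, &t2};

          double in[2] = {10.0, 1.5};   /* hypothetical inputs */
          printf("prediction = %f\n", eval(&model, in));
          return 0;
      }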

  9. Model Checking Abstract PLEXIL Programs with SMART

    NASA Technical Reports Server (NTRS)

    Siminiceanu, Radu I.

    2007-01-01

    We describe a method to automatically generate discrete-state models of abstract Plan Execution Interchange Language (PLEXIL) programs that can be analyzed using model checking tools. Starting from a high-level description of a PLEXIL program or a family of programs with common characteristics, the generator lays the framework that models the principles of program execution. The concrete parts of the program are not automatically generated, but require the modeler to introduce them by hand. As a case study, we generate models to verify properties of the PLEXIL macro constructs that are introduced as shorthand notation. After an exhaustive analysis, we conclude that the macro definitions obey the intended semantics and behave as expected, contingent on a few specific requirements on the timing semantics of micro-steps in the concrete executive implementation.

  10. A Field-Based Curriculum Model for Earth Science Teacher-Preparation Programs.

    ERIC Educational Resources Information Center

    Dubois, David D.

    1979-01-01

    This study proposed a model set of cognitive-behavioral objectives for field-based teacher education programs for earth science teachers. It describes field experience integration into teacher education programs. The model is also applicable for evaluation of earth science teacher education programs. (RE)

  11. Model Entrepreneurship Programs. Special Publication Series No. 53.

    ERIC Educational Resources Information Center

    Ross, Novella; Kurth, Paula

    This collection contains descriptions of model entrepreneurship education programs submitted by 10 states. The first section deals with the Colorado Network of Small Business Programs and the Colorado Entrepreneurship Education Infusion Project. Described next is a model teacher education program in entrepreneurship education that was implemented…

  12. User's guide to the western spruce budworm modeling system

    Treesearch

    Nicholas L. Crookston; J. J. Colbert; Paul W. Thomas; Katharine A. Sheehan; William P. Kemp

    1990-01-01

    The Budworm Modeling System is a set of four computer programs: The Budworm Dynamics Model, the Prognosis-Budworm Dynamics Model, the Prognosis-Budworm Damage Model, and the Parallel Processing-Budworm Dynamics Model. Input to the first three programs and the output produced are described in this guide. A guide to the fourth program will be published separately....

  13. Hybrid MPI-OpenMP Parallelism in the ONETEP Linear-Scaling Electronic Structure Code: Application to the Delamination of Cellulose Nanofibrils.

    PubMed

    Wilkinson, Karl A; Hine, Nicholas D M; Skylaris, Chris-Kriton

    2014-11-11

    We present a hybrid MPI-OpenMP implementation of Linear-Scaling Density Functional Theory within the ONETEP code. We illustrate its performance on a range of high performance computing (HPC) platforms comprising shared-memory nodes with fast interconnect. Our work has focused on applying OpenMP parallelism to the routines which dominate the computational load, attempting where possible to parallelize different loops from those already parallelized within MPI. This includes 3D FFT box operations, sparse matrix algebra operations, calculation of integrals, and Ewald summation. While the underlying numerical methods are unchanged, these developments represent significant changes to the algorithms used within ONETEP to distribute the workload across CPU cores. The new hybrid code exhibits much-improved strong scaling relative to the MPI-only code and permits calculations with a much higher ratio of cores to atoms. These developments result in a significantly shorter time to solution than was possible using MPI alone and facilitate the application of the ONETEP code to systems larger than previously feasible. We illustrate this with benchmark calculations on an amyloid fibril trimer containing 41,907 atoms. We use the code to study the mechanism of delamination of cellulose nanofibrils when undergoing sonication, a process which is controlled by a large number of interactions that collectively determine the structural properties of the fibrils. Many energy evaluations were needed for these simulations, and as these systems comprise up to 21,276 atoms this would not have been feasible without the developments described here.
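
    ONETEP itself is a large Fortran package, but the hybrid decomposition the abstract describes (MPI ranks across nodes, OpenMP threads within each shared-memory node) follows a standard pattern. The following minimal C sketch illustrates that pattern under a toy workload; it is not ONETEP code.

      #include <mpi.h>
      #include <omp.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          /* FUNNELED: each rank spawns OpenMP threads, but only the
             master thread makes MPI calls. */
          int provided;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

          int rank, nranks;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nranks);

          /* Intra-node parallelism: threads share this rank's memory. */
          double local = 0.0;
          #pragma omp parallel for reduction(+ : local)
          for (int i = rank; i < 1000000; i += nranks)
              local += 1.0 / (1.0 + (double)i);

          /* Inter-node parallelism: combine the per-rank partial sums. */
          double global = 0.0;
          MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0)
              printf("sum = %f (%d ranks x up to %d threads each)\n",
                     global, nranks, omp_get_max_threads());

          MPI_Finalize();
          return 0;
      }

    The design point the abstract highlights is to thread loops other than those already decomposed over MPI, so the two levels of parallelism multiply rather than overlap.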

  14. EvoBuild: A Quickstart Toolkit for Programming Agent-Based Models of Evolutionary Processes

    NASA Astrophysics Data System (ADS)

    Wagh, Aditi; Wilensky, Uri

    2018-04-01

    Extensive research has shown that one of the benefits of programming to learn about scientific phenomena is that it facilitates learning about the mechanisms underlying the phenomenon. However, using programming activities in classrooms is associated with costs, such as requiring additional time to learn to program or students needing prior experience with programming. This paper presents a class of programming environments that we call quickstart: environments with a negligible threshold for entry into programming and a modest ceiling. We posit that such environments can provide the benefits of programming for learning without incurring the associated costs for novice programmers. To support this claim, we present a design-based research study conducted to compare programming models of evolutionary processes with a quickstart toolkit against exploring pre-built models of the same processes. The study was conducted in six seventh-grade science classes in two schools. Students in the programming condition used EvoBuild, a quickstart toolkit for programming agent-based models of evolutionary processes, to build their NetLogo models. Students in the exploration condition used pre-built NetLogo models. We demonstrate that although students came from a range of academic backgrounds and had no prior programming experience, and although all students spent the same number of class periods on the activities (including the time taken to learn programming in this environment), EvoBuild students showed greater learning about evolutionary mechanisms. We discuss the implications of this work for design research on programming environments in K-12 science education.

  15. Development of a program to fit data to a new logistic model for microbial growth.

    PubMed

    Fujikawa, Hiroshi; Kano, Yoshihiro

    2009-06-01

    Recently we developed a mathematical model for microbial growth in food. The model successfully predicted microbial growth under various temperature patterns. In this study, we developed a program to fit data to the model using a spreadsheet program, Microsoft Excel. Users can instantly obtain curves fitted to the model by inputting growth data and choosing the slope portion of the curve. The program can also estimate growth parameters, including the rate constant of growth and the lag period. The program should be a useful tool for analyzing growth data and predicting microbial growth.
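
    The published tool is an Excel workbook built around the authors' own logistic model, so the sketch below is only an illustration of the kind of computation involved: assuming log-linear growth over a user-chosen slope window, it estimates a rate constant by least-squares regression of ln N against t, and a lag period as the point where the fitted line crosses the initial level. All data are invented; this is a textbook construction, not the paper's algorithm.

      #include <math.h>
      #include <stdio.h>

      /* Least-squares slope *a and intercept *b of y ~ a*x + b over [lo, hi). */
      static void linfit(const double *x, const double *y, int lo, int hi,
                         double *a, double *b)
      {
          double sx = 0, sy = 0, sxx = 0, sxy = 0;
          int n = hi - lo;
          for (int i = lo; i < hi; i++) {
              sx  += x[i];        sy  += y[i];
              sxx += x[i] * x[i]; sxy += x[i] * y[i];
          }
          *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
          *b = (sy - *a * sx) / n;
      }

      int main(void)
      {
          /* Invented data: time (h) and cell counts. */
          double t[] = {0, 2, 4, 6, 8, 10, 12};
          double N[] = {1e3, 1.1e3, 4e3, 3.2e4, 2.5e5, 1.9e6, 4e6};
          double lnN[7];
          for (int i = 0; i < 7; i++) lnN[i] = log(N[i]);

          /* User-chosen slope portion (exponential phase): points 2..5. */
          double mu, b;
          linfit(t, lnN, 2, 6, &mu, &b);

          /* Lag period: where the fitted line crosses the initial level ln N(0). */
          double lag = (lnN[0] - b) / mu;
          printf("rate constant = %.3f 1/h, lag = %.2f h\n", mu, lag);
          return 0;
      }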

  16. Enhancements to the Branched Lagrangian Transport Modeling System

    USGS Publications Warehouse

    Jobson, Harvey E.

    1997-01-01

    The Branched Lagrangian Transport Model (BLTM) has received wide use within the U.S. Geological Survey over the past 10 years. This report documents the enhancements and modifications that have been made to the modeling system since it was first introduced. The programs in the modeling system are arranged into five levels: programs that generate time series of meteorological data (EQULTMP, SOLAR), programs that process time-series data (INTRP, MRG), programs that build input files for the transport model (BBLTM, BQUAL2E), the model itself with defined reaction kinetics (BLTM, QUAL2E), and post-processor plotting programs (CTPLT, CXPLT). An example application is presented to illustrate how the modeling system can be used to simulate 10 water-quality constituents in the Chattahoochee River below Atlanta, Georgia.

  17. Development of a program logic model and evaluation plan for a participatory ergonomics intervention in construction.

    PubMed

    Jaegers, Lisa; Dale, Ann Marie; Weaver, Nancy; Buchholz, Bryan; Welch, Laura; Evanoff, Bradley

    2014-03-01

    Intervention studies in participatory ergonomics (PE) are often difficult to interpret due to limited descriptions of program planning and evaluation. In an ongoing PE program with floor layers, we developed a logic model to describe our program plan, and process and summative evaluations designed to describe the efficacy of the program. The logic model was a useful tool for describing the program elements and subsequent modifications. The process evaluation measured how well the program was delivered as intended, and revealed the need for program modifications. The summative evaluation provided early measures of the efficacy of the program as delivered. Inadequate information on program delivery may lead to erroneous conclusions about intervention efficacy due to Type III error. A logic model guided the delivery and evaluation of our intervention and provides useful information to aid interpretation of results. © 2013 Wiley Periodicals, Inc.

  18. Development of a Program Logic Model and Evaluation Plan for a Participatory Ergonomics Intervention in Construction

    PubMed Central

    Jaegers, Lisa; Dale, Ann Marie; Weaver, Nancy; Buchholz, Bryan; Welch, Laura; Evanoff, Bradley

    2013-01-01

    Background: Intervention studies in participatory ergonomics (PE) are often difficult to interpret due to limited descriptions of program planning and evaluation. Methods: In an ongoing PE program with floor layers, we developed a logic model to describe our program plan, and process and summative evaluations designed to describe the efficacy of the program. Results: The logic model was a useful tool for describing the program elements and subsequent modifications. The process evaluation measured how well the program was delivered as intended, and revealed the need for program modifications. The summative evaluation provided early measures of the efficacy of the program as delivered. Conclusions: Inadequate information on program delivery may lead to erroneous conclusions about intervention efficacy due to Type III error. A logic model guided the delivery and evaluation of our intervention and provides useful information to aid interpretation of results. PMID:24006097

  19. Fuzzy bi-objective linear programming for portfolio selection problem with magnitude ranking function

    NASA Astrophysics Data System (ADS)

    Kusumawati, Rosita; Subekti, Retno

    2017-04-01

    A fuzzy bi-objective linear programming (FBOLP) model is a bi-objective linear programming model over fuzzy numbers, in which the coefficients of the equations are fuzzy numbers. The model is proposed to solve the portfolio selection problem, generating an asset portfolio with the lowest risk and the highest expected return. The FBOLP model, with normal fuzzy numbers for the risk and expected return of stocks, is transformed into a linear programming (LP) model using a magnitude ranking function.
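
    As a rough sketch of the structure such a model takes (generic notation; the authors' exact formulation may differ), the portfolio program over asset weights x_i, once a magnitude ranking function Mag(.) has mapped each fuzzy coefficient to a crisp value, reads:

      \[
      \begin{aligned}
      \max \quad & Z_1 = \sum_{i=1}^{n} \mathrm{Mag}(\tilde{r}_i)\, x_i
        && \text{(expected portfolio return)} \\
      \min \quad & Z_2 = \sum_{i=1}^{n} \mathrm{Mag}(\tilde{\sigma}_i)\, x_i
        && \text{(portfolio risk)} \\
      \text{s.t.} \quad & \sum_{i=1}^{n} x_i = 1, \qquad x_i \ge 0.
      \end{aligned}
      \]

    The two crisp objectives can then be combined, for example by weighting, into a single-objective LP solvable by standard methods.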

  20. Two Models for Implementing Senior Mentor Programs in Academic Medical Settings

    ERIC Educational Resources Information Center

    Corwin, Sara J.; Bates, Tovah; Cohan, Mary; Bragg, Dawn S.; Roberts, Ellen

    2007-01-01

    This paper compares two models of undergraduate geriatric medical education that utilize senior mentoring programs. A descriptive, comparative multiple-case study was employed, analyzing program documents, archival records, and focus group data. Themes were compared for similarities and differences between the two program models. Findings indicate that…

  1. Programming While Construction of Engineering 3D Models of Complex Geometry

    NASA Astrophysics Data System (ADS)

    Kheyfets, A. L.

    2017-11-01

    The capabilities of constructing geometrically accurate computational 3D models through programming are presented. The construction of models of an architectural arch and a globoid worm gear is considered as an example. The models are designed in the AutoCAD package. Three construction programs are given. The first program designs a multi-section architectural arch; control of the arch's geometry through its main parameters is shown. The second program designs and studies the working surface of a globoid gear's worm; the article shows how to animate the formation of this surface. The third program forms the surface of a worm-gear cavity, and the dynamics of the cavity's formation is studied. The programs are written in the AutoLISP programming language, and the program texts are provided.

  2. Using the Knowledge, Process, Practice (KPP) model for driving the design and development of online postgraduate medical education.

    PubMed

    Shaw, Tim; Barnet, Stewart; Mcgregor, Deborah; Avery, Jennifer

    2015-01-01

    Online learning is a primary delivery method for continuing health education programs. It is critical that programs have curriculum objectives linked to educational models that support learning; using a proven educational modelling process ensures that curriculum objectives are met and that a solid basis for learning and assessment is achieved. The aim was to develop an educational design model that produces an educationally sound program development plan usable by anyone involved in online course development. We describe the development of a generic educational model designed for continuing health education programs. The Knowledge, Process, Practice (KPP) model is founded on recognised educational theory and online education practice. This paper presents a step-by-step guide to using the model for program development that builds in reliable learning and evaluation. The model supports a three-step approach, KPP, based on learning outcomes and supported by appropriate assessment activities. It provides a program structure for online or blended learning that is explicit, educationally defensible, and supports multiple assessment points for health professionals. The KPP model is based on best-practice educational design and uses a structure that can be adapted for a variety of online or flexibly delivered postgraduate medical education programs.

  3. Community Modeling Program for Space Weather: A CCMC Perspective

    NASA Technical Reports Server (NTRS)

    Hesse, Michael

    2009-01-01

    A community modeling program, which provides a forum for exchange and integration between modelers, has excellent potential for furthering our Space Weather modeling and forecasting capabilities. The design of such a program is of great importance to its success. In this presentation, we will argue that the most effective community modeling program should be focused on Space Weather-related objectives, and that it should be open and inclusive. The tremendous successes of prior community research activities further suggest that the most effective implementation of a new community modeling program should be based on community leadership, rather than on domination by individual institutions or centers. This presentation will provide an experience-based justification for these conclusions.

  4. AutoCAD-To-NASTRAN Translator Program

    NASA Technical Reports Server (NTRS)

    Jones, A.

    1989-01-01

    Program facilitates creation of finite-element mathematical models from geometric entities. AutoCAD to NASTRAN translator (ACTON) computer program developed to facilitate quick generation of small finite-element mathematical models for use with NASTRAN finite-element modeling program. Reads geometric data of drawing from Data Exchange File (DXF) used in AutoCAD and other PC-based drafting programs. Written in Microsoft Quick-Basic (Version 2.0).

  5. A sample theory-based logic model to improve program development, implementation, and sustainability of Farm to School programs.

    PubMed

    Ratcliffe, Michelle M

    2012-08-01

    Farm to School programs hold promise to address childhood obesity. These programs may increase students’ access to healthier foods, increase students’ knowledge of and desire to eat these foods, and increase their consumption of them. Implementing Farm to School programs requires the involvement of multiple people, including nutrition services, educators, and food producers. Because these groups have not traditionally worked together and each has different goals, it is important to demonstrate how Farm to School programs that are designed to decrease childhood obesity may also address others’ objectives, such as academic achievement and economic development. A logic model is an effective tool to help articulate a shared vision for how Farm to School programs may work to accomplish multiple goals. Furthermore, there is evidence that programs based on theory are more likely to be effective at changing individuals’ behaviors. Logic models based on theory may help to explain how a program works, aid in efficient and sustained implementation, and support the development of a coherent evaluation plan. This article presents a sample theory-based logic model for Farm to School programs. The presented logic model is informed by the polytheoretical model for food and garden-based education in school settings (PMFGBE). The logic model has been applied to multiple settings, including Farm to School program development and evaluation in urban and rural school districts. This article also includes a brief discussion on the development of the PMFGBE, a detailed explanation of how Farm to School programs may enhance the curricular, physical, and social learning environments of schools, and suggestions for the applicability of the logic model for practitioners, researchers, and policy makers.

  6. Elementary Teacher Training Models.

    ERIC Educational Resources Information Center

    Blewett, Evelyn J., Ed.

    This collection of articles contains descriptions of nine elementary teacher training program models conducted at universities throughout the United States. The articles include the following: (a) "The University of Toledo Model Program," by George E. Dickson; (b) "The Florida State University Model Program," by G. Wesley Sowards; (c) "The…

  7. Life prediction and constitutive models for engine hot section anisotropic materials program

    NASA Technical Reports Server (NTRS)

    Nissley, D. M.; Meyer, T. G.

    1992-01-01

    This report presents the results from a 35-month period of a program designed to develop generic constitutive and life prediction approaches and models for nickel-based single crystal gas turbine airfoils. The program is composed of a base program and an optional program. The base program addresses the high temperature coated single crystal regime above the airfoil root platform. The optional program investigates the low temperature uncoated single crystal regime below the airfoil root platform including the notched conditions of the airfoil attachment. Both base and option programs involve experimental and analytical efforts. Results from uniaxial constitutive and fatigue life experiments of coated and uncoated PWA 1480 single crystal material form the basis for the analytical modeling effort. Four single crystal primary orientations were used in the experiments: (001), (011), (111), and (213). Specific secondary orientations were also selected for the notched experiments in the optional program. Constitutive models for an overlay coating and PWA 1480 single crystal material were developed based on isothermal hysteresis loop data and verified using thermomechanical (TMF) hysteresis loop data. A fatigue life approach and life models were selected for TMF crack initiation of coated PWA 1480. An initial life model used to correlate smooth and notched fatigue data obtained in the option program shows promise. Computer software incorporating the overlay coating and PWA 1480 constitutive models was developed.

  8. Some theoretical models and constructs generic to substance abuse prevention programs for adolescents: possible relevance and limitations for problem gambling.

    PubMed

    Evans, Richard I

    2003-01-01

    For the past several years the author and his colleagues have explored how social psychological constructs and theoretical models can be applied to the prevention of health threatening behaviors in adolescents. In examining the need for the development of gambling prevention programs for adolescents, it might be of value to consider the application of such constructs and theoretical models as a foundation for the development of prevention programs addressing this emerging problem behavior among adolescents. To provide perspective to the reader, the present paper reviews the history of various psychosocial models and constructs generic to programs directed at prevention of substance abuse in adolescents. A brief history of some of these models, those possibly most applicable to gambling prevention programs, is presented. Social inoculation, reasoned action, planned behavior, and problem behavior theory are among those discussed. Some deficits of these models are also articulated. How such models may have relevance to developing programs for prevention of problem gambling in adolescents is also discussed. However, the inherent differences between gambling and more directly health threatening behaviors such as substance abuse must, of course, be seriously considered in utilizing such models. Most current gambling prevention programs have seldom been guided by theoretical models. Developers of gambling prevention programs should consider theoretical foundations, particularly since such foundations not only provide a guide for programs, but may become critical tools in evaluating their effectiveness.

  9. Trained simulated ultrasound patients: medical students as models, learners, and teachers.

    PubMed

    Blickendorf, J Matthew; Adkins, Eric J; Boulger, Creagh; Bahner, David P

    2014-01-01

    Medical educators must develop ultrasound education programs to ensure that future physicians are prepared to face the changing demands of clinical practice. It can be challenging to find human models for hands-on scanning sessions. This article outlines an educational model from a large university medical center that uses medical students to fulfill the need for human models. During the 2011-2012 academic year, medical students from The Ohio State University College of Medicine served as trained simulated ultrasound patients (TSUP) for hands-on scanning sessions held by the college and many residency programs. The extracurricular program is voluntary and coordinated by medical students with faculty supervision. Students receive a longitudinal didactic and hands-on ultrasound education program as an incentive for serving as a TSUP. The College of Medicine and 7 residency programs used the program, which included 47 second-year and 7 first-year student volunteers. Participation has increased annually because of the program's ease, reliability, and cost savings in providing normal anatomic models for ultrasound education programs. A key success of this program is its inherent reproducibility, as a new class of eager students constitutes the volunteer pool each year. The TSUP program is a feasible and sustainable method of fulfilling the need for normal anatomic ultrasound models while serving as a valuable extracurricular ultrasound education program for medical students. The program facilitates the coordination of ultrasound education programs by educators at the undergraduate and graduate levels.

  10. The FITS model office ergonomics program: a model for best practice.

    PubMed

    Chim, Justine M Y

    2014-01-01

    An effective office ergonomics program can predict positive results in reducing musculoskeletal injury rates, enhancing productivity, and improving staff well-being and job satisfaction. Its objective is to provide a systematic solution for managing the risk of musculoskeletal disorders among computer users in an office setting. The FITS Model Office Ergonomics Program was developed by drawing on the legislative requirements for promoting the health and safety of workers who use computers for extended periods, as well as on previous research findings, and according to practical industrial knowledge in ergonomics, occupational health and safety management, and human resources management in Hong Kong and overseas. This paper proposes a comprehensive office ergonomics program, the FITS Model, whose elements are (1) Furniture Evaluation and Selection; (2) Individual Workstation Assessment; (3) Training and Education; and (4) Stretching Exercises and Rest Breaks. An experienced ergonomics practitioner should be included in the program design and implementation. Through the FITS Model Office Ergonomics Program, the risk of musculoskeletal disorders among computer users can be eliminated or minimized, and workplace health and safety and employees' wellness enhanced.

  11. Community College Programs for Older Adults: A Resource Directory of Guidelines, Comprehensive Programming Models, and Selected Programs.

    ERIC Educational Resources Information Center

    Beckman, Brenda Marshall; Ventura-Merkel, Catherine

    In an effort to more effectively disseminate information about community college programs for older adults, this directory was developed for three purposes: to make guidelines available for establishing, expanding, or revising programs; to offer a selection of successful programming models; and to provide a compendium of existing programs. Part I…

  12. Practical Application of Model-based Programming and State-based Architecture to Space Missions

    NASA Technical Reports Server (NTRS)

    Horvath, Gregory; Ingham, Michel; Chung, Seung; Martin, Oliver; Williams, Brian

    2006-01-01

    A viewgraph presentation on developing models from systems engineers that accomplish mission objectives and manage the health of the system is shown. The topics include: 1) Overview; 2) Motivation; 3) Objective/Vision; 4) Approach; 5) Background: The Mission Data System; 6) Background: State-based Control Architecture System; 7) Background: State Analysis; 8) Overview of State Analysis; 9) Background: MDS Software Frameworks; 10) Background: Model-based Programming; 11) Background: Titan Model-based Executive; 12) Model-based Execution Architecture; 13) Compatibility Analysis of MDS and Titan Architectures; 14) Integrating Model-based Programming and Execution into the Architecture; 15) State Analysis and Modeling; 16) IMU Subsystem State Effects Diagram; 17) Titan Subsystem Model: IMU Health; 18) Integrating Model-based Programming and Execution into the Software IMU; 19) Testing Program; 20) Computationally Tractable State Estimation & Fault Diagnosis; 21) Diagnostic Algorithm Performance; 22) Integration and Test Issues; 23) Demonstrated Benefits; and 24) Next Steps

  13. Development of a sustainable community-based dental education program.

    PubMed

    Piskorowski, Wilhelm A; Fitzgerald, Mark; Mastey, Jerry; Krell, Rachel E

    2011-08-01

    Increasing the use of community-based programs is an important trend in improving dental education to meet the needs of students and the public. To support this trend, understanding the history of programs that have established successful models for community-based education is valuable for the creation and development of new programs. The community-based education model of the University of Michigan School of Dentistry (UMSOD) offers a useful guide for understanding the essential steps and challenges involved in developing a successful program. Initial steps in program development were as follows: raising funds, selecting an outreach clinical model, and recruiting clinics to become partners. As the program developed, the challenges of creating a sustainable financial model with the highest educational value required the inclusion of new clinical settings and the creation of a unique revenue-sharing model. Since the beginning of the community-based program at UMSOD in 2000, the number of community partners has increased to twenty-seven clinics, and students have treated thousands of patients in need. Fourth-year students now spend a minimum of ten weeks in community-based clinical education. The community-based program at UMSOD demonstrates the value of service-based education and offers a sustainable model for the development of future programs.

  14. ModelMate - A graphical user interface for model analysis

    USGS Publications Warehouse

    Banta, Edward R.

    2011-01-01

    ModelMate is a graphical user interface designed to facilitate use of model-analysis programs with models. This initial version of ModelMate supports one model-analysis program, UCODE_2005, and one model software program, MODFLOW-2005. ModelMate can be used to prepare input files for UCODE_2005, run UCODE_2005, and display analysis results. A link to the GW_Chart graphing program facilitates visual interpretation of results. ModelMate includes capabilities for organizing directories used with the parallel-processing capabilities of UCODE_2005 and for maintaining files in those directories to be identical to a set of files in a master directory. ModelMate can be used on its own or in conjunction with ModelMuse, a graphical user interface for MODFLOW-2005 and PHAST.

  15. A spatial stochastic programming model for timber and core area management under risk of stand-replacing fire

    Treesearch

    Dung Tuan Nguyen

    2012-01-01

    Forest harvest scheduling has been modeled using deterministic and stochastic programming models. Past models seldom address explicit spatial forest management concerns under the influence of natural disturbances. In this research study, we employ multistage full recourse stochastic programming models to explore the challenges and advantages of building spatial...

  16. Designing a Model of Vocational Training Programs for Disables through ODL

    ERIC Educational Resources Information Center

    Majid, Shaista; Razzak, Adeela

    2015-01-01

    This study was conducted to design a model of vocational training programs for disables. For this purpose a desk review was carried out, and the vocational training models/programs of Israel, the U.K., Vietnam, Japan, and Thailand were analyzed to form a conceptual framework for the model. Keeping in view the local conditions/requirements, a model of…

  17. Development of a Logic Model to Guide Evaluations of the ASCA National Model for School Counseling Programs

    ERIC Educational Resources Information Center

    Martin, Ian; Carey, John

    2014-01-01

    A logic model was developed based on an analysis of the 2012 American School Counselor Association (ASCA) National Model in order to provide direction for program evaluation initiatives. The logic model identified three outcomes (increased student achievement/gap reduction, increased school counseling program resources, and systemic change and…

  18. Emile: Software-Realized Scaffolding for Science Learners Programming in Mixed Media

    NASA Astrophysics Data System (ADS)

    Guzdial, Mark Joseph

    Emile is a computer program that facilitates students using programming to create models of kinematics (physics of motion without forces) and executing these models as simulations. Emile facilitates student programming and model-building with software-realized scaffolding (SRS). Emile integrates a range of SRS and provides mechanisms to fade (or diminish) most scaffolding. By fading Emile's SRS, students can adapt support to their individual needs. Programming in Emile involves graphic and text elements (as compared with more traditional text-based programming). For example, students create graphical objects which can be dragged on the screen, and when dropped, fall as if in a gravitational field. Emile supports a simplified, HyperCard-like mixed media programming framework. Scaffolding is defined as support which enables student performance (called the immediate benefit of scaffolding) and which facilitates student learning (called the lasting benefit of scaffolding). Scaffolding provides this support through three methods: Modeling, coaching, and eliciting articulation. For example, Emile has tools to structure the programming task (modeling), menus identify the next step in the programming and model-building process (coaching), and prompts for student plans and predictions (eliciting articulation). Five students used Emile in a summer workshop (45 hours total) focusing on creating kinematics simulations and multimedia demonstrations. Evaluation of Emile's scaffolding addressed use of scaffolding and the expected immediate and lasting benefits. Emile created records of student interactions (log files) which were analyzed to determine how students used Emile's SRS and how they faded that scaffolding. Student projects and articulations about those projects were analyzed to assess success of student's model-building and programming activities. Clinical interviews were conducted before and after the workshop to determine students' conceptualizations of kinematics and programming and how they changed. The results indicate that students were successful at model-building and programming, learned physics and programming, and used and faded Emile's scaffolding over time. These results are from a small sample who were self -selected and highly-motivated. Nonetheless, this study provides a theory and operationalization for SRS, an example of a successful model-building environment, and a description of student use of mixed media programming.

  19. Energy efficiency in nonprofit agencies: Creating effective program models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Brown, M.A.; Prindle, B.; Scherr, M.I.

    Nonprofit agencies are a critical component of the health and human services system in the US. Programs that offer energy efficiency services to nonprofits have clearly demonstrated that, with minimal investment, nonprofits can reduce their energy consumption by ten to thirty percent. This energy conservation potential motivated the Department of Energy and Oak Ridge National Laboratory to conceive a project to help states develop energy efficiency programs for nonprofits. The purpose of the project was two-fold: (1) to analyze existing programs to determine which design and delivery mechanisms are particularly effective, and (2) to create model programs for states to follow in tailoring their own plans for helping nonprofits with energy efficiency programs. Twelve existing programs were reviewed, and three model programs were devised and put into operation. The model programs provide various forms of financial assistance to nonprofits and serve as a source of information on energy efficiency as well. After examining the results from the model programs (which are still ongoing) and from the existing programs, several "replicability factors" were developed for use in the implementation of programs by other states. These factors, some concrete and practical, others more generalized, serve as guidelines for states devising programs based on their own particular needs and resources.

  20. A Comparison of Three Programming Models for Adaptive Applications

    NASA Technical Reports Server (NTRS)

    Shan, Hong-Zhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswas, Rupak; Kwak, Dochan (Technical Monitor)

    2000-01-01

    We study the performance and programming effort for two major classes of adaptive applications under three leading parallel programming models. We find that all three models can achieve scalable performance on state-of-the-art multiprocessor machines. The basic parallel algorithms needed for the different programming models to deliver their best performance are similar, but the implementations differ greatly, far beyond the use of explicit messages versus implicit loads/stores. Compared with MPI and SHMEM, CC-SAS (cache-coherent shared address space) provides substantial ease of programming at the conceptual and program-orchestration level, which often leads to performance gains. However, it may also suffer from the poor spatial locality of physically distributed shared data on large numbers of processors. Our CC-SAS implementation of the PARMETIS partitioner itself runs faster than under the other two programming models, and generates a more balanced result for our application.

  1. Occupant behavior models: A critical review of implementation and representation approaches in building performance simulation programs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hong, Tianzhen; Chen, Yixing; Belafi, Zsofia

    Occupant behavior (OB) in buildings is a leading factor influencing energy use in buildings. Quantifying this influence requires the integration of OB models with building performance simulation (BPS). This study reviews approaches to representing and implementing OB models in today's popular BPS programs, and discusses the weaknesses and strengths of these approaches and key issues in integrating OB models with BPS programs. Two of the key findings are: (1) a common data model is needed to standardize the representation of OB models, enabling their flexibility and exchange among BPS programs and user applications; the data model can be implemented using a standard syntax (e.g., in the form of an XML schema); and (2) a modular software implementation of OB models, such as functional mock-up units for co-simulation, adopting the common data model, has advantages in providing robust and interoperable integration with multiple BPS programs. Such common OB model representation and implementation approaches help standardize the input structures of OB models, enable collaborative development of a shared library of OB models, and allow for rapid and widespread integration of OB models with BPS programs to improve the simulation of occupant behavior and the quantification of its impact on building performance.

  2. Occupant behavior models: A critical review of implementation and representation approaches in building performance simulation programs

    DOE PAGES

    Hong, Tianzhen; Chen, Yixing; Belafi, Zsofia; ...

    2017-07-27

  3. An Instructional Approach to Modeling in Microevolution.

    ERIC Educational Resources Information Center

    Thompson, Steven R.

    1988-01-01

    Describes an approach to teaching population genetics and evolution and some of the ways models can be used to enhance understanding of the processes being studied. Discusses the instructional plan and the use of models, including utility programs and analysis with models. A BASIC program and sample program outputs are provided. (CW)

  4. Development and validation of a general purpose linearization program for rigid aircraft models

    NASA Technical Reports Server (NTRS)

    Duke, E. L.; Antoniewicz, R. F.

    1985-01-01

    A FORTRAN program that provides the user with a powerful and flexible tool for the linearization of aircraft models is discussed. The program LINEAR numerically determines a linear system model using nonlinear equations of motion and a user-supplied, nonlinear aerodynamic model. The system model determined by LINEAR consists of matrices for both the state and observation equations. The program has been designed to allow easy selection and definition of the state, control, and observation variables to be used in a particular model. Also included in the report is a comparison of linear and nonlinear models for a high-performance aircraft.
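
    In standard notation (a generic statement of numerical linearization, consistent with the abstract rather than quoted from the report): given nonlinear dynamics dx/dt = f(x, u) and observations y = g(x, u), a tool of this kind evaluates the Jacobians about an operating point (x_0, u_0),

      \[
      A = \left.\frac{\partial f}{\partial x}\right|_{(x_0,\,u_0)}, \quad
      B = \left.\frac{\partial f}{\partial u}\right|_{(x_0,\,u_0)}, \quad
      C = \left.\frac{\partial g}{\partial x}\right|_{(x_0,\,u_0)}, \quad
      D = \left.\frac{\partial g}{\partial u}\right|_{(x_0,\,u_0)},
      \]

    yielding the linear state and observation equations \( \dot{\delta x} = A\,\delta x + B\,\delta u \) and \( \delta y = C\,\delta x + D\,\delta u \), which are exactly the matrices the abstract mentions.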

  5. Genetic Programming as Alternative for Predicting Development Effort of Individual Software Projects

    PubMed Central

    Chavoya, Arturo; Lopez-Martin, Cuauhtemoc; Andalon-Garcia, Irma R.; Meda-Campaña, M. E.

    2012-01-01

    Statistical and genetic programming techniques have been used to predict the software development effort of large software projects. In this paper, a genetic programming model was used for predicting the effort required in individually developed projects. Accuracy obtained from a genetic programming model was compared against one generated from the application of a statistical regression model. A sample of 219 projects developed by 71 practitioners was used for generating the two models, whereas another sample of 130 projects developed by 38 practitioners was used for validating them. The models used two kinds of lines of code as well as programming language experience as independent variables. Accuracy results from the model obtained with genetic programming suggest that it could be used to predict the software development effort of individual projects when these projects have been developed in a disciplined manner within a development-controlled environment. PMID:23226305

  6. Milestone-specific, Observed data points for evaluating levels of performance (MODEL) assessment strategy for anesthesiology residency programs.

    PubMed

    Nagy, Christopher J; Fitzgerald, Brian M; Kraus, Gregory P

    2014-01-01

    Anesthesiology residency programs will be expected to have Milestones-based evaluation systems in place by July 2014 as part of the Next Accreditation System. The San Antonio Uniformed Services Health Education Consortium (SAUSHEC) anesthesiology residency program developed and implemented a Milestones-based feedback and evaluation system a year ahead of schedule. It has been named the Milestone-specific, Observed Data points for Evaluating Levels of performance (MODEL) assessment strategy. The "MODEL Menu" and the "MODEL Blueprint" are tools that other anesthesiology residency programs can use in developing their own Milestones-based feedback and evaluation systems prior to ACGME-required implementation. Data from our early experience with the streamlined MODEL blueprint assessment strategy showed substantially improved faculty compliance with reporting requirements. The MODEL assessment strategy provides programs with a workable assessment method for residents, and important Milestones data points to programs for ACGME reporting.

  7. SMP: A solid modeling program version 2.0

    NASA Technical Reports Server (NTRS)

    Randall, D. P.; Jones, K. H.; Vonofenheim, W. H.; Gates, R. L.; Matthews, C. G.

    1986-01-01

    The Solid Modeling Program (SMP) provides the capability to model complex solid objects through the composition of primitive geometric entities. In addition to the construction of solid models, SMP has extensive facilities for model editing, display, and analysis. The geometric model produced by the software system can be output in a format compatible with existing analysis programs such as PATRAN-G. The present version of the SMP software supports six primitives: boxes, cones, spheres, paraboloids, tori, and trusses. The details for creating each of the major primitive types are presented. The analysis capabilities of SMP, including interfaces to existing analysis programs, are discussed.
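
    As an illustration of what a six-primitive catalogue implies for the underlying data structures, a tagged union is the classic representation. The sketch below is hypothetical, not SMP's actual layout; only the sphere and torus parameter sets are filled in.

      #include <stdio.h>

      /* Hypothetical tagged union for SMP-style primitives. */
      typedef enum { BOX, CONE, SPHERE, PARABOLOID, TORUS, TRUSS } PrimKind;

      typedef struct {
          PrimKind kind;
          double origin[3];
          union {
              struct { double dx, dy, dz; } box;
              struct { double radius; } sphere;
              struct { double r_major, r_minor; } torus;
              /* cone, paraboloid, truss parameters would follow */
          } p;
      } Primitive;

      int main(void)
      {
          Primitive s = { .kind = SPHERE, .origin = {0.0, 0.0, 0.0},
                          .p.sphere = { .radius = 2.5 } };
          printf("kind=%d radius=%.1f\n", (int)s.kind, s.p.sphere.radius);
          return 0;
      }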

  8. Modifications of the U.S. Geological Survey modular, finite-difference, ground-water flow model to read and write geographic information system files

    USGS Publications Warehouse

    Orzol, Leonard L.; McGrath, Timothy S.

    1992-01-01

    This report documents modifications to the U.S. Geological Survey modular, three-dimensional, finite-difference, ground-water flow model, commonly called MODFLOW, so that it can read and write files used by a geographic information system (GIS). The modified model program is called MODFLOWARC. Simulation programs such as MODFLOW generally require large amounts of input data and produce large amounts of output data. Viewing data graphically, generating head contours, and creating or editing model data arrays such as hydraulic conductivity are examples of tasks that currently are performed either with independent software packages or by tedious manual editing, manipulation, and transfer of data. GIS software is commonly used to facilitate preparation of the model input data and analysis of model output data; however, auxiliary programs are frequently required to translate data between programs. Data translations are required when different programs use different data formats. Thus, the user might use GIS techniques to create model input data, run a translation program to convert the input data into a format compatible with the ground-water flow model, run the model, run a translation program to convert the model output into the correct format for GIS, and use GIS to display and analyze this output. MODFLOWARC avoids the two translation steps and transfers data directly to and from the ground-water flow model. This report documents the design and use of MODFLOWARC and includes instructions for data input/output of the Basic, Block-centered flow, River, Recharge, Well, Drain, Evapotranspiration, General-head boundary, and Streamflow-routing packages. Modifications to MODFLOW and the Streamflow-Routing package were kept to a minimum. Flow charts and computer-program code describe the modifications to the original computer codes for each of these packages. Appendix A contains a discussion of the operation of MODFLOWARC using a sample problem.

  9. Solid rocket booster performance evaluation model. Volume 2: Users manual

    NASA Technical Reports Server (NTRS)

    1974-01-01

    This users manual for the solid rocket booster performance evaluation model (SRB-II) contains descriptions of the model, the program options, the required program inputs, the program output format and the program error messages. SRB-II is written in FORTRAN and is operational on both the IBM 370/155 and the MSFC UNIVAC 1108 computers.

  10. Workforce Education Models for K-12 STEM Education Programs: Reflections On, and Implications For, the NSF ITEST Program

    ERIC Educational Resources Information Center

    Reider, David; Knestis, Kirk; Malyn-Smith, Joyce

    2016-01-01

    This article proposes a STEM workforce education logic model, tailored to the particular context of the National Science Foundation's Innovative Technology Experiences for Students and Teachers (ITEST) program. This model aims to help program designers and researchers address challenges particular to designing, implementing, and studying education…

  11. Beyond Instruction. Comprehensive Program Planning for Business and Education. Jossey-Bass Business and Management Series.

    ERIC Educational Resources Information Center

    Rothwell, William J.; Cookson, Peter S.

    This book introduces key issues in program planning as practiced in business and educational settings. Two chapters in part 1 introduce two foundational models--Lifelong Education Program Planning (LEPP) model and Contingency-Based Program Planning--and provide background on models designed by Houle, Knowles, Boyle, and Nadler. Parts 2-5 focus on…

  12. The Evolution of Cartography Graduate Programs and the Development of New Graduate Programs in Cartography: An Assessment of Models.

    ERIC Educational Resources Information Center

    Steinke, Theodore R.

    This paper traces the historical development of cartography graduate programs, establishes an evolutionary model, and evaluates the model to determine if it has some utility today for the development of programs capable of producing highly skilled cartographers. Cartography is defined to include traditional cartography, computer cartography,…

  13. Flexible language constructs for large parallel programs

    NASA Technical Reports Server (NTRS)

    Rosing, Matthew; Schnabel, Robert

    1993-01-01

    The goal of the research described is to develop flexible language constructs for writing large data-parallel numerical programs for distributed-memory (MIMD) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include SIMD (Single Instruction Multiple Data), SPMD (Single Program Multiple Data), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by itself seems sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. An overview of a new language that combines many of these programming models in a clean manner is given. This is done in a modular fashion, such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. An overview of the language and a discussion of some of the critical implementation details are given.

  14. Life prediction and constitutive models for engine hot section anisotropic materials program

    NASA Technical Reports Server (NTRS)

    Nissley, D. M.; Meyer, T. G.; Walker, K. P.

    1992-01-01

    This report presents a summary of results from a 7 year program designed to develop generic constitutive and life prediction approaches and models for nickel-based single crystal gas turbine airfoils. The program was composed of a base program and an optional program. The base program addressed the high temperature coated single crystal regime above the airfoil root platform. The optional program investigated the low temperature uncoated single crystal regime below the airfoil root platform including the notched conditions of the airfoil attachment. Both base and option programs involved experimental and analytical efforts. Results from uniaxial constitutive and fatigue life experiments of coated and uncoated PWA 1480 single crystal material formed the basis for the analytical modeling effort. Four single crystal primary orientations were used in the experiments: ⟨001⟩, ⟨011⟩, ⟨111⟩, and ⟨213⟩. Specific secondary orientations were also selected for the notched experiments in the optional program. Constitutive models for an overlay coating and PWA 1480 single crystal materials were developed based on isothermal hysteresis loop data and verified using thermomechanical fatigue (TMF) hysteresis loop data. A fatigue life approach and life models were developed for TMF crack initiation of coated PWA 1480. A life model was developed for smooth and notched fatigue in the option program. Finally, computer software incorporating the overlay coating and PWA 1480 constitutive and life models was developed.

  15. Improving the Impact and Implementation of Disaster Education: Programs for Children Through Theory-Based Evaluation.

    PubMed

    Johnson, Victoria A; Ronan, Kevin R; Johnston, David M; Peace, Robin

    2016-11-01

    A main weakness in the evaluation of disaster education programs for children is evaluators' propensity to judge program effectiveness based on changes in children's knowledge. Few studies have articulated an explicit program theory of how children's education would achieve desired outcomes and impacts related to disaster risk reduction in households and communities. This article describes the advantages of constructing program theory models for the purpose of evaluating disaster education programs for children. Following a review of some potential frameworks for program theory development, including the logic model, the program theory matrix, and the stage step model, the article provides working examples of these frameworks. The first example is the development of a program theory matrix used in an evaluation of ShakeOut, an earthquake drill practiced in two Washington State school districts. The model illustrates a theory of action; specifically, the effectiveness of school earthquake drills in preventing injuries and deaths during disasters. The second example is the development of a stage step model used for a process evaluation of What's the Plan Stan?, a voluntary teaching resource distributed to all New Zealand primary schools for curricular integration of disaster education. The model illustrates a theory of use; specifically, expanding the reach of disaster education for children through increased promotion of the resource. The process of developing the program theory models for the purpose of evaluation planning is discussed, as well as the advantages and shortcomings of the theory-based approaches. © 2015 Society for Risk Analysis.

  16. Nested ocean models: Work in progress

    NASA Technical Reports Server (NTRS)

    Perkins, A. Louise

    1991-01-01

    The ongoing work of combining three existing software programs into a nested grid oceanography model is detailed. The HYPER domain decomposition program, the SPEM ocean modeling program, and a quasi-geostrophic model written in England are being combined into a general ocean modeling facility. This facility will be used to test the viability and the capability of two-way nested grids in the North Atlantic.

  17. AVHRR/1-FM Advanced Very High Resolution Radiometer

    NASA Technical Reports Server (NTRS)

    1979-01-01

    The advanced very high resolution radiometer is discussed. The program covers design, construction, and test of a breadboard model, engineering model, protoflight model, mechanical/structural model, and a life test model. Special bench test and calibration equipment was developed for use on the program. The flight model program objectives were to fabricate, assemble and test four of the advanced very high resolution radiometers along with a bench cooler and collimator.

  18. Semi-Markov adjunction to the Computer-Aided Markov Evaluator (CAME)

    NASA Technical Reports Server (NTRS)

    Rosch, Gene; Hutchins, Monica A.; Leong, Frank J.; Babcock, Philip S., IV

    1988-01-01

    The rule-based Computer-Aided Markov Evaluator (CAME) program was expanded in its ability to incorporate the effect of fault-handling processes into the construction of a reliability model. The fault-handling processes are modeled as semi-Markov events, and CAME constructs an appropriate semi-Markov model. To solve the model, the program outputs it in a form which can be directly solved with the Semi-Markov Unreliability Range Evaluator (SURE) program. As a means of evaluating the alterations made to the CAME program, the program is used to model the reliability of portions of the Integrated Airframe/Propulsion Control System Architecture (IAPSA 2) reference configuration. The reliability predictions are compared with a previous analysis. The results bear out the feasibility of utilizing CAME to generate appropriate semi-Markov models of fault-handling processes.

  19. The importance of socio-economic context for social marketing models for improving reproductive health: evidence from 555 years of program experience.

    PubMed

    Meekers, Dominique; Rahaim, Stephen

    2005-01-27

    Over the past two decades, social marketing programs have become an important element of the national family planning and HIV prevention strategy in several developing countries. As yet, there has not been any comprehensive empirical assessment to determine which of several social marketing models is most effective for a given socio-economic context. Such an assessment is urgently needed to inform the design of future social marketing programs, and to avoid designing programs around an ineffective model. This study addresses this issue using a database of annual statistics about reproductive health oriented social marketing programs in over 70 countries. In total, the database covers 555 years of program experience with social marketing programs that distribute and promote the use of oral contraceptives and condoms. Specifically, our analysis assesses to what extent the model used by different reproductive health social marketing programs has varied across different socio-economic contexts. We then use random effects regression to test in which socio-economic context each of the models is most successful at increasing use of socially marketed oral contraceptives and condoms. The results show that there has been a tendency to design reproductive health social marketing programs with a management structure that matches the local context. However, the evidence also shows that this has not always been the case. While socio-economic context clearly influences the effectiveness of some of the social marketing models, program maturity and the size of the target population appear equally important. To maximize the effectiveness of future social marketing programs, it is essential that more effort is devoted to ensuring that such programs are designed using the model or approach that is most suitable for the local context.

  1. New reflective symmetry design capability in the JPL-IDEAS Structure Optimization Program

    NASA Technical Reports Server (NTRS)

    Strain, D.; Levy, R.

    1986-01-01

    The JPL-IDEAS antenna structure analysis and design optimization computer program was modified to process half structure models of symmetric structures subjected to arbitrary external static loads, synthesize the performance, and optimize the design of the full structure. Significant savings in computation time and cost (more than 50%) were achieved compared to the cost of full model computer runs. The addition of the new reflective symmetry analysis design capabilities to the IDEAS program allows processing of structure models whose size would otherwise prevent automated design optimization. The new program produced synthesized full model iterative design results identical to those of actual full model program executions at substantially reduced cost, time, and computer storage.

  2. PDF modeling of turbulent flows on unstructured grids

    NASA Astrophysics Data System (ADS)

    Bakosi, Jozsef

    In probability density function (PDF) methods of turbulent flows, the joint PDF of several flow variables is computed by numerically integrating a system of stochastic differential equations for Lagrangian particles. Because the technique solves a transport equation for the PDF of the velocity and scalars, a mathematically exact treatment of advection, viscous effects and arbitrarily complex chemical reactions is possible; these processes are treated without closure assumptions. A set of algorithms is proposed to provide an efficient solution of the PDF transport equation modeling the joint PDF of turbulent velocity, frequency and concentration of a passive scalar in geometrically complex configurations. An unstructured Eulerian grid is employed to extract Eulerian statistics, to solve for quantities represented at fixed locations of the domain and to track particles. All three aspects regarding the grid make use of the finite element method. Compared to hybrid methods, the current methodology is stand-alone, therefore it is consistent both numerically and at the level of turbulence closure without the use of consistency conditions. Since both the turbulent velocity and scalar concentration fields are represented in a stochastic way, the method allows for a direct and close interaction between these fields, which is beneficial in computing accurate scalar statistics. Boundary conditions implemented along solid bodies are of the free-slip and no-slip type without the need for ghost elements. Boundary layers at no-slip boundaries are either fully resolved down to the viscous sublayer, explicitly modeling the high anisotropy and inhomogeneity of the low-Reynolds-number wall region without damping or wall-functions, or specified via logarithmic wall-functions. As in moment closures and large eddy simulation, these wall-treatments provide the usual trade-off between resolution and computational cost as required by the given application. Particular attention is focused on modeling the dispersion of passive scalars in inhomogeneous turbulent flows. Two different micromixing models are investigated that incorporate the effect of small scale mixing on the transported scalar: the widely used interaction by exchange with the mean and the interaction by exchange with the conditional mean model. An adaptive algorithm to compute the velocity-conditioned scalar mean is proposed that homogenizes the statistical error over the sample space with no assumption on the shape of the underlying velocity PDF. The development also concentrates on a generally applicable micromixing timescale for complex flow domains. Several newly developed algorithms are described in detail that facilitate a stable numerical solution in arbitrarily complex flow geometries, including a stabilized mean-pressure projection scheme, the estimation of conditional and unconditional Eulerian statistics and their derivatives from stochastic particle fields employing finite element shape functions, particle tracking through unstructured grids, an efficient particle redistribution procedure and techniques related to efficient random number generation. The algorithm is validated and tested by computing three different turbulent flows: the fully developed turbulent channel flow, a street canyon (or cavity) flow and the turbulent wake behind a circular cylinder at a sub-critical Reynolds number. The solver has been parallelized and optimized for shared memory and multi-core architectures using the OpenMP standard. Relevant aspects of performance and parallelism on cache-based shared memory machines are discussed and presented in detail. The methodology shows great promise in the simulation of high-Reynolds-number incompressible inert or reactive turbulent flows in realistic configurations.
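    As a rough illustration of the shared-memory strategy described above (a minimal editorial sketch; the Langevin coefficients and the uniform pseudo-noise are placeholders, not the solver's actual stochastic model), an OpenMP loop over independent Lagrangian particles might look like:

      /* Sketch: advancing Lagrangian particles in parallel with OpenMP.
         Particles are independent within a time step, so the loop needs
         no inter-thread communication. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <math.h>
      #include <omp.h>

      #define NP 100000
      #define NSTEPS 100

      static double urand(unsigned *s) {     /* tiny LCG: one stream per thread */
          *s = *s * 1664525u + 1013904223u;
          return *s / 4294967296.0;          /* map to [0,1) */
      }

      int main(void) {
          double *x = malloc(NP * sizeof *x);
          double *u = malloc(NP * sizeof *u);
          for (int i = 0; i < NP; i++) { x[i] = 0.0; u[i] = 0.0; }

          const double dt = 1e-3, tau = 0.1, sigma = 1.0;

          for (int step = 0; step < NSTEPS; step++) {
              #pragma omp parallel
              {
                  unsigned seed = 1234u + 97u * (unsigned)omp_get_thread_num()
                                        + (unsigned)step;
                  #pragma omp for
                  for (int i = 0; i < NP; i++) {
                      /* uniform noise scaled to variance dt, standing in
                         for a Wiener increment */
                      double dW = (urand(&seed) - 0.5) * sqrt(12.0 * dt);
                      u[i] += -u[i] / tau * dt + sigma * dW;  /* velocity model */
                      x[i] += u[i] * dt;                      /* position update */
                  }
              }
          }
          printf("x[0] = %f\n", x[0]);
          free(x); free(u);
          return 0;
      }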

  3. Learning from communities: overcoming difficulties in dissemination of prevention and promotion efforts.

    PubMed

    Miller, Robin L; Shinn, Marybeth

    2005-06-01

    The model of prevention science advocated by the Institute of Medicine (P. J. Mrazek & R. J. Haggerty, 1994) has not led to widespread adoption of prevention and promotion programs for four reasons. The model of dissemination of programs to communities fails to consider community and organizational capacity to implement programs, ignores the need for congruence in values between programs and host sites, displays a pro-innovation bias that undervalues indigenous practices, and assumes a simplistic model of how community organizations adopt innovations. To address these faults, researchers should locate, study, and help disseminate successful indigenous programs that fit community capacity and values. In addition, they should build on theoretical models of how locally developed programs work to make existing programs and policies more effective.

  4. A thermal vacuum test optimization procedure

    NASA Technical Reports Server (NTRS)

    Kruger, R.; Norris, H. P.

    1979-01-01

    An analytical model was developed that can be used to establish certain parameters of a thermal vacuum environmental test program based on an optimization of program costs. This model is in the form of a computer program that interacts with the user for the input of certain parameters. The program provides the user a list of pertinent information regarding an optimized test program and graphs of some of the parameters. The model is a first attempt in this area and includes numerous simplifications. The model appears useful as a general guide and provides a way for extrapolating past performance to future missions.

  5. 'Achievement, pride and inspiration': outcomes for volunteer role models in a community outreach program in remote Aboriginal communities.

    PubMed

    Cinelli, Renata L; Peralta, Louisa R

    2015-01-01

    There is growing support for the prosocial value of role modelling in programs for adolescents and the potentially positive impact role models can have on health and health behaviours in remote communities. Despite known benefits for remote outreach program recipients, there is limited literature on the outcomes of participation for role models. Twenty-four role models participated in a remote outreach program across four remote Aboriginal communities in the Northern Territory, Australia (100% recruitment). Role models participated in semi-structured one-on-one interviews. Transcripts were coded and underwent thematic analysis by both authors. Cultural training, Indigenous heritage and prior experience contributed to general feelings of preparedness, yet some role models experienced a degree of culture shock, confronted by how different the communities were from their own home communities. Benefits of participation included exposure to and experience with remote Aboriginal peoples and communities, increased cultural knowledge, personal learning, forming and building relationships, and skill development. Effective role model programs designed for remote Indigenous youth can have positive outcomes for both role models and the program recipients. Cultural safety training is an important factor for preparing role models and for building their cultural competency for implementing health and education programs in remote Indigenous communities in Australia. This will maximise the opportunities for participants to achieve outcomes and minimise their culture shock.

  6. Building Comprehensive Career Guidance Programs for Secondary Schools: A Handbook of Programs, Practices, and Models. Research and Development Series No. 147.

    ERIC Educational Resources Information Center

    Campbell, Robert E.; And Others

    This handbook presents management techniques, program ideas, and student activities for building comprehensive secondary career guidance programs. Part 1 (chapter 1) traces the history of guidance to set the stage for the current emphasis on comprehensive programs, summarizes four representative models for designing comprehensive programs, and…

  7. The importance of context in logic model construction for a multi-site community-based Aboriginal driver licensing program.

    PubMed

    Cullen, Patricia; Clapham, Kathleen; Byrne, Jake; Hunter, Kate; Senserrick, Teresa; Keay, Lisa; Ivers, Rebecca

    2016-08-01

    Evidence indicates that Aboriginal people are underrepresented among driver licence holders in New South Wales, which has been attributed to licensing barriers for Aboriginal people. The Driving Change program was developed to provide culturally responsive licensing services that engage Aboriginal communities and build local capacity. This paper outlines the formative evaluation of the program, including logic model construction and exploration of contextual factors. Purposive sampling was used to identify key informants (n=12) from a consultative committee of key stakeholders and program staff. Semi-structured interviews were transcribed and thematically analysed. Data from interviews informed development of the logic model. Participants demonstrated a high level of support for the program and reported that it filled an important gap. The program context revealed systemic barriers to licensing that were correspondingly targeted by specific program outputs in the logic model. Addressing underlying assumptions of the program involved managing local capacity and support to strengthen implementation. This formative evaluation highlights the importance of exploring program context as a crucial first step in logic model construction. The consultation process assisted in clarifying program goals and ensuring that the program was responding to underlying systemic factors that contribute to inequitable licensing access for Aboriginal people. Copyright © 2016 Elsevier Ltd. All rights reserved.

  8. A Component-based Programming Model for Composite, Distributed Applications

    NASA Technical Reports Server (NTRS)

    Eidson, Thomas M.; Bushnell, Dennis M. (Technical Monitor)

    2001-01-01

    The nature of scientific programming is evolving to larger, composite applications that are composed of smaller element applications. These composite applications are more frequently being targeted for distributed, heterogeneous networks of computers. They are most likely programmed by a group of developers. Software component technology and computational frameworks are being proposed and developed to meet the programming requirements of these new applications. Historically, programming systems have had a hard time being accepted by the scientific programming community. In this paper, a programming model is outlined that attempts to organize the software component concepts and fundamental programming entities into programming abstractions that will be better understood by the application developers. The programming model is designed to support computational frameworks that manage many of the tedious programming details but that also allow sufficient programmer control to design an accurate, high-performance application.

  9. A constructive Indian country response to the evidence-based program mandate.

    PubMed

    Walker, R Dale; Bigelow, Douglas A

    2011-01-01

    Over the last 20 years, governmental mandates for preferentially funding evidence-based "model" practices and programs have become doctrine in some legislative bodies, federal agencies, and state agencies. It was assumed that what works in small sample, controlled settings would work in all community settings, substantially improving safety, effectiveness, and value-for-money. The evidence-based "model" programs mandate has imposed immutable "core components," fidelity testing, alien programming and program developers, loss of familiar programs, and resource capacity requirements upon tribes, while infringing upon their tribal sovereignty and consultation rights. Tribal response in one state (Oregon) went through three phases: shock and rejection; proposing an alternative approach using criteria of cultural appropriateness, aspiring to evaluability; and adopting logic modeling. The state heard and accepted the argument that the tribal way of knowing is different and valid. Currently, a state-authorized tribal logic model and a review panel process are used to approve tribal best practices for state funding. This constructive response to the evidence-based program mandate elevates tribal practices in the funding and regulatory world, facilitates continuing quality improvement and evaluation, while ensuring that practices and programs remain based on local community context and culture. This article provides details of a model that could well serve tribes facing evidence-based model program mandates throughout the country.

  10. Moving beyond GK–12

    PubMed Central

    Ufnar, J. A.; Kuner, Susan; Shepherd, V. L.

    2012-01-01

    The National Science Foundation GK–12 program has made more than 300 awards to universities, supported thousands of graduate student trainees, and impacted thousands of K–12 students and teachers. The goals of the current study were to determine the number of sustained GK–12 programs that follow the original GK–12 structure of placing graduate students into classrooms and to propose models for universities with current funding or universities interested in starting a program. Results from surveys, literature reviews, and Internet searches of programs funded between 1999 and 2008 indicated that 19 of 188 funded sites had sustained in-classroom programs. Three distinct models emerged from an analysis of these programs: a full-stipend model, in which graduate fellows worked with partner teachers in a K–12 classroom for 2 d/wk; a supplemental stipend model in which fellows worked with teachers for 1 d/wk; and a service-learning model, in which in-classroom activity was integrated into university academic coursework. Based on these results, potential models for sustainability and replication are suggested, including establishment of formal collaborations between sustained GK–12 programs and universities interested in starting in-classroom programs; development of a new Teaching Experience for Fellows program; and integration of supplemental fellow stipends into grant broader-impact sections. PMID:22949421

  11. Program Model Checking: A Practitioner's Guide

    NASA Technical Reports Server (NTRS)

    Pressburger, Thomas T.; Mansouri-Samani, Masoud; Mehlitz, Peter C.; Pasareanu, Corina S.; Markosian, Lawrence Z.; Penix, John J.; Brat, Guillaume P.; Visser, Willem C.

    2008-01-01

    Program model checking is a verification technology that uses state-space exploration to evaluate large numbers of potential program executions. Program model checking provides improved coverage over testing by systematically evaluating all possible test inputs and all possible interleavings of threads in a multithreaded system. Model-checking algorithms use several classes of optimizations to reduce the time and memory requirements for analysis, as well as heuristics for meaningful analysis of partial areas of the state space. Our goal in this guidebook is to assemble, distill, and demonstrate emerging best practices for applying program model checking. We offer it as a starting point and introduction for those who want to apply model checking to software verification and validation. The guidebook will not discuss any specific tool in great detail, but we provide references for specific tools.
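    As a toy illustration of state-space exploration (an editorial sketch, not taken from the guidebook or any particular tool), the program below exhaustively enumerates every interleaving of two threads that each perform a non-atomic increment (load, then add-and-store) of a shared counter, and discovers the classic lost-update outcome that testing can easily miss:

      /* Sketch: the core of an explicit-state model checker applied to a
         tiny two-thread program. Exhaustive exploration proves the final
         counter can be 1 (lost update) as well as 2. */
      #include <stdio.h>

      typedef struct { int pc[2]; int reg[2]; int shared; } State;

      static int visited[3][3][3][3][3];  /* pc1, pc2, r1, r2, shared */
      static int final_seen[3];

      static void explore(State s) {
          if (visited[s.pc[0]][s.pc[1]][s.reg[0]][s.reg[1]][s.shared]++)
              return;                               /* state already seen */
          if (s.pc[0] == 2 && s.pc[1] == 2) {       /* both threads done */
              final_seen[s.shared] = 1;
              return;
          }
          for (int t = 0; t < 2; t++) {             /* schedule each thread */
              if (s.pc[t] == 2) continue;
              State n = s;
              if (s.pc[t] == 0) n.reg[t] = s.shared;      /* load          */
              else              n.shared = s.reg[t] + 1;  /* add and store */
              n.pc[t] = s.pc[t] + 1;
              explore(n);                           /* depth-first successor */
          }
      }

      int main(void) {
          State init = { {0, 0}, {0, 0}, 0 };
          explore(init);
          for (int v = 0; v <= 2; v++)
              if (final_seen[v])
                  printf("final counter value %d is reachable\n", v);
          return 0;
      }

    The `visited` table plays the role of the state store that real tools compress and optimize; the heuristics and reductions the abstract mentions exist to keep that table tractable for realistic programs.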

  12. Model driver screening and evaluation program. Volume 1, Project summary and model program recommendations

    DOT National Transportation Integrated Search

    2003-05-01

    This research project studied the feasibility as well as the scientific validity and utility of performing functional capacity screening with older drivers. A Model Program was described encompassing procedures to detect functionally impaired drivers...

  13. FEASIBILITY STUDY ON EXECUTIVE PROGRAM DEVELOPMENT FOR BASIN ECOSYSTEMS MODELING

    EPA Science Inventory

    The project was undertaken in order to provide a feasibility study in developing and implementing a complete executive program to interface automatically various basin-wide water quality models for use by relatively inexperienced modelers. This executive program should ultimately...

  14. Marketing for a Web-Based Master's Degree Program in Light of Marketing Mix Model

    ERIC Educational Resources Information Center

    Pan, Cheng-Chang

    2012-01-01

    The marketing mix model was applied with a focus on Web media to re-strategize a Web-based Master's program at a southern state university in the U.S. The program's existing marketing strategy was examined using the four components of the model: product, price, place, and promotion, in hopes of repackaging the program (product) for prospective students…

  15. Behavior Principles Structural Model of a Follow Through Program, Dayton, Ohio: Model Programs. Childhood Education.

    ERIC Educational Resources Information Center

    American Institutes for Research in the Behavioral Sciences, Palo Alto, CA.

    Prepared for a White House Conference on Children (December 1970), this report describes a program in which first- through third-graders in three schools in Dayton, Ohio, participate in a model of a Follow Through program sponsored by Siegfried Engelmann and Wesley Becker of the University of Oregon at Eugene. All teachers chose to participate,…

  16. 42 CFR § 510.320 - Treatment of incentive programs or add-on payments under existing Medicare payment systems.

    Code of Federal Regulations, 2010 CFR

    2017-10-01

    42 CFR, Public Health, vol. 5 (2017-10-01 edition): INFRASTRUCTURE AND MODEL PROGRAMS; COMPREHENSIVE CARE FOR JOINT REPLACEMENT MODEL; Pricing and Payment. § 510.320 Treatment of incentive programs or add-on payments under existing Medicare payment systems. The CJR model...

  17. 42 CFR § 510.320 - Treatment of incentive programs or add-on payments under existing Medicare payment systems.

    Code of Federal Regulations, 2010 CFR

    2016-10-01

    42 CFR, Public Health, vol. 5 (2016-10-01 edition): INFRASTRUCTURE AND MODEL PROGRAMS; COMPREHENSIVE CARE FOR JOINT REPLACEMENT MODEL; Pricing and Payment. § 510.320 Treatment of incentive programs or add-on payments under existing Medicare payment systems. The CJR model...

  18. Software reliability: Application of a reliability model to requirements error analysis

    NASA Technical Reports Server (NTRS)

    Logan, J.

    1980-01-01

    The application of a software reliability model having a well defined correspondence of computer program properties to requirements error analysis is described. Requirements error categories which can be related to program structural elements are identified and their effect on program execution considered. The model is applied to a hypothetical B-5 requirement specification for a program module.

  19. The Use of Molecular Modeling Programs in Medicinal Chemistry Instruction.

    ERIC Educational Resources Information Center

    Harrold, Marc W.

    1992-01-01

    This paper describes and evaluates the use of a molecular modeling computer program (Alchemy II) in a pharmaceutical education program. Provided are the hardware requirements and basic program features as well as several examples of how this program and its features have been applied in the classroom. (GLR)

  20. Research and development program for non-linear structural modeling with advanced time-temperature dependent constitutive relationships

    NASA Technical Reports Server (NTRS)

    Walker, K. P.

    1981-01-01

    Results of a 20-month research and development program for nonlinear structural modeling with advanced time-temperature constitutive relationships are reported. The program included: (1) the evaluation of a number of viscoplastic constitutive models in the published literature; (2) incorporation of three of the most appropriate constitutive models into the MARC nonlinear finite element program; (3) calibration of the three constitutive models against experimental data using Hastelloy-X material; and (4) application of the most appropriate constitutive model to a three dimensional finite element analysis of a cylindrical combustor liner louver test specimen to establish the capability of the viscoplastic model to predict component structural response.

  1. Computer aided reliability, availability, and safety modeling for fault-tolerant computer systems with commentary on the HARP program

    NASA Technical Reports Server (NTRS)

    Shooman, Martin L.

    1991-01-01

    Many of the most challenging reliability problems of our present decade involve complex distributed systems such as interconnected telephone switching computers, air traffic control centers, aircraft and space vehicles, and local area and wide area computer networks. In addition to the challenge of complexity, modern fault-tolerant computer systems require very high levels of reliability, e.g., avionic computers with MTTF goals of one billion hours. Most analysts find that it is too difficult to model such complex systems without computer aided design programs. In response to this need, NASA has developed a suite of computer aided reliability modeling programs beginning with CARE 3 and including a group of new programs such as: HARP, HARP-PC, the Reliability Analysts Workbench (a combination of the model solvers SURE, STEM, and PAWS with the common front-end model ASSIST), and the Fault Tree Compiler. The HARP program is studied, and how well the user can model systems using this program is investigated. One of the important objectives was to study how user-friendly this program is, e.g., how easy it is to model the system, provide the input information, and interpret the results. The experiences of the author and his graduate students who used HARP in two graduate courses are described. Some brief comparisons were made with the ARIES program, which the students also used. Theoretical studies of the modeling techniques used in HARP are also included. Of course, no answer can be more accurate than the fidelity of the model; thus an Appendix is included which discusses modeling accuracy. A broad viewpoint is taken, and all problems which occurred in the use of HARP are discussed. Such problems include: computer system problems, installation manual problems, user manual problems, program inconsistencies, program limitations, confusing notation, long run times, accuracy problems, etc.

  2. Optimization Research of Generation Investment Based on Linear Programming Model

    NASA Astrophysics Data System (ADS)

    Wu, Juan; Ge, Xueqian

    Linear programming is an important branch of operational research and a mathematical method to assist people in carrying out scientific management. GAMS is an advanced simulation and optimization modeling language that combines complex mathematical programming formulations, such as linear programming (LP), nonlinear programming (NLP), and mixed-integer programming (MIP), with system simulation. In this paper, based on a linear programming model, the optimized investment decision-making of generation is simulated and analyzed. Finally, the optimal installed capacity of power plants and the final total cost are obtained, which provides a rational decision-making basis for optimized investments.
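    As an illustration of the kind of formulation involved (a hypothetical example, not the paper's model; all symbols below are invented for the sketch), a generation-investment LP might minimize total capital and operating cost subject to meeting demand:

      \begin{align*}
      \min_{x} \quad & \sum_{i} (c_i + f_i)\, x_i \\
      \text{s.t.} \quad & \sum_{i} a_i x_i \;\ge\; D, \qquad 0 \le x_i \le U_i ,
      \end{align*}

    where x_i is the installed capacity of plant type i, c_i and f_i are unit capital and operating cost coefficients, a_i is an availability factor, D is peak demand, and U_i is a build limit. A modeling language like GAMS expresses exactly this kind of algebraic statement and hands it to an LP solver.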

  3. Representing and reasoning about program in situation calculus

    NASA Astrophysics Data System (ADS)

    Yang, Bo; Zhang, Ming-yi; Wu, Mao-nian; Xie, Gang

    2011-12-01

    Situation calculus is an expressive tool for modeling dynamical systems in artificial intelligence; changes in a dynamical world are represented naturally by the notions of action, situation and fluent in situation calculus. A program can be viewed as a discrete dynamical system, so it is possible to model programs with situation calculus. To model programs written in a small core programming language CL, the notion of fluent is expanded for representing the values of expressions. Together with some functions returning the objects of concern from expressions, a basic action theory of CL programming is constructed. Under such a theory, properties of programs, such as correctness and termination, can be reasoned about.
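    The standard device in such basic action theories is Reiter's successor-state axiom, shown here in its generic form as background (not as the paper's specific axioms):

      F(\vec{x},\, do(a,s)) \;\equiv\; \gamma_F^{+}(\vec{x},a,s) \;\lor\; \bigl(F(\vec{x},s) \,\land\, \lnot\gamma_F^{-}(\vec{x},a,s)\bigr)

    where \gamma_F^{+} and \gamma_F^{-} collect the conditions under which action a makes fluent F true or false. For a programming language, an assignment action would naturally appear in \gamma^{+} of a fluent recording a variable's value.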

  4. Numerical simulation of nonlinear feedback model of saccade generation circuit implemented in the LabView graphical programming language.

    PubMed

    Jackson, M E; Gnadt, J W

    1999-03-01

    The object-oriented graphical programming language LabView was used to implement the numerical solution to a computational model of saccade generation in primates. The computational model simulates the activity and connectivity of anatomical structures known to be involved in saccadic eye movements. The LabView program provides a graphical user interface to the model that makes it easy to observe and modify the behavior of each element of the model. Essential elements of the source code of the LabView program are presented and explained. A copy of the model is available for download from the internet.

  5. Modeling small-scale dairy farms in central Mexico using multi-criteria programming.

    PubMed

    Val-Arreola, D; Kebreab, E; France, J

    2006-05-01

    Milk supply from Mexican dairy farms does not meet demand and small-scale farms can contribute toward closing the gap. Two multi-criteria programming techniques, goal programming and compromise programming, were used in a study of small-scale dairy farms in central Mexico. To build the goal and compromise programming models, 4 ordinary linear programming models were also developed, which had objective functions to maximize metabolizable energy for milk production, to maximize margin of income over feed costs, to maximize metabolizable protein for milk production, and to minimize purchased feedstuffs. Neither multi-criteria approach was significantly better than the other; however, by applying both models it was possible to perform a more comprehensive analysis of these small-scale dairy systems. The multi-criteria programming models affirm findings from previous work and suggest that a forage strategy based on alfalfa, ryegrass, and corn silage would meet nutrient requirements of the herd. Both models suggested that there is an economic advantage in rescheduling the calving season to the second and third calendar quarters to better synchronize higher demand for nutrients with the period of high forage availability.
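    For readers unfamiliar with the two techniques, their generic textbook forms are as follows (not the study's specific models): goal programming minimizes weighted deviations from target levels g_i, while compromise programming minimizes a distance to the ideal point f_i^*:

      \begin{align*}
      \text{(goal programming)} \quad & \min \ \sum_i \bigl(d_i^{+} + d_i^{-}\bigr)
        \quad \text{s.t.} \quad f_i(x) + d_i^{-} - d_i^{+} = g_i,\ \ x \in X,\ \ d_i^{\pm} \ge 0, \\
      \text{(compromise programming)} \quad & \min \ \Bigl(\sum_i w_i^{p}\,\bigl|f_i^{*} - f_i(x)\bigr|^{p}\Bigr)^{1/p},
        \quad x \in X .
      \end{align*}

    Here the f_i correspond to the competing objectives listed above (energy for milk production, margin over feed costs, protein supply, purchased feed); the weights w_i and the exponent p are analyst choices.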

  6. Evaluation and Strategic Planning for the GLOBE Program

    NASA Astrophysics Data System (ADS)

    Geary, E. E.; Williams, V. L.

    2010-12-01

    The Global Learning and Observations to Benefit the Environment (GLOBE) Program is an international environmental education program. It unites educators, students and scientists worldwide to collaborate on inquiry-based investigations of the environment and Earth system science. Evaluation of the GLOBE program has been challenging because of its broad reach, diffuse models of implementation, and multiple stakeholders. In an effort to guide current evaluation efforts, a logic model was developed that provides a visual display of how the GLOBE program operates. Using standard elements of inputs, activities, outputs, customers and outcomes, this model describes how the program operates to achieve its goals. The template used to develop this particular logic model aligns the GLOBE program operations with its program strategy, thus ensuring that what the program is doing supports the achievement of long-term, intermediate and annual goals. It also provides a foundation for the development of key programmatic metrics that can be used to gauge progress toward the achievement of strategic goals.

  7. Airport Performance Model : Volume 2 - User's Manual and Program Documentation

    DOT National Transportation Integrated Search

    1978-10-01

    Volume II contains a User's manual and program documentation for the Airport Performance Model. This computer-based model is written in FORTRAN IV for the DEC-10. The user's manual describes the user inputs to the interactive program and gives sample...

  8. The Bridges SOI Model School Program at Palo Verde School, Palo Verde, Arizona.

    ERIC Educational Resources Information Center

    Stock, William A.; DiSalvo, Pamela M.

    The Bridges SOI Model School Program is an educational service based upon the SOI (Structure of Intellect) Model School curriculum. For the middle seven months of the academic year, all students in the program complete brief daily exercises that develop specific cognitive skills delineated in the SOI model. Additionally, intensive individual…

  9. Computer programs for the calculation of dual sting pitch and roll angles required for an articulated sting to obtain angles of attack and sideslip on wind-tunnel models

    NASA Technical Reports Server (NTRS)

    Peterson, John B., Jr.

    1991-01-01

    Two programs were developed to calculate the pitch and roll position of the conventional sting drive and the pitch of a high angle articulated sting required to position a wind tunnel model at the desired angle of attack and sideslip and to place the model as near as possible to the centerline of the tunnel. These programs account for the effects of sting offset angles, sting bending angles, and wind-tunnel stream flow angles. In addition, the second program incorporates inputs from on-board accelerometers that measure model pitch and roll with respect to gravity. The programs are presented, along with a description of their numerical operation and a definition of the variables used.
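    For an idealized pitch-roll mechanism with no sting offsets, bending, or stream angularity (the very effects the programs correct for), the standard relations mapping sting pitch θ and roll φ to the aerodynamic angles are:

      \alpha \;=\; \tan^{-1}\!\bigl(\tan\theta\,\cos\phi\bigr), \qquad
      \beta  \;=\; \sin^{-1}\!\bigl(\sin\theta\,\sin\phi\bigr)

    The programs effectively invert relations of this kind while folding in the correction terms.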

  10. The NASA/Industry Design Analysis Methods for Vibrations (DAMVIBS) Program - A government overview. [of rotorcraft technology development using finite element method

    NASA Technical Reports Server (NTRS)

    Kvaternik, Raymond G.

    1992-01-01

    An overview is presented of government contributions to the program called Design Analysis Methods for Vibrations (DAMVIBS), which attempted to develop finite-element-based analyses of rotorcraft vibrations. NASA initiated the program with a finite-element modeling program for the CH-47D tandem-rotor helicopter. The DAMVIBS program emphasized four areas: airframe finite-element modeling, difficult components studies, coupled rotor-airframe vibrations, and airframe structural optimization. Key accomplishments of the program include industrywide standards for modeling metal and composite airframes, improved industrial designs for vibrations, and the identification of critical structural contributors to airframe vibratory responses. The program also demonstrated the value of incorporating secondary modeling details in improving correlation, and the findings provide the basis for an improved finite-element-based dynamics design-analysis capability.

  11. A hybrid health service accreditation program model incorporating mandated standards and continuous improvement: interview study of multiple stakeholders in Australian health care.

    PubMed

    Greenfield, David; Hinchcliff, Reece; Hogden, Anne; Mumford, Virginia; Debono, Deborah; Pawsey, Marjorie; Westbrook, Johanna; Braithwaite, Jeffrey

    2016-07-01

    The study aim was to investigate the understandings and concerns of stakeholders regarding the evolution of health service accreditation programs in Australia. Stakeholder representatives from programs in the primary, acute and aged care sectors participated in semi-structured interviews. Across 2011-12 there were 47 group and individual interviews involving 258 participants. Interviews lasted, on average, 1 h, and were digitally recorded and transcribed. Transcriptions were analysed using textual referencing software. Four significant issues were considered to have directed the evolution of accreditation programs: altering underlying program philosophies; shifting of program content focus and details; different surveying expectations and experiences and the influence of external contextual factors upon accreditation programs. Three accreditation program models were noted by participants: regulatory compliance; continuous quality improvement and a hybrid model, incorporating elements of these two. Respondents noted the compatibility or incommensurability of the first two models. Participation in a program was reportedly experienced as ranging on a survey continuum from "malicious compliance" to "performance audits" to "quality improvement journeys". Wider contextual factors, in particular, political and community expectations, and associated media reporting, were considered significant influences on the operation and evolution of programs. A hybrid accreditation model was noted to have evolved. The hybrid model promotes minimum standards and continuous quality improvement, through examining the structure and processes of organisations and the outcomes of care. The hybrid model appears to be directing organisational and professional attention to enhance their safety cultures. Copyright © 2015 John Wiley & Sons, Ltd.

  12. Strategies for Energy Efficient Resource Management of Hybrid Programming Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Li, Dong; Supinski, Bronis de; Schulz, Martin

    2013-01-01

    Many scientific applications are programmed using hybrid programming models that use both message-passing and shared-memory, due to the increasing prevalence of large-scale systems with multicore, multisocket nodes. Previous work has shown that energy efficiency can be improved using software-controlled execution schemes that consider both the programming model and the power-aware execution capabilities of the system. However, such approaches have focused on identifying optimal resource utilization for one programming model, either shared-memory or message-passing, in isolation. The potential solution space, thus the challenge, increases substantially when optimizing hybrid models since the possible resource configurations increase exponentially. Nonetheless, with the accelerating adoption of hybrid programming models, we increasingly need improved energy efficiency in hybrid parallel applications on large-scale systems. In this work, we present new software-controlled execution schemes that consider the effects of dynamic concurrency throttling (DCT) and dynamic voltage and frequency scaling (DVFS) in the context of hybrid programming models. Specifically, we present predictive models and novel algorithms based on statistical analysis that anticipate application power and time requirements under different concurrency and frequency configurations. We apply our models and methods to the NPB MZ benchmarks and selected applications from the ASC Sequoia codes. Overall, we achieve substantial energy savings (8.74% on average and up to 13.8%) with some performance gain (up to 7.5%) or negligible performance loss.
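    A minimal sketch of the DCT half of such a scheme follows (an editorial illustration: predict_energy and its coefficients are invented stand-ins for the paper's statistical time and power models, and the DVFS half, which would use platform interfaces such as cpufreq, is omitted). It picks the thread count predicted to minimize energy before entering an OpenMP region:

      /* Sketch: dynamic concurrency throttling for an OpenMP region. */
      #include <stdio.h>
      #include <omp.h>

      /* hypothetical model: energy(t) = power(t) * time(t) */
      static double predict_energy(int threads) {
          double time  = 1.0 / threads + 0.02 * threads;  /* compute + overhead */
          double power = 40.0 + 12.0 * threads;           /* idle + per-core */
          return time * power;
      }

      int main(void) {
          int max_t = omp_get_max_threads();
          int best_t = 1;
          double best_e = predict_energy(1);
          for (int t = 2; t <= max_t; t++) {       /* search configurations */
              double e = predict_energy(t);
              if (e < best_e) { best_e = e; best_t = t; }
          }
          omp_set_num_threads(best_t);             /* throttle concurrency */
          printf("running region with %d threads\n", best_t);

          double sum = 0.0;
          #pragma omp parallel for reduction(+:sum)
          for (long i = 0; i < 10000000L; i++) sum += (double)i;
          printf("sum = %.0f\n", sum);
          return 0;
      }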

  13. A computer program for predicting nonlinear uniaxial material responses using viscoplastic models

    NASA Technical Reports Server (NTRS)

    Chang, T. Y.; Thompson, R. L.

    1984-01-01

    A computer program was developed for predicting nonlinear uniaxial material responses using viscoplastic constitutive models. Four specific models, i.e., those due to Miller, Walker, Krieg-Swearengen-Rhode, and Robinson, are included. Any other unified model is easily implemented into the program in the form of subroutines. Analysis features include stress-strain cycling, creep response, stress relaxation, thermomechanical fatigue loops, or any combination of these responses. An outline is given of the theoretical background of uniaxial constitutive models, the analysis procedure, and the numerical integration methods for solving the nonlinear constitutive equations. In addition, a discussion of the computer program implementation is also given. Finally, seven numerical examples are included to demonstrate the versatility of the computer program developed.
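    To show the kind of calculation involved (an illustrative one-state model with invented constants, far simpler than the four models named above), explicit integration of a uniaxial viscoplastic law under strain control reduces to a few lines:

      /* Sketch: explicit time integration of a simple unified viscoplastic
         model (power-law flow with linear kinematic hardening) driven at a
         constant strain rate. Constants are illustrative only. */
      #include <stdio.h>
      #include <math.h>

      int main(void) {
          const double E = 150e3;    /* Young's modulus, MPa */
          const double K = 800.0;    /* drag stress, MPa */
          const double n = 5.0;      /* rate sensitivity exponent */
          const double H = 2e3;      /* hardening modulus, MPa */
          const double edot = 1e-4;  /* applied strain rate, 1/s */
          const double dt = 0.01;    /* time step, s */

          double stress = 0.0, back = 0.0, eps = 0.0;
          for (int i = 0; i < 20000; i++) {
              double over = stress - back;                 /* overstress */
              double sgn  = (over >= 0.0) ? 1.0 : -1.0;
              double epdot = sgn * pow(fabs(over) / K, n); /* inelastic rate */
              stress += E * (edot - epdot) * dt;           /* elastic update */
              back   += H * epdot * dt;                    /* hardening */
              eps    += edot * dt;
              if (i % 5000 == 0)
                  printf("eps = %.4f  stress = %.1f MPa\n", eps, stress);
          }
          return 0;
      }

    A production code would replace the explicit Euler step with a stiffer integrator, which is exactly why the numerical integration methods the abstract mentions matter.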

  14. Formally verifying Ada programs which use real number types

    NASA Technical Reports Server (NTRS)

    Sutherland, David

    1986-01-01

    Formal verification is applied to programs which use real number arithmetic operations (mathematical programs). Formal verification of a program P consists of creating a mathematical model of P, stating the desired properties of P in a formal logical language, and proving that the mathematical model has the desired properties using a formal proof calculus. The development and verification of the mathematical model are discussed.
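    For example (a generic illustration, not taken from the report), a desired property of a square-root routine over approximate real arithmetic might be stated as the Hoare triple:

      \{\, x \ge 0 \,\} \;\; y := \mathrm{sqrt}(x) \;\; \{\, |\,y \cdot y - x\,| \le \varepsilon \,\}

    and the verification obligation is to prove, in the proof calculus, that the mathematical model of the routine, including the machine's approximate real arithmetic, satisfies the postcondition for a stated error bound ε.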

  15. 42 CFR § 512.320 - Treatment of incentive programs or add-on payments under existing Medicare payment systems.

    Code of Federal Regulations, 2010 CFR

    2017-10-01

    42 CFR, Public Health, vol. 5 (2017-10-01 edition): INFRASTRUCTURE AND MODEL PROGRAMS; EPISODE PAYMENT MODEL; Pricing and Payment. § 512.320 Treatment of incentive programs or add-on payments under existing Medicare payment systems. ... under such models are independent of, and do not affect, any incentive programs or add-on payments under...

  16. [Innovative Services: The Use of Parent Aides in Child Protective Services]. Module 2. Program Models--Which One is Right for You?

    ERIC Educational Resources Information Center

    Anderson, Stephen C.; And Others

    Module 2 of a seven module package for child protective service workers explores various types of parent aide programs for abused and neglected children and their families. Four training activities address models of parent aide programs, organization analysis, and selection of the appropriate program model. Included are directions for using the…

  17. Streamlining DOD Acquisitions: Balancing Schedule with Complexity

    DTIC Science & Technology

    2006-09-01

    from them has a distinct industrial flavor: streamlined processes, benchmarking, and business models. The requirements generation community led by... model), and the Department of the Navy assumed program lead. [Stable Program Inputs (-)] By 1984, the program goals included delivery of 913 V-22... they subsequently specified a crew of two. [Stable Program Input (-)] The contractor team won in a "fly-off" solely via modeling and simulation

  18. Case studies of fifth-grade student modeling in science through programming: Comparison of modeling practices and conversations

    NASA Astrophysics Data System (ADS)

    Louca, Loucas

    This is a descriptive case study investigating the use of two computer-based programming environments (CPEs), MicroWorlds(TM) (MW) and Stagecast Creator(TM) (SC), as modeling tools for collaborative fifth grade science learning. In this study I investigated how CPEs might support fifth grade student work and inquiry in science. There is a longstanding awareness of the need to help students learn about models and modeling in science, and CPEs are promising tools for this. A computer program can be a model of a physical system, and modeling through programming may make the process more tangible: programming involves making decisions and assumptions; the code is used to express ideas; running the program shows the implications of those ideas. In this study I have analyzed and compared students' activities and conversations in two after-school clubs, one working with MW and the other with SC. The findings confirm the promise of CPEs as tools for teaching practices of modeling and science, and they suggest advantages and disadvantages of particular aspects of CPE designs for that purpose. MW is an open-ended, textual CPE that uses procedural programming. MW students focused on breaking down phenomena into small programmable pieces, which is useful for scientific modeling. Developing their programs, the students focused on writing, testing and debugging code, which are also useful for scientific modeling. SC is a non-linear, object-oriented CPE that uses a visual programming language. SC students saw their work as creating games. They were focused on the overall story, which they then translated into SC rules; this was in conflict with SC's object-oriented interface. However, telling the story of individual causal agents was useful for scientific modeling. Programming in SC was easier, whereas reading code in MW was more tangible. The latter helped MW students to use the code as the representation of the phenomenon rather than merely as a tool for creating a simulation. The analyses also pointed to three emerging "frames" that describe students' work focus, based on their goals, strategies, and criteria for success. The emerging "frames" are the programming, the visualization, and the modeling frames. One way to understand the respective advantages and disadvantages of the two CPEs is with respect to which frames they engendered in students.

  19. A Leadership Model for University Geology Department Teacher Inservice Programs.

    ERIC Educational Resources Information Center

    Sheldon, Daniel S.; And Others

    1983-01-01

    Provides geology departments and science educators with a leadership model for developing earth science inservice programs. Model emphasizes cooperation/coordination among departments, science educators, and curriculum specialists at local/intermediate/state levels. Includes rationale for inservice programs and geology department involvement in…

  20. ESPVI 4.0 ELECTROSTATIC PRECIPITATOR V-I AND PERFORMANCE MODEL: USER'S MANUAL

    EPA Science Inventory

    The manual is the companion document for the microcomputer program ESPVI 4.0, Electrostatic Precipitation VI and Performance Model. The program was developed to provide a user- friendly interface to an advanced model of electrostatic precipitation (ESP) performance. The program i...

  1. Implementing a citizen's DWI reporting program using the Extra Eyes model

    DOT National Transportation Integrated Search

    2008-09-01

    This manual is a guide for law enforcement agencies and community organizations in creating and implementing a citizens DWI reporting program in their communities modeling the Operation Extra Eyes program. Extra Eyes is a program that engages volu...

  2. Factor Analysis of the HEW National Strategy for Youth Development Model's Community Program Impact Scales.

    ERIC Educational Resources Information Center

    Truckenmiller, James L.

    The former HEW (Health, Education, and Welfare) National Strategy for Youth Development Model proposed a community-based program to promote positive youth development and to prevent delinquency through a sequence of youth needs assessments, needs-targeted programs, and program impact evaluation. HEW Community Program Impact Scales data obtained…

  3. Evaluation of Career Guidance Programs: Models, Methods, and Microcomputers. Information Series No. 317.

    ERIC Educational Resources Information Center

    Crites, John O.

    Evaluating the effectiveness of career guidance programs is a complex process, and few comprehensive models for evaluating such programs exist. Evaluation of career guidance programs has been hampered by the myth that program outcomes are uniform and monolithic. Findings from studies of attribute treatment interactions have revealed only a few…

  4. A Growth Model for Academic Program Life Cycle (APLC): A Theoretical and Empirical Analysis

    ERIC Educational Resources Information Center

    Acquah, Edward H. K.

    2010-01-01

    Academic program life cycle concept states each program's life flows through several stages: introduction, growth, maturity, and decline. A mixed-influence diffusion growth model is fitted to enrolment data on academic programs to analyze the factors determining progress of academic programs through their life cycles. The regression analysis yield…

  5. The Origin and Evolution of the Behavior Analysis Program at the University of Nevada, Reno.

    PubMed

    Hayes, Linda J; Houmanfar, Ramona A; Ghezzi, Patrick M; Williams, W Larry; Locey, Matthew; Hayes, Steven C

    2016-05-01

    The origins of the Behavior Analysis program at the University of Nevada, Reno, by way of a self-capitalized model, through its transition to a more typical graduate program, are described. Details of the original proposal to establish the program and the funding model are described. Some of the unusual features of the program executed in this way are discussed, along with problems engendered by the model. Also included is the diversification of faculty interests over time. The status of the program now, after 25 years of operation, is presented.

  6. DIVWAG Model Documentation. Volume II. Programmer/Analyst Manual. Part 4.

    DTIC Science & Technology

    1976-07-01

    Fragments of the manual's table of contents and text: Model Constant Data Deck Structure (IV-13-A-40); Appendix B, Movement Model Program Descriptions (IV-13-B-1); Introduction ... Data (IV-15-A-17); 11. Airmobile Constant Data Deck Structure (IV-15-A-30); Appendix B, Airmobile Model Program Descriptions ... Make no changes. 12. AIRMOBILE CONSTANT DATA DECK STRUCTURE. The deck structure required by the Airmobile Model constant data load program and the data

  7. Logic Models in Out-of-School Time Programs: What Are They and Why Are They Important? Research-to-Results Brief. Publication #2007-01

    ERIC Educational Resources Information Center

    Hamilton, Jenny; Bronte-Tinkew, Jacinta

    2007-01-01

    A logic model, also called a conceptual model and theory-of-change model, is a visual representation of how a program is expected to "work." It relates resources, activities, and the intended changes or impacts that a program is expected to create. Typically, logic models are diagrams or flow charts with illustrations, text, and arrows that…

  8. An assessment of patient navigator activities in breast cancer patient navigation programs using a nine-principle framework.

    PubMed

    Gunn, Christine M; Clark, Jack A; Battaglia, Tracy A; Freund, Karen M; Parker, Victoria A

    2014-10-01

    To determine how closely a published model of navigation reflects the practice of navigation in breast cancer patient navigation programs. Observational field notes describing patient navigator activities collected from 10 purposefully sampled, foundation-funded breast cancer navigation programs in 2008-2009. An exploratory study evaluated a model framework for patient navigation published by Harold Freeman by using an a priori coding scheme based on model domains. Field notes were compiled and coded. Inductive codes were added during analysis to characterize activities not included in the original model. Programs were consistent with individual-level principles representing tasks focused on individual patients. There was variation with respect to program-level principles that related to program organization and structure. Program characteristics such as the use of volunteer or clinical navigators were identified as contributors to patterns of model concordance. This research provides a framework for defining the navigator role as focused on eliminating barriers through the provision of individual-level interventions. The diversity observed at the program level in these programs was a reflection of implementation according to target population. Further guidance may be required to assist patient navigation programs to define and tailor goals and measurement to community needs. © Health Research and Educational Trust.

  9. An Assessment of Patient Navigator Activities in Breast Cancer Patient Navigation Programs Using a Nine-Principle Framework

    PubMed Central

    Gunn, Christine M; Clark, Jack A; Battaglia, Tracy A; Freund, Karen M; Parker, Victoria A

    2014-01-01

    Objective To determine how closely a published model of navigation reflects the practice of navigation in breast cancer patient navigation programs. Data Source Observational field notes describing patient navigator activities collected from 10 purposefully sampled, foundation-funded breast cancer navigation programs in 2008–2009. Study Design An exploratory study evaluated a model framework for patient navigation published by Harold Freeman by using an a priori coding scheme based on model domains. Data Collection Field notes were compiled and coded. Inductive codes were added during analysis to characterize activities not included in the original model. Principal Findings Programs were consistent with individual-level principles representing tasks focused on individual patients. There was variation with respect to program-level principles that related to program organization and structure. Program characteristics such as the use of volunteer or clinical navigators were identified as contributors to patterns of model concordance. Conclusions This research provides a framework for defining the navigator role as focused on eliminating barriers through the provision of individual-level interventions. The diversity observed at the program level in these programs was a reflection of implementation according to target population. Further guidance may be required to assist patient navigation programs to define and tailor goals and measurement to community needs. PMID:24820445

  10. Satellite broadcasting system study

    NASA Technical Reports Server (NTRS)

    1972-01-01

    The study to develop a system model and computer program representative of broadcasting satellite systems employing community-type receiving terminals is reported. The program provides a user-oriented tool for evaluating performance/cost tradeoffs, synthesizing minimum cost systems for a given set of system requirements, and performing sensitivity analyses to identify critical parameters and technology. The performance/costing philosophy and what is meant by a minimum cost system are shown graphically. Topics discussed include: main line control program, ground segment model, space segment model, cost models, and launch vehicle selection. Several examples of minimum cost systems resulting from the computer program are presented. A listing of the computer program is also included.

  11. Multi-gene genetic programming based predictive models for municipal solid waste gasification in a fluidized bed gasifier.

    PubMed

    Pandey, Daya Shankar; Pan, Indranil; Das, Saptarshi; Leahy, James J; Kwapinski, Witold

    2015-03-01

    A multi-gene genetic programming technique is proposed as a new method to predict syngas yield production and the lower heating value for municipal solid waste gasification in a fluidized bed gasifier. The study shows that the predicted outputs of the municipal solid waste gasification process are in good agreement with the experimental dataset and also generalise well to validation (untrained) data. Published experimental datasets are used for model training and validation purposes. The results show the effectiveness of the genetic programming technique for solving complex nonlinear regression problems. The multi-gene genetic programming model is also compared with a single-gene genetic programming model to show the relative merits and demerits of the technique. This study demonstrates that the genetic programming based data-driven modelling strategy can be a good candidate for developing models for other types of fuels as well. Copyright © 2014 Elsevier Ltd. All rights reserved.
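
    For context, the defining feature of the multi-gene variant is that a prediction is formed as a least-squares-weighted sum of several evolved trees (genes) rather than the output of a single tree. In generic notation (textbook form, not necessarily the authors' exact formulation):

        \hat{y} \;=\; d_0 \;+\; \sum_{i=1}^{G} d_i \, g_i(\mathbf{x})

    where each g_i is an evolved symbolic expression of the inputs x and the coefficients d_0, ..., d_G are fitted by linear regression. A single-gene model is the special case G = 1.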

  12. Design for success: Identifying a process for transitioning to an intensive online course delivery model in health professions education.

    PubMed

    McDonald, Paige L; Harwood, Kenneth J; Butler, Joan T; Schlumpf, Karen S; Eschmann, Carson W; Drago, Daniela

    2018-12-01

    Intensive courses (ICs), or accelerated courses, are gaining popularity in medical and health professions education, particularly as programs adopt e-learning models to negotiate challenges of flexibility, space, cost, and time. In 2014, the Department of Clinical Research and Leadership (CRL) at the George Washington University School of Medicine and Health Sciences began the process of transitioning two online 15-week graduate programs to an IC model. Within a year, a third program also transitioned to this model. A literature review yielded little guidance on the process of transitioning from 15-week, traditional models of delivery to IC models, particularly in online learning environments. Correspondingly, this paper describes the process by which CRL transitioned three online graduate programs to an IC model and details best practices for course design and facilitation resulting from our iterative redesign process. Finally, we present lessons-learned for the benefit of other medical and health professions' programs contemplating similar transitions. Abbreviations: CRL: Department of Clinical Research and Leadership; HSCI: Health Sciences; IC: Intensive course; PD: Program director; QM: Quality Matters.

  13. Design for success: Identifying a process for transitioning to an intensive online course delivery model in health professions education

    PubMed Central

    McDonald, Paige L.; Harwood, Kenneth J.; Butler, Joan T.; Schlumpf, Karen S.; Eschmann, Carson W.; Drago, Daniela

    2018-01-01

    ABSTRACT Intensive courses (ICs), or accelerated courses, are gaining popularity in medical and health professions education, particularly as programs adopt e-learning models to negotiate challenges of flexibility, space, cost, and time. In 2014, the Department of Clinical Research and Leadership (CRL) at the George Washington University School of Medicine and Health Sciences began the process of transitioning two online 15-week graduate programs to an IC model. Within a year, a third program also transitioned to this model. A literature review yielded little guidance on the process of transitioning from 15-week, traditional models of delivery to IC models, particularly in online learning environments. Correspondingly, this paper describes the process by which CRL transitioned three online graduate programs to an IC model and details best practices for course design and facilitation resulting from our iterative redesign process. Finally, we present lessons-learned for the benefit of other medical and health professionsʼ programs contemplating similar transitions. Abbreviations: CRL: Department of Clinical Research and Leadership; HSCI: Health Sciences; IC: Intensive course; PD: Program director; QM: Quality Matters PMID:29277143

  14. Flexible Language Constructs for Large Parallel Programs

    DOE PAGES

    Rosing, Matt; Schnabel, Robert

    1994-01-01

    The goal of the research described in this article is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (multiple instruction multiple data [MIMD]) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include single instruction multiple data (SIMD), single program multiple data (SPMD), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. In this article, we give an overview of a new language that combines many of these programming models in a clean manner. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. In this article, we give an overview of the language and discuss some of the critical implementation details.

  15. Predoctoral Program Models in Dentistry for the Handicapped: The University of Tennessee Program.

    ERIC Educational Resources Information Center

    Wessels, Kenneth E.

    1980-01-01

    A model program in dentistry for the handicapped offered at the University of Tennessee and supported by a Robert Wood Johnson Foundation grant is described. The program includes didactic instruction, observation and seminar, and clinical practice. (JMF)

  16. Resource Allocation Modelling in Vocational Rehabilitation: A Prototype Developed with the Michigan and Rhode Island VR Agencies.

    ERIC Educational Resources Information Center

    Leff, H. Stephen; Turner, Ralph R.

    This report focuses on the use of linear programming models to address the issues of how vocational rehabilitation (VR) resources should be allocated in order to maximize program efficiency within given resource constraints. A general introduction to linear programming models is first presented that describes the major types of models available,…

  17. A Comprehensive Model for Developing and Evaluating Study Abroad Programs in Counselor Education

    ERIC Educational Resources Information Center

    Santos, Syntia Dinora

    2014-01-01

    This paper introduces a model to guide the process of designing and evaluating study abroad programs, addressing particular stages and influential factors. The main purpose of the model is to serve as a basic structure for those who want to develop their own program or evaluate previous cultural immersion experiences. The model is based on the…

  18. Modeling of Wave Propagation in the Osaka Sedimentary Basin during the 2013 Awaji Island Earthquake (Mw5.8)

    NASA Astrophysics Data System (ADS)

    Asano, K.; Sekiguchi, H.; Iwata, T.; Yoshimi, M.; Hayashida, T.; Saomoto, H.; Horikawa, H.

    2013-12-01

    The three-dimensional velocity structure model for the Osaka sedimentary basin, southwest Japan, has been developed and improved on the basis of many kinds of geophysical explorations over decades (e.g., Kagawa et al., 1993; Horikawa et al., 2003; Iwata et al., 2008). Recently, our project (Sekiguchi et al., 2013) developed a new three-dimensional velocity model for strong motion prediction of the Uemachi fault earthquake in the Osaka basin, considering both geophysical and geological information and adding newly obtained exploration data such as reflection surveys, microtremor surveys, and receiver function analysis (hereafter called the UMC2013 model). On April 13, 2013, an inland earthquake of Mw5.8 occurred in Awaji Island, close to the southwestern boundary of the aftershock area of the 1995 Kobe earthquake. Strong ground motions were densely observed at more than 100 stations in the basin. The ground motion lasted longer than four minutes in the Osaka urban area, where the bedrock depth is about 1-2 km. These long-duration ground motions are mainly due to surface waves excited in the sedimentary basin, whereas the magnitude of the earthquake is moderate and the rupture duration is expected to be less than 5 s. In this study, we modeled long-period (longer than 2 s) ground motions during this earthquake to check the performance of the present UMC2013 model and to obtain a better constraint on the attenuation factor of the sedimentary part of the basin. Seismic wave propagation in the region including the source and the Osaka basin is modeled by the finite difference method on a staggered grid, solving the elastodynamic equations. A domain of 90 km × 85 km × 25.5 km is modeled and discretized with a grid spacing of 50 m. Since the minimum S-wave velocity of the UMC2013 model is about 250 m/s, the calculation is valid for periods down to about 1 s. The effect of attenuation is included in the form Q(T) = Q0(T0/T) proposed by Graves (1996). A PML is implemented on the sides and bottom of the domain, and the code is parallelized with MPI and OpenMP. We computed 120000 steps with a time step of 0.0025 s, corresponding to 300 s from the origin time. The source is represented by a point source with the focal mechanism determined by F-net, NIED; the duration of its source time function is set to 3.1 s based on waveform fitting at rock stations outside the basin. In previous studies of the Osaka basin (Horikawa et al., 2003; Kawabe and Kamae, 2006; Iwaki and Iwata, 2011), Q0 at a reference period T0 is given as a function of S-wave velocity, Q0 = αVs. We fixed T0 at 5 s and tested Q0 values by varying α from 0.1 to 1.0. Comparing the envelopes of synthetic velocity waveforms with those of observed waveforms in the basin, α = 0.3 fits the observations well, although values from 0.2 to 0.5 are not significantly different. The simulation explains the characteristics of observed seismic waves propagating inside the basin in terms of duration and amplitude at most stations. The response velocity spectra and dominant periods will be examined to assess the areal performance of the present velocity model (UMC2013), and we will discuss how later phases are generated and propagate based on the simulated wave field. Acknowledgements: We used strong motion data from K-NET/KiK-net, JMA, Osaka Prefectural Government, BRI, and CEORKA.
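
    For orientation, the following is a minimal sketch of the kind of OpenMP-parallelized staggered-grid velocity update at the core of such an elastodynamic finite-difference code. It is a 2-D simplification with illustrative field names and uniform material handling; it is not the authors' 3-D MPI+OpenMP implementation and omits the PML boundaries and attenuation described above.

        #include <vector>

        // Illustrative 2-D staggered-grid elastodynamic fields.
        // vx, vz: particle velocities; sxx, szz, sxz: stresses; buoy: 1/density.
        struct Fields {
            int nx, nz;
            std::vector<double> vx, vz, sxx, szz, sxz, buoy;
            double& at(std::vector<double>& f, int i, int k) { return f[i * nz + k]; }
        };

        // c = dt / dx. Each (i,k) update reads stresses and writes only its own
        // velocity value, so the loop nest parallelizes cleanly across grid points.
        void update_velocity(Fields& g, double c) {
            #pragma omp parallel for collapse(2)
            for (int i = 1; i < g.nx - 1; ++i) {
                for (int k = 1; k < g.nz - 1; ++k) {
                    double b = g.at(g.buoy, i, k) * c;
                    g.at(g.vx, i, k) += b * (g.at(g.sxx, i + 1, k) - g.at(g.sxx, i, k)
                                           + g.at(g.sxz, i, k) - g.at(g.sxz, i, k - 1));
                    g.at(g.vz, i, k) += b * (g.at(g.szz, i, k + 1) - g.at(g.szz, i, k)
                                           + g.at(g.sxz, i, k) - g.at(g.sxz, i - 1, k));
                }
            }
        }

    In a production code of the kind described, one such kernel per field sits inside the time loop, with MPI exchanging halo layers between subdomains and OpenMP threading the loops within each rank.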

  19. A multistage stochastic programming model for a multi-period strategic expansion of biofuel supply chain under evolving uncertainties

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Xie, Fei; Huang, Yongxi

    Here, we develop a multistage, stochastic mixed-integer model to support biofuel supply chain expansion under evolving uncertainties. By utilizing the block-separable recourse property, we reformulate the multistage program as an equivalent two-stage program and solve it using an enhanced nested decomposition method with maximal non-dominated cuts. We conduct extensive numerical experiments and demonstrate the application of the model and algorithm in a case study based on the South Carolina settings. The value of the multistage stochastic programming method is also explored by comparing the model solution with those of an expected-value-based deterministic model and a two-stage stochastic model.

  20. A multistage stochastic programming model for a multi-period strategic expansion of biofuel supply chain under evolving uncertainties

    DOE PAGES

    Xie, Fei; Huang, Yongxi

    2018-02-04

    Here, we develop a multistage, stochastic mixed-integer model to support biofuel supply chain expansion under evolving uncertainties. By utilizing the block-separable recourse property, we reformulate the multistage program as an equivalent two-stage program and solve it using an enhanced nested decomposition method with maximal non-dominated cuts. We conduct extensive numerical experiments and demonstrate the application of the model and algorithm in a case study based on the South Carolina settings. The value of the multistage stochastic programming method is also explored by comparing the model solution with those of an expected-value-based deterministic model and a two-stage stochastic model.
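
    For readers unfamiliar with the reformulation the abstract refers to, a generic two-stage stochastic program with recourse takes the following textbook form (standard notation, not the authors' specific biofuel supply chain model):

        \min_{x}\; c^{\top}x \;+\; \mathbb{E}_{\xi}\!\left[\, Q(x,\xi) \,\right]
        \quad \text{s.t.} \quad Ax = b,\; x \ge 0,

        Q(x,\xi) \;=\; \min_{y}\; q(\xi)^{\top}y
        \quad \text{s.t.} \quad W y = h(\xi) - T(\xi)\,x,\; y \ge 0.

    Roughly, block-separable recourse means the later-stage decisions decouple once the first-stage expansion plan x is fixed, which is what allows the multistage program to be collapsed into this two-stage form and attacked with nested decomposition.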

  1. User's manual for interactive LINEAR: A FORTRAN program to derive linear aircraft models

    NASA Technical Reports Server (NTRS)

    Antoniewicz, Robert F.; Duke, Eugene L.; Patterson, Brian P.

    1988-01-01

    An interactive FORTRAN program that provides the user with a powerful and flexible tool for the linearization of aircraft aerodynamic models is documented in this report. The program LINEAR numerically determines a linear system model using nonlinear equations of motion and a user-supplied linear or nonlinear aerodynamic model. The nonlinear equations of motion used are six-degree-of-freedom equations with stationary atmosphere and flat, nonrotating earth assumptions. The system model determined by LINEAR consists of matrices for both the state and observation equations. The program has been designed to allow easy selection and definition of the state, control, and observation variables to be used in a particular model.
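
    The state-space model that LINEAR produces has the standard linearized form. In generic notation (not the program's internal variable names), for nonlinear dynamics and observations

        \dot{x} = f(x, u), \qquad y = h(x, u),

    the Jacobians are evaluated numerically about a user-selected trim point (x_0, u_0), yielding the state and observation matrices:

        \delta\dot{x} = A\,\delta x + B\,\delta u, \qquad
        \delta y = C\,\delta x + D\,\delta u,

        A = \left.\frac{\partial f}{\partial x}\right|_{(x_0,u_0)}, \quad
        B = \left.\frac{\partial f}{\partial u}\right|_{(x_0,u_0)}, \quad
        C = \left.\frac{\partial h}{\partial x}\right|_{(x_0,u_0)}, \quad
        D = \left.\frac{\partial h}{\partial u}\right|_{(x_0,u_0)}.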

  2. Efficient in-situ visualization of unsteady flows in climate simulation

    NASA Astrophysics Data System (ADS)

    Vetter, Michael; Olbrich, Stephan

    2017-04-01

    The simulation of climate data tends to produce very large data sets, which can hardly be processed in classical post-processing visualization applications. Typically, the visualization pipeline, consisting of the processes data generation, visualization mapping, and rendering, is distributed into two parts over the network or separated via file transfer. In most traditional post-processing scenarios the simulation is done on a supercomputer whereas the data analysis and visualization are done on a graphics workstation. That way, temporary data sets of huge volume have to be transferred over the network, which leads to bandwidth bottlenecks and volume limitations. The solution to this issue is to avoid temporary storage, or at least to significantly reduce data complexity. Within the Climate Visualization Lab - as part of the Cluster of Excellence "Integrated Climate System Analysis and Prediction" (CliSAP) at the University of Hamburg, in cooperation with the German Climate Computing Center (DKRZ) - we develop and integrate an in-situ approach. Our software framework DSVR is based on separating the process chain between the mapping and the rendering processes. It couples the mapping process directly to the simulation by calling methods of a parallelized data extraction library, which creates a time-based sequence of geometric 3D scenes. This sequence is stored on a special streaming server with an interactive post-filtering option and then played out asynchronously in a separate 3D viewer application. Since the rendering is part of this viewer application, the scenes can be navigated interactively. In contrast to other in-situ approaches, where 2D images are created as part of the simulation or synchronous co-visualization takes place, our method supports interaction in 3D space and in time, as well as fixed frame rates. To integrate in-situ processing based on our DSVR framework and methods into the ICON climate model, we are continuously evolving the data structures and mapping algorithms of the framework to support the ICON model's native grid structures, since DSVR was originally designed for rectilinear grids only. We have now implemented a new output module in ICON to take advantage of the DSVR visualization. The visualization can be configured, like most output modules, by using a specific namelist, and is exemplarily integrated within the non-hydrostatic atmospheric model time loop. With the integration of a DSVR-based in-situ pathline extraction within ICON, a further milestone has been reached. The pathline algorithm as well as the grid data structures have been optimized for the domain decomposition used for the parallelization of ICON, based on MPI and OpenMP. The software implementation and evaluation are done on the supercomputers at DKRZ. In principle, the data complexity is reduced from O(n^3) to O(m), where n is the grid resolution and m is the number of supporting points of all pathlines. The stability and scalability evaluation is done using Atmospheric Model Intercomparison Project (AMIP) runs. We will give a short introduction to our software framework, as well as a short overview of the implementation and usage of DSVR within ICON. Furthermore, we will present visualization and evaluation results of sample applications.

  3. Using the Eclipse Parallel Tools Platform to Assist Earth Science Model Development and Optimization on High Performance Computers

    NASA Astrophysics Data System (ADS)

    Alameda, J. C.

    2011-12-01

    Development and optimization of computational science models, particularly on high performance computers (and, with the advent of ubiquitous multicore processors, on practically every system), has long been accomplished with basic software tools: typically command-line compilers, debuggers, and performance tools that have not changed substantially since the days of serial and early vector computers. However, model complexity, including the complexity added by modern message passing libraries such as MPI, and the need for hybrid code models (such as OpenMP plus MPI) to take full advantage of high performance computers with an increasing core count per shared memory node, has made development and optimization of such codes an increasingly arduous task. Additional architectural developments, such as many-core processors, only complicate the situation further. In this paper, we describe how our NSF-funded project, "SI2-SSI: A Productive and Accessible Development Workbench for HPC Applications Using the Eclipse Parallel Tools Platform" (WHPC), seeks to improve the Eclipse Parallel Tools Platform, an environment designed to support scientific code development targeted at a diverse set of high performance computing systems. Our WHPC project takes an application-centric approach to improving Eclipse PTP. We are working with a set of scientific applications, each presenting a variety of challenges, both to drive improvements to the applications themselves and to understand, from an application developer's perspective, the shortcomings in Eclipse PTP that should shape our list of improvements. We are also partnering with performance tool providers to drive higher quality performance tool integration. We have partnered with the Cactus group at Louisiana State University to improve Eclipse's ability to work with computational frameworks and extremely complex build systems, as well as to develop educational materials for computational science and engineering developers. Finally, we are partnering with the lead PTP developers at IBM, to ensure we are as effective as possible within the Eclipse development community. We are also conducting training and outreach to our user community, including conference BOF sessions, monthly user calls, and an annual user meeting, so that we can best inform the improvements we make to Eclipse PTP. With these activities we endeavor to encourage the use of modern software engineering practices, as enabled through the Eclipse IDE, in computational science and engineering applications. These practices include proper use of source code repositories, tracking and rectifying issues, measuring and monitoring code performance against both optimizations and ever-changing software stacks and configurations on HPC systems, and ultimately encouraging the development and maintenance of testing suites; practices that have become commonplace in many software endeavors but have lagged in the development of science applications. In our view, the increased complexity of both HPC systems and science applications demands better software engineering methods, preferably enabled by modern tools such as Eclipse PTP, to help the computational science community thrive as the HPC landscape evolves.

  4. SCI model structure determination program (OSR) user's guide. [optimal subset regression

    NASA Technical Reports Server (NTRS)

    1979-01-01

    The computer program OSR (Optimal Subset Regression), which estimates models for rotorcraft body and rotor force and moment coefficients, is described. The technique used is based on the subset regression algorithm. Given time histories of aerodynamic coefficients, aerodynamic variables, and control inputs, the program computes correlations between the various time histories. The model structure determination is based on these correlations. Inputs and outputs of the program are given.

  5. Optimal Repair And Replacement Policy For A System With Multiple Components

    DTIC Science & Technology

    2016-06-17

    Search-result excerpt: "Numerical Demonstration: To implement the linear program, we use the Python Programming Language (PSF 2016) with the Pyomo optimization modeling language..." The snippet also cites Hart, W.E., C. Laird, J. Watson, and D.L. Woodruff (2012), "Pyomo: Optimization Modeling in Python," vol. 67, Springer Science & Business Media, and Hart, W.E., J. Watson, and D.L. Woodruff (2011), "Pyomo: Modeling and Solving Mathematical Programs in Python," Mathematical Programming Computation 3(3).

  6. Program Planners’ Perspectives of Promotora Roles, Recruitment, and Selection

    PubMed Central

    Koskan, Alexis; Hilfinger Messias, DeAnne K.; Friedman, Daniela B.; Brandt, Heather M.; Walsemann, Katrina M.

    2013-01-01

    Objective Program planners work with promotoras (the Spanish term for female community health workers) to reduce health disparities among underserved populations. Based on the Role-Outcomes Linkage Evaluation Model for Community Health Workers (ROLES) conceptual model, we explored how program planners conceptualized the promotora role and the approaches and strategies they used to recruit, select, and sustain promotoras. Design We conducted semi-structured, in-depth interviews with a purposive convenience sample of 24 program planners, program coordinators, promotora recruiters, research principal investigators, and other individuals who worked closely with promotoras on United States-based health programs for Hispanic women (ages 18 and older). Results Planners conceptualized the promotora role based on their personal experiences and their understanding of the underlying philosophical tenets of the promotora approach. Recruitment and selection methods reflected planners’ conceptualizations and experiences of promotoras as paid staff or volunteers. Participants described a variety of program planning and implementation methods. They focused on sustainability of the programs, the intended health behavior changes or activities, and the individual promotoras. Conclusion To strengthen health programs employing the promotora delivery model, job descriptions should delineate role expectations and boundaries and better guide promotora evaluations. We suggest including additional components such as information on funding sources, program type and delivery, and sustainability outcomes to enhance the ROLES conceptual model. The expanded model can be used to guide program planners in the planning, implementing, and evaluating of promotora health programs. PMID:23039847

  7. Model Checking JAVA Programs Using Java Pathfinder

    NASA Technical Reports Server (NTRS)

    Havelund, Klaus; Pressburger, Thomas

    2000-01-01

    This paper describes a translator called JAVA PATHFINDER from JAVA to PROMELA, the "programming language" of the SPIN model checker. The purpose is to establish a framework for verification and debugging of JAVA programs based on model checking. This work should be seen as part of a broader attempt to make formal methods applicable "in the loop" of programming within NASA's areas such as space, aviation, and robotics. Our main goal is to create automated formal methods such that programmers themselves can apply these in their daily work (in the loop) without the need for specialists to manually reformulate a program into a different notation in order to analyze the program. This work is a continuation of an effort to formally verify, using SPIN, a multi-threaded operating system programmed in Lisp for the Deep-Space 1 spacecraft, and of previous work in applying existing model checkers and theorem provers to real applications.

  8. Execution models for mapping programs onto distributed memory parallel computers

    NASA Technical Reports Server (NTRS)

    Sussman, Alan

    1992-01-01

    The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.

  9. Computer programs for forward and inverse modeling of acoustic and electromagnetic data

    USGS Publications Warehouse

    Ellefsen, Karl J.

    2011-01-01

    A suite of computer programs was developed by U.S. Geological Survey personnel for forward and inverse modeling of acoustic and electromagnetic data. This report describes the computer resources that are needed to execute the programs, the installation of the programs, the program designs, some tests of their accuracy, and some suggested improvements.

  10. A Value-Added Approach to Selecting the Best Master of Business Administration (MBA) Program

    ERIC Educational Resources Information Center

    Fisher, Dorothy M.; Kiang, Melody; Fisher, Steven A.

    2007-01-01

    Although numerous studies rank master of business administration (MBA) programs, prospective students' selection of the best MBA program is a formidable task. In this study, the authors used a linear-programming-based model called data envelopment analysis (DEA) to evaluate MBA programs. The DEA model connects costs to benefits to evaluate the…
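
    As background, the classic CCR ratio form of DEA evaluates a program j_0 by choosing output weights u and input weights v that maximize its benefit-to-cost ratio, subject to no program exceeding a ratio of one (textbook notation, not necessarily the authors' exact formulation):

        \max_{u,\,v \ge 0} \;\; \frac{u^{\top} y_{j_0}}{v^{\top} x_{j_0}}
        \qquad \text{s.t.} \qquad \frac{u^{\top} y_{j}}{v^{\top} x_{j}} \;\le\; 1
        \quad \text{for every program } j,

    where x_j and y_j are the input (cost) and output (benefit) vectors of program j. The Charnes-Cooper transformation, which normalizes v^T x_{j_0} = 1, converts this ratio problem into an equivalent linear program, which is why DEA is described as a linear-programming-based model.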

  11. Pedagogy and Processes for a Computer Programming Outreach Workshop--The Bridge to College Model

    ERIC Educational Resources Information Center

    Tangney, Brendan; Oldham, Elizabeth; Conneely, Claire; Barrett, Stephen; Lawlor, John

    2010-01-01

    This paper describes a model for computer programming outreach workshops aimed at second-level students (ages 15-16). Participants engage in a series of programming activities based on the Scratch visual programming language, and a very strong group-based pedagogy is followed. Participants are not required to have any prior programming experience.…

  12. Challenges in Disseminating Model Programs: A Qualitative Analysis of the Strengthening Washington DC Families Program

    ERIC Educational Resources Information Center

    Fox, Danielle Polizzi; Gottfredson, Denise C.; Kumpfer, Karol K.; Beatty, Penny D.

    2004-01-01

    This article discusses the challenges faced when a popular model program, the Strengthening Families Program, which in the past has been implemented on a smaller scale in single organizations, moves to a larger, multiorganization endeavor. On the basis of 42 interviews conducted with program staff, the results highlight two main themes that…

  13. Compare and Contrast Program Planning Models

    ERIC Educational Resources Information Center

    Baskas, Richard S.

    2011-01-01

    This paper will examine the differences and similarities between two program planning models, Tyler and Caffarella, to reveal their strengths and weaknesses. When adults are involved in training sessions, there are various program planning models that can be used, depending on the goal of the training session. Researchers developed these models…

  14. Reverse engineering of legacy agricultural phenology modeling system

    USDA-ARS?s Scientific Manuscript database

    A program which implements predictive phenology modeling is a valuable tool for growers and scientists. Such a program was created in the late 1980's by the creators of general phenology modeling as proof of their techniques. However, this first program could not continue to meet the needs of the fi...

  15. A Conceptual Framework for Institutional Research in Community Colleges.

    ERIC Educational Resources Information Center

    Alfred, Richard L.; Ivens, Stephen H.

    This paper defines a conceptual model for institutional research in the community college and identifies sources of information, programs, and services that provide data necessary for implementation of the model. The model contains four specific subsystems: goal setting, program development, program review, and cost effectiveness. Each subsystem…

  16. Social Cognitive Theory Recommendations for Improving Modeling in Adolescent Substance Abuse Prevention Programs.

    ERIC Educational Resources Information Center

    Cleaveland, Bonnie L.

    1994-01-01

    Current state-of-the-art substance abuse prevention programs are mostly social cognitive theory based. However, there are few publications which review specifically how modeling is applied to adolescent substance abuse prevention programs. This article reviews theoretical considerations for implementing modeling for this purpose. (Author/LKS)

  17. A Model for Evaluating Development Programs. Miscellaneous Report.

    ERIC Educational Resources Information Center

    Burton, John E., Jr.; Rogers, David L.

    Taking the position that the Classical Experimental Evaluation (CEE) Model does not do justice to the process of acquiring information necessary for decision making regarding planning, programming, implementing, and recycling program activities, this paper presents the Inductive, System-Process (ISP) evaluation model as an alternative to be used in…

  18. Field modeling of heat transfer in atrium

    NASA Astrophysics Data System (ADS)

    Nedryshkin, Oleg; Gravit, Marina; Bushuev, Nikolay

    2017-10-01

    The results of calculating fire risk are an important element in the system of modern fire safety assessment. The article reviews work on the mathematical modeling of fires in rooms. A comparison of different calculation models used in fire risk assessment and fire modeling programs was performed. The results of full-scale fire tests and of fire modeling in the FDS program are presented. An analysis of empirical and theoretical data on fire modeling is made, and a conclusion is drawn about the modeling accuracy of the FDS program.

  19. The NASA/MSFC global reference atmospheric model: 1990 version (GRAM-90). Part 2: Program/data listings

    NASA Technical Reports Server (NTRS)

    Justus, C. G.; Alyea, F. N.; Cunnold, D. M.; Jeffries, W. R., III; Johnson, D. L.

    1991-01-01

    A new (1990) version of the NASA/MSFC Global Reference Atmospheric Model (GRAM-90) was completed, and the program and key database listings are presented. GRAM-90 incorporates extensive new data, mostly collected under the Middle Atmosphere Program, to produce a completely revised middle atmosphere model (20 to 120 km). At altitudes greater than 120 km, GRAM-90 uses the NASA Marshall Engineering Thermosphere model. Complete listings of all programs and major data bases are presented. Also, a test case is included.

  20. A Structural Evaluation of a Large-Scale Quasi-Experimental Microfinance Initiative.

    PubMed

    Kaboski, Joseph P; Townsend, Robert M

    2011-09-01

    This paper uses a structural model to understand, predict, and evaluate the impact of an exogenous microcredit intervention program, the Thai Million Baht Village Fund program. We model household decisions in the face of borrowing constraints, income uncertainty, and high-yield indivisible investment opportunities. After estimation of parameters using pre-program data, we evaluate the model's ability to predict and interpret the impact of the village fund intervention. Simulations from the model mirror the data in yielding a greater increase in consumption than credit, which is interpreted as evidence of credit constraints. A cost-benefit analysis using the model indicates that some households value the program much more than its per household cost, but overall the program costs 20 percent more than the sum of these benefits.

  1. Core Professionalism Education in Surgery: A Systematic Review.

    PubMed

    Sarıoğlu Büke, Akile; Karabilgin Öztürkçü, Özlem Sürel; Yılmaz, Yusuf; Sayek, İskender

    2018-03-15

    Professionalism education is one of the major elements of surgical residency education. The aim of this study was to evaluate studies on core professionalism education programs in surgery, via a systematic review. This systematic literature review was performed to analyze core professionalism programs for surgical residency education published in English with at least three of the following features: program developmental model/instructional design method, aims and competencies, methods of teaching, methods of assessment, and program evaluation model or method. A total of 27,083 articles were retrieved using EBSCOHOST, PubMed, Science Direct, Web of Science, and manual search. Eight articles met the selection criteria. The instructional design method was presented in only one article, which described the Analysis, Design, Development, Implementation, and Evaluation model. Six articles were based on the Accreditation Council for Graduate Medical Education criteria, although there was significant variability in content. The most common teaching method was role modeling with scenario- and case-based learning. A wide range of assessment methods for evaluating professionalism education were reported. The Kirkpatrick model was reported in one article as a method for program evaluation. It is suggested that for a core surgical professionalism education program, the developmental/instructional design model, aims and competencies, content, teaching methods, assessment methods, and program evaluation methods/models should be well defined, and the content should be comparable.

  2. Flightdeck Automation Problems (FLAP) Model for Safety Technology Portfolio Assessment

    NASA Technical Reports Server (NTRS)

    Ancel, Ersin; Shih, Ann T.

    2014-01-01

    NASA's Aviation Safety Program (AvSP) develops and advances methodologies and technologies to improve air transportation safety. The Safety Analysis and Integration Team (SAIT) conducts a safety technology portfolio assessment (PA) to analyze the program content, to examine the benefits and risks of products with respect to program goals, and to support programmatic decision making. The PA process includes systematic identification of current and future safety risks as well as tracking several quantitative and qualitative metrics to ensure the program goals are addressing prominent safety risks accurately and effectively. One of the metrics within the PA process involves using quantitative aviation safety models to gauge the impact of the safety products. This paper demonstrates the role of aviation safety modeling by providing model outputs and evaluating a sample of portfolio elements using the Flightdeck Automation Problems (FLAP) model. The model enables not only ranking of the quantitative relative risk reduction impact of all portfolio elements, but also highlighting the areas with high potential impact via sensitivity and gap analyses in support of the program office. Although the model outputs are preliminary and products are notional, the process shown in this paper is essential to a comprehensive PA of NASA's safety products in the current program and future programs/projects.

  3. GUI to Facilitate Research on Biological Damage from Radiation

    NASA Technical Reports Server (NTRS)

    Cucinotta, Frances A.; Ponomarev, Artem Lvovich

    2010-01-01

    A graphical-user-interface (GUI) computer program has been developed to facilitate research on the damage caused by highly energetic particles and photons impinging on living organisms. The program brings together, into one computational workspace, computer codes that have been developed over the years, plus codes that will be developed during the foreseeable future, to address diverse aspects of radiation damage. These include codes that implement radiation-track models, codes for biophysical models of breakage of deoxyribonucleic acid (DNA) by radiation, pattern-recognition programs for extracting quantitative information from biological assays, and image-processing programs that aid visualization of DNA breaks. The radiation-track models are based on transport models of interactions of radiation with matter and solution of the Boltzmann transport equation by use of both theoretical and numerical models. The biophysical models of breakage of DNA by radiation include biopolymer coarse-grained and atomistic models of DNA, stochastic- process models of deposition of energy, and Markov-based probabilistic models of placement of double-strand breaks in DNA. The program is designed for use in the NT, 95, 98, 2000, ME, and XP variants of the Windows operating system.

  4. Assessment of Evidence-based Management Training Program: Application of a Logic Model.

    PubMed

    Guo, Ruiling; Farnsworth, Tracy J; Hermanson, Patrick M

    2016-06-01

    The purposes of this study were to apply a logic model to plan and implement an evidence-based management (EBMgt) educational training program for healthcare administrators and to examine whether a logic model is a useful tool for evaluating the outcomes of the educational program. The logic model was used as a conceptual framework to guide the investigators in developing an EBMgt educational training program and evaluating the outcomes of the program. The major components of the logic model were constructed as inputs, outputs, and outcomes/impacts. The investigators delineated the logic model based on the results of the needs assessment survey. Two 3-hour training workshops were delivered to 30 participants. To assess the outcomes of the EBMgt educational program, pre- and post-tests and self-reflection surveys were conducted. The data were collected and analyzed descriptively and inferentially, using the IBM Statistical Package for the Social Sciences (SPSS) 22.0. A paired sample t-test was performed to compare the differences in participants' EBMgt knowledge and skills prior to and after the training. The assessment results showed that there was a statistically significant difference in participants' EBMgt knowledge and information searching skills before and after the training (p< 0.001). Participants' confidence in using the EBMgt approach for decision-making was significantly increased after the training workshops (p< 0.001). Eighty-three percent of participants indicated that the knowledge and skills they gained through the training program could be used for future management decision-making in their healthcare organizations. The overall evaluation results of the program were positive. It is suggested that the logic model is a useful tool for program planning, implementation, and evaluation, and it also improves the outcomes of the educational program.

  5. Analysis of academic programs: comparing nursing and other university majors in the application of a quality, potential and cost model.

    PubMed

    Booker, Kathy; Hilgenberg, Cheryl

    2010-01-01

    Nursing is often considered expensive in the cost analysis of academic programs. Yet nursing programs have the power to attract many students, and the national nursing shortage has resulted in a high demand for nurses. Methods to systematically assess programs across an entire university academic division are often dissimilar in technique and outcome. At a small, private, Midwestern university, a model for comprehensive program assessment, titled the Quality, Potential and Cost (QPC) model, was developed and applied to each major offered at the university through the collaborative effort of directors, chairs, deans, and the vice president for academic affairs. The QPC model provides a means of equalizing data so that single measures (such as cost) are not viewed in isolation. It also provides a common language to ensure that all academic leaders at an institution apply consistent methods for assessment of individual programs. The application of the QPC model allowed for consistent, fair assessments and the ability to allocate resources to programs according to strategic direction. In this article, the application of the QPC model to School of Nursing majors and other selected university majors will be illustrated. Copyright 2010 Elsevier Inc. All rights reserved.

  6. Combining Thermal And Structural Analyses

    NASA Technical Reports Server (NTRS)

    Winegar, Steven R.

    1990-01-01

    Computer code makes programs compatible so that stresses and deformations can be calculated. Paper describes computer code combining thermal analysis with structural analysis. Called SNIP (for SINDA-NASTRAN Interfacing Program), code provides interface between finite-difference thermal model of system and finite-element structural model when there is no node-to-element correlation between models. Eliminates much manual work in converting temperature results of SINDA (Systems Improved Numerical Differencing Analyzer) program into thermal loads for NASTRAN (NASA Structural Analysis) program. Used to analyze concentrating reflectors for solar generation of electric power. Large thermal and structural models needed to predict distortion of surface shapes, and SNIP saves considerable time and effort in combining models.

  7. Planning Models for Tuberculosis Control Programs

    PubMed Central

    Chorba, Ronald W.; Sanders, J. L.

    1971-01-01

    A discrete-state, discrete-time simulation model of tuberculosis is presented, with submodels of preventive interventions. The model allows prediction of the prevalence of the disease over the simulation period. Preventive and control programs and their optimal budgets may be planned by using the model for cost-benefit analysis: costs are assigned to the program components and disease outcomes to determine the ratio of program expenditures to future savings on medical and socioeconomic costs of tuberculosis. Optimization is achieved by allocating funds in successive increments to alternative program components in simulation and identifying those components that lead to the greatest reduction in prevalence for the given level of expenditure. The method is applied to four hypothetical disease prevalence situations. PMID:4999448
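
    The incremental allocation procedure the abstract describes is essentially a greedy marginal-benefit loop around the simulation model. A minimal sketch follows, under assumed interfaces: the toy simulate() function standing in for the disease model is hypothetical, as are the component count and budget figures.

        #include <cmath>
        #include <cstdio>
        #include <vector>

        // Toy stand-in for the tuberculosis simulation model (illustrative only):
        // predicted prevalence falls with diminishing returns as each program
        // component (e.g., screening, vaccination, treatment) receives funds.
        double simulate(const std::vector<double>& alloc) {
            double prevalence = 1000.0;
            for (double x : alloc) prevalence -= 50.0 * std::sqrt(x);
            return prevalence > 0.0 ? prevalence : 0.0;
        }

        // Greedy incremental allocation: give each successive budget increment
        // to whichever component yields the largest drop in predicted prevalence.
        std::vector<double> allocate(int nComponents, double budget, double increment) {
            std::vector<double> alloc(nComponents, 0.0);
            for (double spent = 0.0; spent + increment <= budget; spent += increment) {
                int best = -1;
                double bestPrev = simulate(alloc);
                for (int c = 0; c < nComponents; ++c) {
                    alloc[c] += increment;
                    double prev = simulate(alloc);
                    if (prev < bestPrev) { bestPrev = prev; best = c; }
                    alloc[c] -= increment;
                }
                if (best < 0) break;  // no component reduces prevalence further
                alloc[best] += increment;
            }
            return alloc;
        }

        int main() {
            std::vector<double> plan = allocate(3, 100.0, 10.0);
            for (size_t c = 0; c < plan.size(); ++c)
                std::printf("component %zu: %.0f\n", c, plan[c]);
            std::printf("predicted prevalence: %.1f\n", simulate(plan));
            return 0;
        }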

  8. Program Development and External Assessment.

    ERIC Educational Resources Information Center

    Minnis, D. L.

    Although the development of new models is essential to the improvement of teacher preparation programs, the California system of accreditation of teacher education programs seems to hinder innovative program development and evaluation. External assessment in California is based on a discrepancy model which works best when applied to static…

  9. Unifying Model-Based and Reactive Programming within a Model-Based Executive

    NASA Technical Reports Server (NTRS)

    Williams, Brian C.; Gupta, Vineet; Norvig, Peter (Technical Monitor)

    1999-01-01

    Real-time, model-based deduction has recently emerged as a vital component in AI's tool box for developing highly autonomous reactive systems. Yet one of the current hurdles towards developing model-based reactive systems is the number of methods simultaneously employed, and their corresponding melange of programming and modeling languages. This paper offers an important step towards unification. We introduce RMPL, a rich modeling language that combines probabilistic, constraint-based modeling with reactive programming constructs, while offering a simple semantics in terms of hidden state Markov processes. We introduce probabilistic, hierarchical constraint automata (PHCA), which allow Markov processes to be expressed in a compact representation that preserves the modularity of RMPL programs. Finally, a model-based executive, called Reactive Burton, is described that exploits this compact encoding to perform efficient simulation, belief state update, and control sequence generation.

  10. Exploring performance and energy tradeoffs for irregular applications: A case study on the Tilera many-core architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Panyala, Ajay; Chavarría-Miranda, Daniel; Manzano, Joseph B.

    High performance, parallel applications with irregular data accesses are becoming a critical workload class for modern systems. In particular, the execution of such workloads on emerging many-core systems is expected to be a significant component of applications in data mining, machine learning, scientific computing and graph analytics. However, power and energy constraints limit the capabilities of individual cores, the memory hierarchy and the on-chip interconnect of such systems, thus leading to architectural and software trade-offs that must be understood in the context of the intended application's behavior. Irregular applications are notoriously hard to optimize given their data-dependent access patterns, lack of structured locality and complex data structures and code patterns. We have ported two irregular applications, graph community detection using the Louvain method (Grappolo) and high-performance conjugate gradient (HPCCG), to the Tilera many-core system and have conducted a detailed study of platform-independent and platform-specific optimizations that improve their performance as well as reduce their overall energy consumption. To conduct this study, we employ an auto-tuning based approach that explores the optimization design space along three dimensions: memory layout schemes, GCC compiler flag choices and OpenMP loop scheduling options. We leverage MIT's OpenTuner auto-tuning framework to explore and recommend energy-optimal choices for different combinations of parameters. We then conduct an in-depth architectural characterization to understand the memory behavior of the selected workloads. Finally, we perform a correlation study to demonstrate the interplay between the hardware behavior and application characteristics. Using auto-tuning, we demonstrate whole-node energy savings and performance improvements of up to 49.6% and 60% relative to a baseline instantiation, and up to 31% and 45.4% relative to manually optimized variants.
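
    One of the three tuning dimensions named above, OpenMP loop scheduling, can be exposed to an external auto-tuner without recompilation by deferring the choice to run time. A minimal sketch follows; the deliberately imbalanced loop is illustrative, not the Grappolo or HPCCG kernels.

        #include <cmath>
        #include <cstdio>
        #include <omp.h>

        int main() {
            const int n = 1 << 20;
            double sum = 0.0;
            double t0 = omp_get_wtime();
            // schedule(runtime) defers the scheduling decision to the OMP_SCHEDULE
            // environment variable, so a framework like OpenTuner can sweep choices
            // such as "static", "dynamic,64", or "guided" across runs.
            #pragma omp parallel for schedule(runtime) reduction(+ : sum)
            for (int i = 0; i < n; ++i) {
                // Iteration cost varies with i, mimicking an irregular workload.
                int work = i % 97;
                double x = i;
                for (int j = 0; j < work; ++j) x = std::sqrt(x + j);
                sum += x;
            }
            std::printf("sum=%f elapsed=%f s\n", sum, omp_get_wtime() - t0);
            return 0;
        }

    A tuner then runs, e.g., OMP_SCHEDULE="dynamic,64" ./a.out to evaluate one point of the search space; the other two dimensions, memory layout and compiler flags, are varied at build time.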

  11. Parallelization of Lower-Upper Symmetric Gauss-Seidel Method for Chemically Reacting Flow

    NASA Technical Reports Server (NTRS)

    Yoon, Seokkwan; Jost, Gabriele; Chang, Sherry

    2005-01-01

    Development of technologies for exploration of the solar system has revived interest in computational simulation of chemically reacting flows, since planetary probe vehicles exhibit non-equilibrium phenomena during atmospheric entry at a planet or moon as well as during reentry to the Earth. Stability in combustion is essential for new propulsion systems. Numerical solution of real-gas flows often increases computational work by an order of magnitude compared to perfect gas flow, partly because of the increased complexity of the equations to solve. Recently, as part of Project Columbia, NASA has integrated a cluster of interconnected SGI Altix systems to provide a ten-fold increase over its current supercomputing capacity, which includes an SGI Origin system. Both the new and existing machines are based on a cache-coherent non-uniform memory access architecture. The Lower-Upper Symmetric Gauss-Seidel (LU-SGS) relaxation method has been implemented in both perfect and real gas flow codes, including the Real-Gas Aerodynamic Simulator (RGAS). However, the vectorized RGAS code runs inefficiently on cache-based shared-memory machines such as SGI systems. Parallelization of a Gauss-Seidel method is nontrivial due to its sequential nature. The LU-SGS method has been vectorized on an oblique plane in the INS3D-LU code, which has been one of the base codes for the NAS Parallel Benchmarks. The oblique plane has been called a hyperplane by computer scientists. It is straightforward to parallelize a Gauss-Seidel method by partitioning the hyperplanes once they are formed. Another way of parallelizing is to schedule processors like a pipeline using software. Both hyperplane and pipeline methods have been implemented using OpenMP directives. The present paper reports the performance of the parallelized RGAS code on SGI Origin and Altix systems.
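
    As background, the hyperplane approach works because on a plane i+j+k = const, every cell's lower-triangular Gauss-Seidel dependencies (the i-1, j-1, k-1 neighbors) lie on previous planes, so all cells within one plane can be updated concurrently. The following is a minimal sketch of that idea; the field name and the averaging update are illustrative, not the RGAS or INS3D-LU kernels.

        #include <vector>

        // Lower-triangular Gauss-Seidel sweep in hyperplane order. Cells with
        // equal p = i+j+k are mutually independent, because each cell depends
        // only on (i-1,j,k), (i,j-1,k), (i,j,k-1), which lie on earlier planes.
        void lower_sweep(std::vector<double>& u, int n) {
            auto idx = [n](int i, int j, int k) { return (i * n + j) * n + k; };
            for (int p = 3; p <= 3 * (n - 1); ++p) {    // planes, in order
                #pragma omp parallel for collapse(2)     // cells within a plane
                for (int i = 1; i < n; ++i) {
                    for (int j = 1; j < n; ++j) {
                        int k = p - i - j;
                        if (k < 1 || k >= n) continue;   // outside this plane
                        // Illustrative relaxation: average of updated neighbors.
                        u[idx(i, j, k)] = (u[idx(i - 1, j, k)] +
                                           u[idx(i, j - 1, k)] +
                                           u[idx(i, j, k - 1)]) / 3.0;
                    }
                }
            }
        }

    The pipeline alternative mentioned in the abstract instead keeps the natural loop order and software-schedules the threads so that each thread starts its portion of a plane of data as soon as the neighboring thread has finished the cells it depends on.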

  12. Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations

    PubMed Central

    Langenkämper, Daniel; Jakobi, Tobias; Feld, Dustin; Jelonek, Lukas; Goesmann, Alexander; Nattkemper, Tim W.

    2016-01-01

    Within recent years, clock rates of modern processors have stagnated while the demand for computing power continues to grow. This applies particularly to the fields of life sciences and bioinformatics, where new technologies keep creating rapidly growing piles of raw data with increasing speed. The number of cores per processor has increased in an attempt to compensate for the slight increments in clock rates. This technological shift demands changes in software development, especially in the field of high performance computing, where parallelization techniques are gaining in importance due to the pressing issue of large datasets generated by, e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists some applications employing these in different areas of sequence informatics. Furthermore, we provide examples of automatic acceleration of two use cases to show typical problems and gains of transforming a serial application to a parallel one. The paper should aid the reader in deciding on a technique for the problem at hand. We compare four different state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC). Their performance as well as their applicability for selected use cases is discussed. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example; however, GPU performance is only superior beyond a certain problem size, due to data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for the CPU are mature, and usually no additional manual adjustment is required. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures. PMID:26904094
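
    Of the four approaches compared, OpenMP is the manual baseline. For the matrix multiplication use case mentioned above, a typical hand parallelization looks like the following sketch; the loop order and layout choices are generic, not taken from the paper.

        #include <vector>

        // Parallel dense matrix multiplication C += A * B (row-major, n x n).
        // The i-k-j loop order keeps the inner loop streaming over contiguous
        // memory, which typically auto-vectorizes well; the outer loop is
        // split across threads with OpenMP.
        void matmul(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int n) {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i) {
                for (int k = 0; k < n; ++k) {
                    const double a = A[i * n + k];
                    for (int j = 0; j < n; ++j) {
                        C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }

    Polyhedral tools such as PluTo-SICA and PPCG transform loop nests of exactly this shape automatically (tiling, fusing, or offloading them), which is why the kernel is a natural benchmark for comparing the approaches.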

  13. Technical note: RabbitCT--an open platform for benchmarking 3D cone-beam reconstruction algorithms.

    PubMed

    Rohkohl, C; Keck, B; Hofmann, H G; Hornegger, J

    2009-09-01

    Fast 3D cone-beam reconstruction is mandatory for many clinical workflows, and for that reason researchers and industry work hard on hardware-optimized 3D reconstruction. Backprojection, a major component of many reconstruction algorithms, requires projecting each voxel onto the projection data and interpolating the data before updating the voxel value. This step is the bottleneck of most reconstruction algorithms and the focus of optimization in recent publications. A crucial limitation of these publications, however, is that the presented results are not comparable to each other, mainly because of variations in data acquisition, preprocessing, and chosen geometries, and the lack of a common publicly available test dataset. The authors provide such a standardized dataset, which allows a direct comparison of hardware-accelerated backprojection methods. They developed the open platform RabbitCT (www.rabbitCT.com) for worldwide comparison of backprojection performance and ranking on different architectures, using a specific high-resolution C-arm CT dataset of a rabbit. This includes a sophisticated benchmark interface, a prototype implementation in C++, and image quality measures. At the time of writing, six backprojection implementations are already listed on the website. Optimizations include multithreading using Intel Threading Building Blocks and OpenMP, vectorization using SSE, and computation on the GPU using CUDA 2.0. There is a need for an objective comparison of backprojection implementations for reconstruction algorithms; RabbitCT aims to meet it by offering an open platform with fair chances for all participants. The authors look forward to a growing community and await feedback regarding future evaluations of novel software- and hardware-based acceleration schemes.
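
    To make the bottleneck concrete, a voxel-driven backprojection loop might look like the following sketch, parallelized over slices with OpenMP. The 3x4 projection matrix P, the bilinear interpolation helper, and all names are illustrative assumptions, not the RabbitCT reference implementation.

        #include <vector>

        // Assumed helper: bilinear interpolation of the projection image at (u, v).
        static double interp(const std::vector<float>& img, int w, int h,
                             double u, double v) {
            const int iu = static_cast<int>(u), iv = static_cast<int>(v);
            if (iu < 0 || iv < 0 || iu >= w - 1 || iv >= h - 1) return 0.0;
            const double fu = u - iu, fv = v - iv;
            return (1 - fv) * ((1 - fu) * img[iv * w + iu] + fu * img[iv * w + iu + 1])
                 + fv * ((1 - fu) * img[(iv + 1) * w + iu] + fu * img[(iv + 1) * w + iu + 1]);
        }

        // Backproject one projection image into the volume; slices are
        // independent, so the outer loop is parallelized with OpenMP.
        void backproject(std::vector<float>& vol, int nx, int ny, int nz,
                         const double P[12], const std::vector<float>& img,
                         int w, int h) {
            #pragma omp parallel for
            for (int z = 0; z < nz; ++z)
                for (int y = 0; y < ny; ++y)
                    for (int x = 0; x < nx; ++x) {
                        // Project the voxel center into the detector plane.
                        const double s = P[8] * x + P[9] * y + P[10] * z + P[11];
                        const double u = (P[0] * x + P[1] * y + P[2] * z + P[3]) / s;
                        const double v = (P[4] * x + P[5] * y + P[6] * z + P[7]) / s;
                        // Distance-weighted update (FDK-style 1/s^2 weighting).
                        vol[(z * ny + y) * nx + x] +=
                            static_cast<float>(interp(img, w, h, u, v) / (s * s));
                    }
        }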

  15. A Growth Model for the Academic Program Life Cycle (APLC): A Theoretical and Empirical Analysis. IR Applications, Volume 33

    ERIC Educational Resources Information Center

    Acquah, Edward H. K.

    2012-01-01

    The academic program life cycle (APLC) concept states that each program's life flows through several stages: introduction, growth, maturity, and decline. A mixed-influence diffusion growth model is fitted to annual enrollment data on academic programs to analyze the factors determining the progress of academic programs through their life cycles. The…
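
    For orientation, a mixed-influence diffusion model in the Bass tradition (an assumption here; the report's exact specification is not reproduced in this abstract) describes cumulative enrollment N(t) approaching a ceiling m under an external-influence coefficient p and an internal-influence (imitation) coefficient q:

        \frac{dN}{dt} = \left( p + \frac{q}{m} N(t) \right) \bigl( m - N(t) \bigr)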

  16. Computer programs for calculation of sting pitch and roll angles required to obtain angles of attack and sideslip on wind tunnel models

    NASA Technical Reports Server (NTRS)

    Peterson, John B., Jr.

    1988-01-01

    Two programs have been developed to calculate the pitch and roll angles of a wind-tunnel sting drive system that will position a model at the desired angle of attack and angle of sideslip in the wind tunnel. These programs account for the effects of sting offset angles, sting bending angles, and wind-tunnel stream flow angles. In addition, the second program incorporates inputs from on-board accelerometers that measure model pitch and roll with respect to gravity. The programs are presented in the report, together with a description of their numerical operation and definitions of the variables they use.
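
    The core kinematic conversion for a pitch/roll sting can be sketched as follows (a simplified illustration that ignores the sting offset, sting bending, and stream flow angle corrections the actual programs apply; all names are assumptions): for a model rolled by phi on a sting pitched by sigma, tan(alpha) = tan(sigma) cos(phi) and sin(beta) = sin(sigma) sin(phi), which invert to cos(sigma) = cos(alpha) cos(beta) and tan(phi) = tan(beta) / sin(alpha).

        #include <cmath>
        #include <cstdio>

        // Given desired angle of attack and sideslip (radians), compute the
        // sting pitch and model roll angles, ignoring offsets, bending, and
        // flow angularity.
        void sting_angles(double alpha, double beta, double* pitch, double* roll) {
            *pitch = std::acos(std::cos(alpha) * std::cos(beta));
            *roll  = std::atan2(std::tan(beta), std::sin(alpha));
        }

        int main() {
            const double deg = 3.14159265358979323846 / 180.0;
            double pitch = 0.0, roll = 0.0;
            sting_angles(10.0 * deg, 5.0 * deg, &pitch, &roll);
            std::printf("pitch = %.2f deg, roll = %.2f deg\n", pitch / deg, roll / deg);
            return 0;
        }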

  17. A data-input program (MFI2005) for the U.S. Geological Survey modular groundwater model (MODFLOW-2005) and parameter estimation program (UCODE_2005)

    USGS Publications Warehouse

    Harbaugh, Arien W.

    2011-01-01

    The MFI2005 data-input (entry) program was developed for use with the U.S. Geological Survey modular three-dimensional finite-difference groundwater model, MODFLOW-2005. MFI2005 runs on personal computers and is designed to be easy to use; data are entered interactively through a series of display screens. MFI2005 supports parameter estimation with the UCODE_2005 program. Data for MODPATH, a particle-tracking program for use with MODFLOW-2005, also can be entered using MFI2005. MFI2005 can be used in conjunction with other data-input programs so that the different parts of a model dataset can be entered using the most suitable program.

  18. Programming of a flexible computer simulation to visualize pharmacokinetic-pharmacodynamic models.

    PubMed

    Lötsch, J; Kobal, G; Geisslinger, G

    2004-01-01

    Teaching pharmacokinetic-pharmacodynamic (PK/PD) models can be made more effective using computer simulations. We propose programming educational PK or PK/PD computer simulations as an alternative to using pre-built simulation software. This approach has the advantage of adaptability to non-standard or complicated PK or PK/PD models. Simplicity of the programming procedure was achieved by selecting the LabVIEW programming environment, in which an intuitive user interface to visualize the time courses of drug concentrations or effects can be built from pre-built elements. The environment uses a wiring analogy that resembles electrical circuit diagrams rather than abstract programming code. High interactivity was attained by letting the program run in continuously repeating loops, which makes it respond flexibly to user input. The programming is described with the aid of a 2-compartment PK simulation. Examples of more sophisticated simulation programs are also given, in which the PK/PD simulation shows drug input, concentrations in plasma and at the effect site, and the effects themselves as a function of time; a multi-compartmental model of morphine, including metabolite kinetics and effects, is also included. The programs are available for download from the World Wide Web at http://www.klinik.uni-frankfurt.de/zpharm/klin/PKPDsimulation/content.html. Thus even pharmacokineticists who program only occasionally can build such computer simulations, together with the flexible interactive simulation algorithm, for clinical pharmacology teaching in the field of PK/PD models.
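
    The 2-compartment model that anchors the tutorial can be sketched in a few lines of plain code (illustrative rate constants, dose, and names; the paper's implementation is in LabVIEW): an IV bolus enters the central compartment, k12 and k21 govern distribution to and from the peripheral compartment, and k10 governs elimination.

        #include <cstdio>

        int main() {
            double A1 = 100.0, A2 = 0.0;           // drug amounts: central, peripheral (mg)
            const double k10 = 0.10;               // elimination rate constant (1/h), example value
            const double k12 = 0.05, k21 = 0.03;   // distribution rate constants (1/h), example values
            const double V1  = 20.0;               // central volume of distribution (L), example value
            const double dt  = 0.01;               // Euler time step (h)
            const int steps  = static_cast<int>(24.0 / dt);
            for (int n = 0; n <= steps; ++n) {
                if (n % 100 == 0)                  // report the plasma concentration hourly
                    std::printf("t = %4.1f h   C1 = %6.3f mg/L\n", n * dt, A1 / V1);
                const double dA1 = (-(k10 + k12) * A1 + k21 * A2) * dt;
                const double dA2 = (k12 * A1 - k21 * A2) * dt;
                A1 += dA1;
                A2 += dA2;
            }
            return 0;
        }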

  19. Development, implementation, and evaluation of a community- and hospital-based respiratory syncytial virus prophylaxis program.

    PubMed

    Bracht, Marianne; Heffer, Michael; O'Brien, Karel

    2005-02-01

    The objective was to implement and deliver a respiratory syncytial virus prophylaxis (RSVP) program in response to the Canadian Pediatric Society recommendations. A novel program was designed to provide inpatient RSVP for at-risk infants cared for in one tertiary care newborn intensive care unit (NICU). This inpatient program was part of a coordinated approach to RSVP designed and implemented by three hospitals. An RSVP program logic model was created and used by a multidisciplinary team to evaluate the in-house program and identify areas of program activity requiring improvement. Following the 2000 to 2001 RSV season, a compliance and outcomes audit was performed in the tertiary center; 193 infants were enrolled in the RSVP program and 162 infants had received RSVP in the NICU (mean = 1.64 doses). Telephone follow-up with the parents of discharged infants found that 159 infants (98%) had successfully completed their full course of RSVP. Using the RSVP program logic model, five areas for program improvement were identified: infant recruitment, patient transfer/discharge processes, product procurement, preparation/distribution/administration of doses, and healthcare team communication. Interdisciplinary collaboration is an important factor in the success of the RSVP program and has supported a consistent model of care for the delivery of RSVP. The program logic model provided a useful structure for systematically reviewing the RSVP program in this organization.

  20. An Exemplary Program in Higher Education for Chemists, Engineers, and Chemistry Teachers.

    ERIC Educational Resources Information Center

    Ayers, Jerry B.; And Others

    This paper presents the rationale, structure, and specifications for a model program for the preparation of chemists, chemical engineers, and high school chemistry teachers. The model (an application of systems technology to program development in higher education) is based on the structure provided by the Georgia Educational Model Specifications…
